Installing and Using Impala
Important Notice
(c) 2010-2015 Cloudera, Inc. All rights reserved.
Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service
names or slogans contained in this document are trademarks of Cloudera and its
suppliers or licensors, and may not be copied, imitated or used, in whole or in part,
without the prior written permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software
Foundation. All other trademarks, registered trademarks, product names and
company names or logos mentioned in this document are the property of their
respective owners. Reference to any products, services, processes or other
information, by trade name, trademark, manufacturer, supplier or otherwise does
not constitute or imply endorsement, sponsorship or recommendation thereof by
us.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced,
stored in or introduced into a retrieval system, or transmitted in any form or by any
means (electronic, mechanical, photocopying, recording, or otherwise), or for any
purpose, without the express written permission of Cloudera.
Cloudera may have patents, patent applications, trademarks, copyrights, or other
intellectual property rights covering subject matter in this document. Except as
expressly provided in any written license agreement from Cloudera, the furnishing
of this document does not give you any license to these patents, trademarks,
copyrights, or other intellectual property. For information about patents covering
Cloudera products, see http://tiny.cloudera.com/patents.
The information in this document is subject to change without notice. Cloudera
shall not be liable for any damages resulting from technical errors or omissions
which may be present in this document, or from use of this document.
Cloudera, Inc.
1001 Page Mill Road Bldg 2
Palo Alto, CA 94304
[email protected]
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Release Information
Version: 1.4.x
Date: September 8, 2015
Table of Contents
Introducing Cloudera Impala..................................................................................11
Impala Benefits..........................................................................................................................................11
How Cloudera Impala Works with CDH...................................................................................................11
Primary Impala Features..........................................................................................................................12
Impala Tutorial.........................................................................................................21
Tutorials for Getting Started.....................................................................................................................21
Set Up Some Basic .csv Tables..............................................................................................................................21
Point an Impala Table at Existing Data Files.......................................................................................................23
Describe the Impala Table.....................................................................................................................................25
Query the Impala Table..........................................................................................................................................25
Data Loading and Querying Examples.................................................................................................................26
Advanced Tutorials....................................................................................................................................28
Attaching an External Partitioned Table to an HDFS Directory Structure.......................................................28
Switching Back and Forth Between Impala and Hive........................................................................................30
Cross Joins and Cartesian Products with the CROSS JOIN Operator................................................................31
Impala Administration............................................................................................33
Admission Control and Query Queuing...................................................................................................33
Overview of Impala Admission Control................................................................................................................33
How Impala Admission Control Relates to YARN...............................................................................................34
How Impala Schedules and Enforces Limits on Concurrent Queries...............................................................34
How Admission Control works with Impala Clients (JDBC, ODBC, HiveServer 2).............................................35
Configuring Admission Control.............................................................................................................................35
Guidelines for Using Admission Control..............................................................................................................40
Literals........................................................................................................................................................63
Numeric Literals.....................................................................................................................................................63
String Literals..........................................................................................................................................................63
Boolean Literals......................................................................................................................................................64
Timestamp Literals.................................................................................................................................................64
NULL.........................................................................................................................................................................64
SQL Operators............................................................................................................................................65
Arithmetic Operators..............................................................................................................................................65
BETWEEN Operator................................................................................................................................................66
Comparison Operators...........................................................................................................................................67
IN Operator..............................................................................................................................................................67
IS NULL Operator....................................................................................................................................................67
LIKE Operator..........................................................................................................................................................68
Logical Operators....................................................................................................................................................68
REGEXP Operator...................................................................................................................................................70
RLIKE Operator........................................................................................................................................................71
SQL Statements.........................................................................................................................................78
DDL Statements.....................................................................................................................................................78
DML Statements.....................................................................................................................................................79
ALTER TABLE Statement.......................................................................................................................................79
ALTER VIEW Statement.........................................................................................................................................83
COMPUTE STATS Statement..................................................................................................................................84
CREATE DATABASE Statement.............................................................................................................................87
CREATE FUNCTION Statement..............................................................................................................................88
CREATE TABLE Statement.....................................................................................................................................90
CREATE VIEW Statement.......................................................................................................................................95
DESCRIBE Statement.............................................................................................................................................96
DROP DATABASE Statement...............................................................................................................................100
DROP FUNCTION Statement...............................................................................................................................101
DROP TABLE Statement......................................................................................................................................101
DROP VIEW Statement........................................................................................................................................102
EXPLAIN Statement.............................................................................................................................................103
INSERT Statement................................................................................................................................................105
INVALIDATE METADATA Statement...................................................................................................................111
LOAD DATA Statement........................................................................................................................................114
REFRESH Statement............................................................................................................................................116
SELECT Statement................................................................................................................................................118
SHOW Statement.................................................................................................................................................135
USE Statement......................................................................................................................................................138
Built-in Functions....................................................................................................................................138
Mathematical Functions......................................................................................................................................139
Type Conversion Functions.................................................................................................................................145
Partitioning.............................................................................................................233
When to Use Partitioned Tables............................................................................................................233
SQL Statements for Partitioned Tables................................................................................................233
Static and Dynamic Partitioning Clauses..............................................................................................234
Permissions for Partition Subdirectories.............................................................................................234
Partition Pruning for Queries.................................................................................................................234
Partition Key Columns............................................................................................................................236
Setting Different File Formats for Partitions.......................................................................................236
Impala Benefits
Impala provides:
Familiar SQL interface that data scientists and analysts already know
Ability to interactively query data on big data in Apache Hadoop
Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective
commodity hardware
Ability to share data files between different components with no copy or export/import step; for example,
to write with Pig and read with Impala, or to write with Impala and read with Hive
Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just
for analytics
With these options, you can use Impala in heterogeneous environments, with JDBC or ODBC applications running
on non-Linux platforms. You can also use Impala in combination with various Business Intelligence tools that
use the JDBC and ODBC interfaces.
Each impalad daemon process, running on separate nodes in a cluster, listens to several ports for incoming
requests. Requests from impala-shell and Hue are routed to the impalad daemons through the same port.
The impalad daemons listen on separate ports for JDBC and ODBC requests.
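For example, to connect impala-shell to a specific impalad host, name the host (and optionally the port) on the
command line; the host name below is only a placeholder:

# impala-shell and Hue use the impalad frontend port, 21000 by default.
$ impala-shell -i impala-host-1.example.com:21000
# JDBC and ODBC applications connect to a different impalad port, 21050 by default.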
Data Size   1 query   10 queries   100 queries   1000 queries   2000 queries
250 GB                                           35             70
500 GB                             10            70             135
1 TB                               15            135            270
15 TB                 20           200           N/A            N/A
30 TB                 40           400           N/A            N/A
60 TB                 80           800           N/A            N/A
Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is required
to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100 / 15 * 60 = 400. We can estimate
the number of nodes using the equation above.
N > QPM * D / 100GB
N > 400 * 50GB / 100GB
N > 200
Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.
Depending on the complexity of the query, the processing rate might change. If the query has more
joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the processing
rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scans and filtering on
numeric columns, the processing rate can be higher.
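As a quick sanity check, you can redo this arithmetic in the shell. The numbers below are the ones assumed in the
example above (100 concurrent queries, a 15-second target, and 50 GB scanned per query); substitute your own
workload figures.

# N > QPM * D / 100GB, where QPM = concurrent queries * 60 / target response time.
QPM=$(( 100 * 60 / 15 ))       # 400 queries per minute
NODES=$(( QPM * 50 / 100 ))    # D = 50 GB scanned per query
echo "QPM=${QPM}, estimated minimum nodes=${NODES}"   # QPM=400, estimated minimum nodes=200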
Estimating Memory Requirements
Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the joined
tables, using the COMPUTE STATS statement. However, joining big tables does consume more memory. Follow
the steps below to calculate the minimum memory requirement.
Suppose you are running the following join:
select a.*
from a, b
where
    a.key = b.key
    and b.col_1 in (...)
    and b.col_4 in (...);
And suppose table B is smaller than table A (but still a large table).
The memory requirement for the query is that the size of the right-hand table (B), after decompression, filtering
(b.col_n in ...), and projection (only using certain columns), must be less than the total memory of the entire
cluster:
Cluster Total Memory Requirement = Size of the smaller table *
selectivity factor from the predicate *
projection factor * compression ratio
In this case, assume that table B is 100 TB in Parquet format with 200 columns. The predicate on B (b.col_1
in ... and b.col_4 in ...) will select only 10% of the rows from B, and the projection keeps only a small fraction
of the 200 columns. Plugging these factors, together with the compression ratio of the Parquet data, into the
formula above gives a total memory requirement on the order of 750 GB for this query.
So, if you have a 10-node cluster where each node has 128 GB of RAM and you give 80% of that memory to Impala,
then you have about 1 TB of usable memory for Impala, which is more than 750 GB. Therefore, your cluster can
handle join queries of this magnitude.
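The same arithmetic can be checked in the shell. The 100 TB table size, 10% selectivity, and the 10-node / 128 GB /
80% figures come from the example above; the projection and compression factors here are only placeholders for
whatever your own query and file format give you.

# Cluster Total Memory Requirement =
#   size of smaller table * selectivity factor * projection factor * compression ratio
# Assumed here: ~2.5% of the columns are used and data expands ~3x when decompressed.
TABLE_B_GB=$(( 100 * 1024 ))                              # 100 TB right-hand table
REQUIRED_GB=$(( TABLE_B_GB * 10 / 100 * 25 / 1000 * 3 ))  # 10% of rows, 2.5% of columns, 3x expansion
AVAILABLE_GB=$(( 10 * 128 * 80 / 100 ))                   # 10 nodes * 128 GB * 80% for Impala
echo "required=${REQUIRED_GB} GB, available=${AVAILABLE_GB} GB"   # required=768 GB, available=1024 GB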
Impala Tutorial
This section includes tutorial scenarios that demonstrate how to begin using Impala once the software is
installed. It focuses on techniques for loading data, because once you have some data in tables and can query
that data, you can quickly progress to more advanced Impala features.
Note:
Where practical, the tutorials take you from ground zero to having the desired Impala tables and
data. In some cases, you might need to download additional files from outside sources, set up additional
software components, modify commands or scripts to fit your own configuration, or substitute your
own sample data.
Before trying these tutorial lessons, install Impala:
If you already have a CDH environment set up and just need to add Impala to it, follow the installation process
described in Impala Installation. Make sure to also install Hive and its associated metastore database if you
do not already have Hive configured.
To set up Impala and all its prerequisites at once, in a minimal configuration that you can use for experiments
and then discard, set up the Cloudera QuickStart VM, which includes CDH and Impala on CentOS 6.3 (64-bit).
For more information, see the Cloudera QuickStart VM.
Here is some sample data, for two tables named TAB1 and TAB2.
Copy the following content to .csv files in your local filesystem:
tab1.csv:
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
tab2.csv:
1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325
Put each .csv file into a separate HDFS directory using commands like the following, which use paths available
in the Impala Demo VM:
$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
$ hdfs dfs -ls /user/cloudera/sample_data/tab1
Found 1 items
-rw-r--r--   1 cloudera cloudera        192 2013-04-02 20:08 /user/cloudera/sample_data/tab1/tab1.csv
$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2
$ hdfs dfs -ls /user/cloudera/sample_data/tab2
Found 1 items
-rw-r--r--   1 cloudera cloudera        158 2013-04-02 20:09 /user/cloudera/sample_data/tab2/tab2.csv
The name of each data file is not significant. In fact, when Impala examines the contents of the data directory
for the first time, it considers all files in the directory to make up the data of the table, regardless of how many
files there are or what the files are named.
To understand what paths are available within your own HDFS filesystem and what the permissions are for the
various directories and files, issue hdfs dfs -ls / and work your way down the tree doing -ls operations for
the various directories.
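For example, starting at the root and descending toward the directory used later in this tutorial:

$ hdfs dfs -ls /
$ hdfs dfs -ls /user
$ hdfs dfs -ls /user/cloudera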
Use the impala-shell command to create tables, either interactively or through a SQL script.
The following example shows creating three tables. For each table, the example shows creating columns with
various attributes such as Boolean or integer types. The example also includes commands that provide information
about how the data is formatted, such as rows terminating with commas, which makes sense in the case of
importing data from a .csv file. Where we already have .csv files containing data in the HDFS directory tree,
we specify the location of the directory containing the appropriate .csv file. Impala considers all the data from
all the files in that directory to represent the data for the table.
DROP TABLE IF EXISTS tab1;
-- The EXTERNAL clause means the data is located outside the central location
-- for Impala data files and is preserved when the associated Impala table is dropped.
-- We expect the data to already exist in the directory specified by the LOCATION clause.
CREATE EXTERNAL TABLE tab1
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab1';
DROP TABLE IF EXISTS tab2;
-- TAB2 is an external table, similar to TAB1.
CREATE EXTERNAL TABLE tab2
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab2';
DROP TABLE IF EXISTS tab3;
-- Leaving out the EXTERNAL clause means the data will be managed
-- in the central Impala data directory tree. Rather than reading
-- existing data files when the table is created, we load the
-- data after creating the table.
CREATE TABLE tab3
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
month INT,
day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Note: Getting through these CREATE TABLE statements successfully is an important validation step
to confirm everything is configured correctly with the Hive metastore and HDFS permissions. If you
receive any errors during the CREATE TABLE statements:
Make sure you followed the installation instructions closely, in Impala Installation.
Make sure the hive.metastore.warehouse.dir property points to a directory that Impala can
write to. The ownership should be hive:hive, and the impala user should also be a member of
the hive group.
If the value of hive.metastore.warehouse.dir is different in the Cloudera Manager dialogs and
in the Hive shell, you might need to designate the hosts running impalad with the gateway role
for Hive, and deploy the client configuration files to those hosts.
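A quick way to check the ownership and permissions of the warehouse directory from the command line (the path
shown is the common default; substitute your own hive.metastore.warehouse.dir value if it differs):

# The "warehouse" entry should be owned by hive:hive and writable by the impala user.
$ hdfs dfs -ls /user/hive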
2|AAAAAAAACAAAAAAA|819667|1461|31655|2452318|2452288|Dr.|Amy|Moses|Y|9|4|1966|TOGO||[email protected]|2452318|
3|AAAAAAAADAAAAAAA|1473522|6247|48572|2449130|2449100|Miss|Latisha|Hamilton|N|18|9|1979|NIUE||[email protected]|2452313|
4|AAAAAAAAEAAAAAAA|1703214|3986|39558|2450030|2450000|Dr.|Michael|White|N|7|6|1983|MEXICO||[email protected]|2452361|
5|AAAAAAAAFAAAAAAA|953372|4470|36368|2449438|2449408|Sir|Robert|Moran|N|8|5|1956|FIJI||[email protected]|2452469|
...
Note:
Currently, the impala-shell interpreter requires that any command entered interactively be a single
line, so if you experiment with these commands yourself, either save to a .sql file and use the -f
option to run the script, or wrap each command onto one line before pasting into the shell.
50000
Returned 1 row(s) in 0.19s
Passing a single command to the impala-shell command. The query is executed, the results are returned,
and the shell exits. Make sure to quote the command, preferably with single quotation marks to avoid shell
expansion of characters such as *.
$ impala-shell -i impala-host -q 'select count(*) from customer_address'
Connected to localhost:21000
50000
Returned 1 row(s) in 0.29s
Loading Data
Loading data involves:
Establishing a data set. The example below uses .csv files.
Creating tables into which to load data.
Loading the data into the tables you created, for example with an INSERT ... SELECT statement like the one
sketched after this list.
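For example, the MONTH and DAY columns of TAB3 can be derived from the TIMESTAMP column of TAB1 once TAB1
is populated. This is only one possible approach, sketched to match the TAB3 query results shown later in this
section:

-- Derive TAB3's month and day columns from TAB1's TIMESTAMP column.
-- The WHERE clause keeps only the 2012 rows, matching the TAB3 results shown below.
INSERT OVERWRITE TABLE tab3
  SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
  FROM tab1
  WHERE YEAR(col_3) = 2012;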
Sample Queries
To run these sample queries, create a SQL query file query.sql, copy and paste each query into the query file,
and then run the query file using the shell. For example, to run query.sql on impala-host, you might use the
command:
impala-shell -i impala-host -f query.sql
The examples and results below assume you have loaded the sample data into the tables as described above.
Example: Examining Contents of Tables
Let's start by verifying that the tables do contain the data we expect. Because Impala often deals with tables
containing millions or billions of rows, when examining tables of unknown size, include the LIMIT clause to
avoid huge amounts of unnecessary output, as in the final query. (If your interactive query starts displaying an
unexpected volume of data, press Ctrl-C in impala-shell to cancel the query.)
SELECT * FROM tab1;
SELECT * FROM tab2;
SELECT * FROM tab2 LIMIT 5;
Results:
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
+----+-------+---------------+
| id | col_1 | col_2         |
+----+-------+---------------+
| 1  | true  | 12789.123     |
| 2  | false | 1243.5        |
| 3  | false | 24453.325     |
| 4  | false | 2423.3254     |
| 5  | true  | 243.325       |
| 60 | false | 243565423.325 |
| 70 | true  | 243.325       |
| 80 | false | 243423.325    |
| 90 | true  | 243.325       |
+----+-------+---------------+
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 2  | false | 1243.5    |
| 3  | false | 24453.325 |
| 4  | false | 2423.3254 |
| 5  | true  | 243.325   |
+----+-------+-----------+
Results:
+-------+-----------------+-----------------+
| col_1 | max(tab2.col_2) | min(tab2.col_2) |
+-------+-----------------+-----------------+
| false | 24453.325       | 1243.5          |
| true  | 12789.123       | 243.325         |
+-------+-----------------+-----------------+
Results:
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 3  | false | 24453.325 |
+----+-------+-----------+
Impala Tutorial
Results:
+----+-------+---------+-------+-----+
| id | col_1 | col_2   | month | day |
+----+-------+---------+-------+-----+
| 1  | true  | 123.123 | 10    | 24  |
| 2  | false | 1243.5  | 10    | 25  |
+----+-------+---------+-------+-----+
Advanced Tutorials
These tutorials walk you through advanced scenarios or specialized features.
Attaching an External Partitioned Table to an HDFS Directory Structure
insert into logs partition (year="2013", month="07", day="28", host="host1") values ("foo","foo","foo");
insert into logs partition (year="2013", month="07", day="28", host="host2") values ("foo","foo","foo");
insert into logs partition (year="2013", month="07", day="29", host="host1") values ("foo","foo","foo");
insert into logs partition (year="2013", month="08", day="01", host="host1") values ("foo","foo","foo");
Back in the Linux shell, we examine the HDFS directory structure. (Your Impala data directory might be in a
different location; for historical reasons, it is sometimes under the HDFS path /user/hive/warehouse.) We
use the hdfs dfs -ls command to examine the nested subdirectories corresponding to each partitioning
column, with separate subdirectories at each level (with = in their names) representing the different values for
each partitioning column. When we get to the lowest level of subdirectory, we use the hdfs dfs -cat command
to examine the data file and see CSV-formatted data produced by the INSERT statement in Impala.
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db
Found 1 items
drwxrwxrwt   - impala hive          0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs
Found 1 items
drwxr-xr-x   - impala hive          0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs/year=2013
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013
Found 2 items
drwxr-xr-x   - impala hive          0 2013-08-07 12:23 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07
drwxr-xr-x   - impala hive          0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=08
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07
Found 2 items
drwxr-xr-x   - impala hive          0 2013-08-07 12:22 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28
drwxr-xr-x   - impala hive          0 2013-08-07 12:23 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=29
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28
Found 2 items
drwxr-xr-x   - impala hive          0 2013-08-07 12:21 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1
drwxr-xr-x   - impala hive          0 2013-08-07 12:22 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host2
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1
Found 1 items
-rw-r--r--   3 impala hive         12 2013-08-07 12:21 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1/3981726974111751120--8907184999369517436_822630111_data.0
$ hdfs dfs -cat /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1/3981726974111751120--8907184999369517436_822630111_data.0
foo,foo,foo
Still in the Linux shell, we use hdfs dfs -mkdir to create several data directories outside the HDFS directory
tree that Impala controls (/user/impala/warehouse in this example, maybe different in your case). Depending
on your configuration, you might need to log in as a user with permission to write into this HDFS directory tree;
for example, the commands shown here were run while logged in as the hdfs user.
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host1
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host2
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host1
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=29/host=host1
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=08/day=01/host=host1
We make a tiny CSV file, with values different than in the INSERT statements used earlier, and put a copy within
each subdirectory that we will use as an Impala partition.
$ cat >dummy_log_data
bar,baz,bletch
$ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=08/day=01/host=host1
$ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=07/day=28/host=host1
$ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=07/day=28/host=host2
$ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=07/day=29/host=host1
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host1
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host2
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=29/host=host1
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=08/day=01/host=host1
Back in the impala-shell interpreter, we move the original Impala-managed table aside, and create a new
external table with a LOCATION clause pointing to the directory under which we have set up all the partition
subdirectories and data files.
use external_partitions;
alter table logs rename to logs_original;
create external table logs (field1 string, field2 string, field3 string)
partitioned by (year string, month string, day string, host string)
row format delimited fields terminated by ','
location '/user/impala/data/logs';
Because partition subdirectories and data files come and go during the data lifecycle, you must identify each of
the partitions through an ALTER TABLE statement before Impala recognizes the data files they contain.
alter table logs add partition (year="2013",month="07",day="28",host="host1");
alter table logs add partition (year="2013",month="07",day="28",host="host2");
alter table logs add partition (year="2013",month="07",day="29",host="host1");
alter table logs add partition (year="2013",month="08",day="01",host="host1");
We issue a REFRESH statement for the table, always a safe practice when data files have been manually added,
removed, or changed. Then the data is ready to be queried. The SELECT * statement illustrates that the data
from our trivial CSV file was recognized in each of the partitions where we copied it. Although in this case there
are only a few rows, we include a LIMIT clause on this test query just in case there is more data than we expect.
refresh logs;
select * from logs limit 100;
+--------+--------+--------+------+-------+-----+-------+
| field1 | field2 | field3 | year | month | day | host  |
+--------+--------+--------+------+-------+-----+-------+
| bar    | baz    | bletch | 2013 | 07    | 28  | host1 |
| bar    | baz    | bletch | 2013 | 08    | 01  | host1 |
| bar    | baz    | bletch | 2013 | 07    | 29  | host1 |
| bar    | baz    | bletch | 2013 | 07    | 28  | host2 |
+--------+--------+--------+------+-------+-----+-------+
Cross Joins and Cartesian Products with the CROSS JOIN Operator
Originally, Impala restricted join queries so that they had to include at least one equality comparison between
the columns of the tables on each side of the join operator. With the huge tables typically processed by Impala,
any miscoded query that produced a full Cartesian product as a result set could consume a huge amount of
cluster resources.
In Impala 1.2.2 and higher, this restriction is lifted when you use the CROSS JOIN operator in the query. You still
cannot remove all WHERE clauses from a query like SELECT * FROM t1 JOIN t2 to produce all combinations
of rows from both tables. But you can use the CROSS JOIN operator to explicitly request such a Cartesian product.
Typically, this operation is applicable for smaller tables, where the result set still fits within the memory of a
single Impala node.
The following example sets up data for use in a series of comic books where characters battle each other. At
first, we use an equijoin query, which only allows characters from the same time period and the same planet to
meet.
[localhost:21000] > create table heroes (name string, era string, planet string);
[localhost:21000] > create table villains (name string, era string, planet string);
[localhost:21000] > insert into heroes values
> ('Tesla','20th century','Earth'),
> ('Pythagoras','Antiquity','Earth'),
> ('Zopzar','Far Future','Mars');
Inserted 3 rows in 2.28s
[localhost:21000] > insert into villains values
> ('Caligula','Antiquity','Earth'),
> ('John Dillinger','20th century','Earth'),
> ('Xibulor','Far Future','Venus');
Inserted 3 rows in 1.93s
[localhost:21000] > select concat(heroes.name,' vs. ',villains.name) as battle
> from heroes join villains
> where heroes.era = villains.era and heroes.planet = villains.planet;
+--------------------------+
| battle                   |
+--------------------------+
| Tesla vs. John Dillinger |
| Pythagoras vs. Caligula  |
+--------------------------+
Returned 2 row(s) in 0.47s
Readers demanded more action, so we added elements of time travel and space travel so that any hero could
face any villain. Prior to Impala 1.2.2, this type of query was impossible because all joins had to reference matching
values between the two tables:
[localhost:21000] > -- Cartesian product not possible in Impala 1.1.
> select concat(heroes.name,' vs. ',villains.name) as battle from
heroes join villains;
ERROR: NotImplementedException: Join between 'heroes' and 'villains' requires at least
one conjunctive equality predicate between the two tables
With Impala 1.2.2, we rewrite the query slightly to use CROSS JOIN rather than JOIN, and now the result set
includes all combinations:
[localhost:21000] > -- Cartesian product available in Impala 1.2.2 with the CROSS JOIN
syntax.
> select concat(heroes.name,' vs. ',villains.name) as battle from
heroes cross join villains;
+-------------------------------+
| battle                        |
+-------------------------------+
| Tesla vs. Caligula            |
| Tesla vs. John Dillinger      |
| Tesla vs. Xibulor             |
| Pythagoras vs. Caligula       |
| Pythagoras vs. John Dillinger |
| Pythagoras vs. Xibulor        |
| Zopzar vs. Caligula           |
| Zopzar vs. John Dillinger     |
| Zopzar vs. Xibulor            |
+-------------------------------+
Returned 9 row(s) in 0.33s
The full combination of rows from both tables is known as the Cartesian product. This type of result set is often
used for creating grid data structures. You can also filter the result set by including WHERE clauses that do not
explicitly compare columns between the two tables. The following example shows how you might produce a list
of combinations of year and quarter for use in a chart, and then a shorter list with only selected quarters.
[localhost:21000] > create table x_axis (x int);
[localhost:21000] > create table y_axis (y int);
[localhost:21000] > insert into x_axis values (1),(2),(3),(4);
Inserted 4 rows in 2.14s
[localhost:21000] > insert into y_axis values (2010),(2011),(2012),(2013),(2014);
Inserted 5 rows in 1.32s
[localhost:21000] > select y as year, x as quarter from x_axis cross join y_axis;
+------+---------+
| year | quarter |
+------+---------+
| 2010 | 1       |
| 2011 | 1       |
| 2012 | 1       |
| 2013 | 1       |
| 2014 | 1       |
| 2010 | 2       |
| 2011 | 2       |
| 2012 | 2       |
| 2013 | 2       |
| 2014 | 2       |
| 2010 | 3       |
| 2011 | 3       |
| 2012 | 3       |
| 2013 | 3       |
| 2014 | 3       |
| 2010 | 4       |
| 2011 | 4       |
| 2012 | 4       |
| 2013 | 4       |
| 2014 | 4       |
+------+---------+
Returned 20 row(s) in 0.38s
[localhost:21000] > select y as year, x as quarter from x_axis cross join y_axis where
x in (1,3);
+------+---------+
| year | quarter |
+------+---------+
| 2010 | 1       |
| 2011 | 1       |
| 2012 | 1       |
| 2013 | 1       |
| 2014 | 1       |
| 2010 | 3       |
| 2011 | 3       |
| 2012 | 3       |
| 2013 | 3       |
| 2014 | 3       |
+------+---------+
Returned 10 row(s) in 0.39s
Impala Administration
As an administrator, you monitor Impala's use of resources and take action when necessary to keep Impala
running smoothly and avoid conflicts with other Hadoop components running on the same cluster. When you
detect that an issue has happened or could happen in the future, you reconfigure Impala or other components
such as HDFS or even the hardware of the cluster itself to resolve or avoid problems.
Related tasks:
As an administrator, you can expect to perform installation, upgrade, and configuration tasks for Impala on all
machines in a cluster. See Impala Installation, Upgrading Impala, and Configuring Impala for details.
For additional security tasks typically performed by administrators, see Impala Security Configuration.
For a detailed example of configuring a cluster to share resources between Impala queries and MapReduce jobs,
see Setting up a Multi-tenant Cluster for Impala and MapReduce.
The admission control feature lets you set a cluster-wide upper limit on the number of concurrent Impala queries
and on the memory used by those queries. Any additional queries are queued until the earlier ones finish, rather
than being cancelled or running slowly and causing contention. As other queries finish, the queued queries are
allowed to proceed.
For details on the internal workings of admission control, see How Impala Schedules and Enforces Limits on
Concurrent Queries on page 34.
of time so that the statements controlled by the queueing system are primarily queries, where order is not
significant. Or, if a sequence of statements needs to happen in strict order (such as an INSERT followed by a
SELECT), submit all those statements through a single session, while connected to the same impalad node.
The limit on the number of concurrent queries is a soft one. To achieve high throughput, Impala makes quick
decisions at the node level about which queued queries to dispatch. Therefore, Impala might slightly exceed the
limit from time to time.
To avoid a large backlog of queued requests, you can also set an upper limit on the size of the queue for queries
that are delayed. When the number of queued queries exceeds this limit, further queries are cancelled rather
than being queued. You can also configure a timeout period, after which queued queries are cancelled, to avoid
indefinite waits. If a cluster reaches this state where queries are cancelled due to too many concurrent requests
or long waits for query execution to begin, that is a signal for an administrator to take action, either by provisioning
more resources, scheduling work on the cluster to smooth out the load, or by doing Impala performance tuning
to enable higher throughput.
How Admission Control works with Impala Clients (JDBC, ODBC, HiveServer 2)
Most aspects of admission control work transparently with client interfaces such as JDBC and ODBC:
If a SQL statement is put into a queue rather than running immediately, the API call blocks until the statement
is dequeued and begins execution. At that point, the client program can request to fetch results, which might
also block until results become available.
If a SQL statement is cancelled because it has been queued for too long or because it exceeded the memory
limit during execution, the error is returned to the client program with a descriptive error message.
Admission control has the following limitations or special behavior when used with JDBC or ODBC applications:
If you want to submit queries to different resource pools through the REQUEST_POOL query option, as described
in REQUEST_POOL on page 201, that option is only settable for a session through the impala-shell interpreter
or cluster-wide through an impalad startup option.
The MEM_LIMIT query option, sometimes useful to work around problems caused by inaccurate memory
estimates for complicated queries, is only settable through the impala-shell interpreter and cannot be
used directly through JDBC or ODBC applications.
Admission control does not use the other resource-related query options, RESERVATION_REQUEST_TIMEOUT
or V_CPU_CORES. Those query options only apply to the YARN resource management framework.
Note: Because Cloudera Manager 5 includes a GUI for these settings but Cloudera Manager 4 does
not, if you are using Cloudera Manager 4, include the appropriate configuration options in the impalad
command-line options safety valve field.
For a straightforward configuration using a single resource pool named default, you can specify configuration
options on the command line and skip the fair-scheduler.xml and llama-site.xml configuration files.
The impalad configuration options related to the admission control feature are listed below; a sketch of a simple
single-pool configuration follows the list.
--default_pool_max_queued
Purpose: Maximum number of requests allowed to be queued before rejecting requests. Because this
limit applies cluster-wide, but each Impala node makes independent decisions to run queries immediately
or queue them, it is a soft limit; the overall number of queued queries might be slightly higher during
times of heavy load. A negative value or 0 indicates requests are always rejected once the maximum
concurrent requests are executing. Ignored if fair_scheduler_config_path and llama_site_path
are set.
Type: int64
Default: 0
--default_pool_max_requests
Purpose: Maximum number of concurrent outstanding requests allowed to run before incoming requests
are queued. Because this limit applies cluster-wide, but each Impala node makes independent decisions
to run queries immediately or queue them, it is a soft limit; the overall number of concurrent queries
might be slightly higher during times of heavy load. A negative value indicates no limit. Ignored if
fair_scheduler_config_path and llama_site_path are set.
Type: int64
Default: -1
--default_pool_mem_limit
Purpose: Maximum amount of memory that all outstanding requests in this pool can use before new
requests to this pool are queued. Specified in bytes, megabytes, or gigabytes by a number followed by
the suffix b (optional), m, or g, either upper- or lowercase. You can specify floating-point values for
megabytes and gigabytes, to represent fractional numbers such as 1.5. You can also specify it as a
percentage of the physical memory by specifying the suffix %. 0 or no setting indicates no limit. Defaults
to bytes if no unit is given. Because this limit applies cluster-wide, but each Impala node makes
independent decisions to run queries immediately or queue them, it is a soft limit; the overall memory
used by concurrent queries might be slightly higher during times of heavy load. Ignored if
fair_scheduler_config_path and llama_site_path are set.
Note: Impala relies on the statistics produced by the COMPUTE STATS statement to estimate
memory usage for each query. See COMPUTE STATS Statement on page 84 for guidelines
about how and when to use this statement.
Type: string
Default: "" (empty string, meaning unlimited)
--disable_admission_control
Purpose: Turns off the admission control feature entirely, regardless of other configuration option settings.
Type: Boolean
Default: false
--disable_pool_max_requests
Purpose: Disables all per-pool limits on the maximum number of running requests.
Type: Boolean
Default: false
--disable_pool_mem_limits
Purpose: Disables all per-pool memory limits.
Type: Boolean
Default: false
--queue_wait_timeout_ms
Purpose: Maximum amount of time (in milliseconds) that a request waits to be admitted before timing
out.
Type: int64
Default: 60000
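For example, on a package-based installation where impalad startup options are set through /etc/default/impala,
a minimal single-pool configuration might look like the following sketch; the specific limits are only illustrative:

# Hypothetical fragment of /etc/default/impala: run up to 10 queries at a time,
# queue up to 50 more, and cap the default pool at 80% of the memory given to Impala.
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
    -default_pool_max_requests=10 \
    -default_pool_max_queued=50 \
    -default_pool_mem_limit=80%"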
For an advanced configuration with multiple resource pools using different settings, set up the
fair-scheduler.xml and llama-site.xml configuration files manually. Provide the paths to each one using
the impalad command-line options, --fair_scheduler_allocation_path and --llama_site_path
respectively.
The Impala admission control feature only uses the Fair Scheduler configuration settings to determine how to
map users and groups to different resource pools. For example, you might set up different resource pools with
separate memory limits, and maximum number of concurrent and queued queries, for different categories of
users within your organization. For details about all the Fair Scheduler configuration settings, see the Apache
wiki.
The Impala admission control feature only uses a small subset of possible settings from the llama-site.xml
configuration file:
llama.am.throttling.maximum.placed.reservations.queue_name
llama.am.throttling.maximum.queued.reservations.queue_name
For details about all the Llama configuration settings, see the documentation on Github.
See Examples of Admission Control Configurations on page 38 for sample configuration files for admission
control using multiple resource pools, without Cloudera Manager.
Examples of Admission Control Configurations
For full instructions about configuring dynamic resource pools through Cloudera Manager, see Dynamic Resource
Pools in the Cloudera Manager documentation. The following examples demonstrate some important points
related to the Impala admission control feature.
The following figure shows a sample of the Dynamic Resource Pools page in Cloudera Manager, accessed through
the Clusters > ClusterName > Other > Dynamic Resource Pools > Configuration menu choice. Numbers from all
the resource pools are combined into the topmost root pool. The default pool is for users who are not assigned
any other pool by the user-to-pool mapping settings. The development and production pools show how you
can set different limits for different classes of users, for total memory, number of concurrent queries, and number
of queries that can be queued.
Figure 1: Sample Settings for Cloudera Manager Dynamic Resource Pools Page
The following figure shows a sample of the Placement Rules page in Cloudera Manager, accessed through the
Clusters > ClusterName > Other > Dynamic Resource Pools > Configuration > Placement Rules menu choice.
The settings demonstrate a reasonable configuration of a pool named default to service all requests where
the specified resource pool does not exist, is not explicitly set, or the user or group is not authorized for the
specified pool.
<queue name="default">
<maxResources>50000 mb, 0 vcores</maxResources>
<aclSubmitApps>*</aclSubmitApps>
</queue>
<queue name="development">
<maxResources>200000 mb, 0 vcores</maxResources>
<aclSubmitApps>user1,user2 dev,ops,admin</aclSubmitApps>
</queue>
<queue name="production">
<maxResources>1000000 mb, 0 vcores</maxResources>
<aclSubmitApps> ops,admin</aclSubmitApps>
</queue>
</queue>
<queuePlacementPolicy>
<rule name="specified" create="false"/>
<rule name="default" />
</queuePlacementPolicy>
</allocations>
llama-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>llama.am.throttling.maximum.placed.reservations.root.default</name>
<value>10</value>
</property>
<property>
<name>llama.am.throttling.maximum.queued.reservations.root.default</name>
<value>50</value>
</property>
<property>
<name>llama.am.throttling.maximum.placed.reservations.root.development</name>
<value>50</value>
</property>
<property>
<name>llama.am.throttling.maximum.queued.reservations.root.development</name>
<value>100</value>
</property>
<property>
<name>llama.am.throttling.maximum.placed.reservations.root.production</name>
<value>100</value>
</property>
<property>
<name>llama.am.throttling.maximum.queued.reservations.root.production</name>
<value>200</value>
</property>
</configuration>
Each Impala node makes independent decisions about whether to run incoming queries immediately or to queue
them. These decisions rely on information passed back and forth between nodes by
the statestore service. If a sudden surge in requests causes more queries than anticipated to run concurrently,
then as a fallback, the overall Impala memory limit and the Linux cgroups mechanism serve as hard limits to
prevent overallocation of memory, by cancelling queries if necessary.
If you have trouble getting a query to run because its estimated memory usage is too high, you can override the
estimate by setting the MEM_LIMIT query option in impala-shell, then issuing the query through the shell in
the same session. The MEM_LIMIT value is treated as the estimated amount of memory, overriding the estimate
that Impala would generate based on table and column statistics. This value is used only for making admission
control decisions, and is not pre-allocated by the query.
In impala-shell, you can also specify which resource pool to direct queries to by setting the REQUEST_POOL
query option. (This option was named YARN_POOL during the CDH 5 beta period.)
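For example, a session that overrides both options before running a large join might look like this in impala-shell;
the pool name and memory figure are only placeholders:

[impala-host:21000] > set request_pool=production;
[impala-host:21000] > set mem_limit=2g;
[impala-host:21000] > select * from huge_table join enormous_table using (id);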
The statements affected by the admission control feature are primarily queries, but also include statements
that write data such as INSERT and CREATE TABLE AS SELECT. Most write operations in Impala are not
resource-intensive, but inserting into a Parquet table can require substantial memory due to buffering a
substantial amount of data before writing out each Parquet data block. See Loading Data into Parquet Tables
on page 247 for instructions about inserting data efficiently into Parquet tables.
Although admission control does not scrutinize memory usage for other kinds of DDL statements, if a query is
queued due to a limit on concurrent queries or memory usage, subsequent statements in the same session are
also queued so that they are processed in the correct order:
-- This query could be queued to avoid out-of-memory at times of heavy load.
select * from huge_table join enormous_table using (id);
-- If so, this subsequent statement in the same session is also queued
-- until the previous statement completes.
drop table huge_table;
If you set up different resource pools for different users and groups, consider reusing any classifications and
hierarchy you developed for use with Sentry security. See Enabling Sentry Authorization for Impala for details.
For details about all the Fair Scheduler configuration settings, see the Apache wiki, in particular the tags such
as <queue> and <aclSubmitApps> to map users and groups to particular resource pools (queues).
The Llama Daemon
Llama is a system that mediates resource management between Cloudera Impala and Hadoop YARN. Llama
enables Impala to reserve, use, and release resource allocations in a Hadoop cluster. Llama is only required if
resource management is enabled in Impala.
By default, YARN allocates resources bit-by-bit as needed by MapReduce jobs. Impala needs all resources
available at the same time, so that intermediate results can be exchanged between cluster nodes, and queries
do not stall partway through waiting for new resources to be allocated. Llama is the intermediary process that
ensures all requested resources are available before each Impala query actually begins.
For Llama installation instructions, see Llama installation.
For management through Cloudera Manager, see Adding the Llama Role.
The Llama daemon is responsible for forwarding resource requests to YARN and coordinating with Impala so that queries only begin executing when all needed
resources have been granted by YARN.
For information about setting up the YARN and Llama services, see the instructions for YARN and Llama in the
CDH 5 Installation Guide.
-llama_max_request_attempts: Maximum number of times a request to reserve, expand, or release
resources is retried until the request is cancelled. Attempts are only counted after Impala is registered with
Llama. That is, a request survives at most llama_max_request_attempts - 1 re-registrations. Defaults to
5.
-llama_registration_timeout_secs: Maximum number of seconds that Impala will attempt to register
or re-register with Llama. If registration is unsuccessful, Impala cancels the action with an error, which could
result in an impalad startup failure or a cancelled query. A setting of -1 means try indefinitely. Defaults to
30.
-llama_registration_wait_secs: Number of seconds to wait between attempts during Llama registration.
Defaults to 3.
Setting the Idle Query and Idle Session Timeouts for impalad
To keep long-running queries or idle sessions from tying up cluster resources, you can set timeout intervals for
both individual queries and entire sessions. Specify the following startup options for the impalad daemon; an
example of setting both options follows this list:
The --idle_query_timeout option specifies the time in seconds after which an idle query is cancelled. This
could be a query whose results were all fetched but was never closed, or one whose results were partially
fetched and then the client program stopped requesting further results. This condition is most likely to occur
in a client program using the JDBC or ODBC interfaces, rather than in the interactive impala-shell interpreter.
Once the query is cancelled, the client program cannot retrieve any further results.
The --idle_session_timeout option specifies the time in seconds after which an idle session is expired.
A session is idle when no activity is occurring for any of the queries in that session, and the session has not
started any new queries. Once a session is expired, you cannot issue any new query requests to it. The session
remains open, but the only operation you can perform is to close it. The default value of 0 means that sessions
never expire.
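For example, on a package-based installation you might add something like the following to /etc/default/impala;
the values are arbitrary placeholders:

# Hypothetical settings: cancel queries that are idle for 1 hour, and expire
# sessions that are idle for 2 hours.
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
    -idle_query_timeout=3600 \
    -idle_session_timeout=7200"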
For instructions on changing impalad startup options, see Modifying Impala Startup Options.
3. Copy the keytab file from the proxy host to all other hosts in the cluster that run the impalad daemon. (For
optimal performance, impalad should be running on all DataNodes in the cluster.) Put the keytab file in a
secure location on each of these other hosts.
4. On systems not managed by Cloudera Manager, add an entry impala/actual_hostname@realm to the
keytab on each host running the impalad daemon.
5. For each impalad node, merge the existing keytab with the proxy's keytab using ktutil, producing a new
keytab file. For example:
$ ktutil
ktutil: read_kt proxy.keytab
ktutil: read_kt impala.keytab
ktutil: write_kt proxy_impala.keytab
ktutil: quit
6. Make sure that the impala user has permission to read this merged keytab file.
7. Change some configuration settings for each host in the cluster that participates in the load balancing.
In the impalad option definition, or the Cloudera Manager safety valve (Cloudera Manager 4) or advanced
configuration snippet (Cloudera Manager 5), add:
--principal=impala/proxy_host@realm
--be_principal=impala/actual_host@realm
--keytab_file=path_to_merged_keytab
Note: Every host has a different --be_principal because the actual host name is different
on each host.
On a cluster managed by Cloudera Manager, create a role group to set the configuration values from the
preceding step on a per-host basis.
On a cluster not managed by Cloudera Manager, see Modifying Impala Startup Options for the procedure
to modify the startup options.
8. On a cluster managed by Cloudera Manager, restart the Impala service.
On a cluster not managed by Cloudera Manager, restart the impalad daemons on all hosts in the cluster,
as well as the statestored and catalogd daemons.
#    by adding the '-r' option to the SYSLOGD_OPTIONS in
#    /etc/sysconfig/syslog
#
# 2) configure local2 events to go to the /var/log/haproxy.log
#    file. A line like the following can be added to
#    /etc/sysconfig/syslog
#
#    local2.*    /var/log/haproxy.log
#
    log         127.0.0.1 local0
    log         127.0.0.1 local1 notice
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#
# You might need to adjust timing values to prevent timeouts.
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    maxconn                 3000
    contimeout              5000
    clitimeout              50000
    srvtimeout              50000

#
# This sets up the admin page for HA Proxy at port 25002.
#
listen stats :25002
    balance
    mode http
    stats enable
    stats auth username:password

# This is the setup for Impala. Impala client connect to load_balancer_host:25003.
# HAProxy will balance connections among the list of servers listed below.
# The list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver.
# For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.
listen impala :25003
    mode tcp
    option tcplog
    balance leastconn

    server symbolic_name_1 impala-host-1.example.com:21000
    server symbolic_name_2 impala-host-2.example.com:21000
    server symbolic_name_3 impala-host-3.example.com:21000
    server symbolic_name_4 impala-host-4.example.com:21000
Use compact binary file formats where practical. Numeric and time-based data in particular can be stored
in more compact form in binary data files. Depending on the file format, various compression and encoding
features can reduce file size even further. You can specify the STORED AS clause as part of the CREATE TABLE
statement, or ALTER TABLE with the SET FILEFORMAT clause for an existing table or partition within a
partitioned table. See How Impala Works with Hadoop File Formats on page 239 for details about file formats,
especially Using the Parquet File Format with Impala Tables on page 246. See CREATE TABLE Statement on
page 90 and ALTER TABLE Statement on page 79 for syntax details.
You manage underlying data files differently depending on whether the corresponding Impala table is defined
as an internal or external table:
Use the DESCRIBE FORMATTED statement to check if a particular table is internal (managed by Impala)
or external, and to see the physical location of the data files in HDFS. See DESCRIBE Statement on page
96 for details.
For Impala-managed (internal) tables, use DROP TABLE statements to remove data files. See DROP
TABLE Statement on page 101 for details.
For tables not managed by Impala (external tables), use appropriate HDFS-related commands such as
hadoop fs, hdfs dfs, or distcp, to create, move, copy, or delete files within HDFS directories that are
accessible by the impala user. Issue a REFRESH table_name statement after adding or removing any
files from the data directory of an external table. See REFRESH Statement on page 116 for details.
Use external tables to reference HDFS data files in their original location. With this technique, you avoid
copying the files, and you can map more than one Impala table to the same set of data files. When you
drop the Impala table, the data files are left undisturbed. See External Tables on page 74 for details.
Use the LOAD DATA statement to move HDFS files into the data directory for an Impala table from inside
Impala, without the need to specify the HDFS path of the destination directory. This technique works for
both internal and external tables. See LOAD DATA Statement on page 114 for details.
Make sure that the HDFS trashcan is configured correctly. When you remove files from HDFS, the space
might not be reclaimed for use by other files until sometime later, when the trashcan is emptied. See DROP
TABLE Statement on page 101 and the FAQ entry, Why is space not freed up when I issue DROP TABLE? in
the SQL section for details. See User Account Requirements for permissions needed for the HDFS trashcan
to operate correctly.
Drop all tables in a database before dropping the database itself. See DROP DATABASE Statement on page
100 for details.
Clean up temporary files after failed INSERT statements. If an INSERT statement encounters an error, and
you see a directory named .impala_insert_staging left behind in the data directory for the table, it might
contain temporary data files taking up space in HDFS. You might be able to salvage these data files, for
example if they are complete but could not be moved into place due to a permission error. Or, you might
delete those files through commands such as hadoop fs or hdfs dfs, to reclaim space before re-trying the
INSERT. Issue DESCRIBE FORMATTED table_name to see the HDFS path where you can check for temporary
files.
By default, intermediate files used during large sort operations are stored in the directory
/tmp/impala-scratch. These files are removed when the sort operation finishes. (Multiple concurrent
queries can perform ORDER BY queries that use the external sort technique, without any name conflicts for
these temporary files.) You can specify a different location by starting the impalad daemon with the
--scratch_dirs="path_to_directory" configuration option. The scratch directory must be on the local
filesystem, not in HDFS. You might specify different directory paths for different hosts, depending on the
capacity and speed of the available storage devices. Impala will not start if it cannot create or read and write
files in the scratch directory. If there is less than 1 GB free on the filesystem where that directory resides,
Impala still runs, but writes a warning message to its log.
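For example, on hosts with a larger or faster local drive, you might point the scratch directory at that drive by adding an option such as the following to the impalad startup options (the path here is illustrative only):
--scratch_dirs="/data1/impala-scratch"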
Comments
Impala supports the familiar styles of SQL comments:
All text from a -- sequence to the end of the line is considered a comment and ignored. This type of comment
can occur on a single line by itself, or after all or part of a statement.
All text from a /* sequence to the next */ sequence is considered a comment and ignored. This type of
comment can stretch over multiple lines. This type of comment can occur on one or more lines by itself, in
the middle of a statement, or before or after a statement.
For example:
-- This line is a comment about a table.
create table ...;
/*
This is a multi-line comment about a query.
*/
select ...;
select * from t /* This is an embedded comment about a query. */ where ...;
select * from t -- This is a trailing comment within a multi-line command.
where ...;
Data Types
Impala supports a set of data types that you can use for table columns, expression values, and function arguments
and return values.
Related information: Literals on page 63, INT Data Type on page 59, SMALLINT Data Type on page 60, TINYINT
Data Type on page 62, Mathematical Functions on page 139
precision represents the total number of digits that can be represented by the column, regardless of the location
of the decimal point. This value must be between 1 and 38. For example, representing integer values up to 9999,
and floating-point values up to 99.99, both require a precision of 4. You can also represent corresponding negative
values, without any change in the precision. For example, the range -9999 to 9999 still only requires a precision
of 4.
scale represents the number of fractional digits. This value must be less than or equal to precision. A scale of
0 produces integral values, with no fractional part. If precision and scale are equal, all the digits come after the
decimal point, making all the values between 0 and 0.999... or 0 and -0.999...
When precision and scale are omitted, a DECIMAL value is treated as DECIMAL(9,0), that is, an integer value
ranging from -999,999,999 to 999,999,999. This is the largest DECIMAL value that can still be represented in
4 bytes. If precision is specified but scale is omitted, Impala uses a value of zero for the scale.
Both precision and scale must be specified as integer literals, not any other kind of constant expressions.
To check the precision or scale for arbitrary values, you can call the precision() and scale() built-in functions.
For example, you might use these values to figure out how many characters are required for various fields in a
report, or to understand the rounding characteristics of a formula as applied to a particular DECIMAL column.
Range:
The maximum precision value is 38. Thus, the largest integral value is represented by DECIMAL(38,0) (999...
with 9 repeated 38 times). The most precise fractional value (between 0 and 1, or 0 and -1) is represented by
DECIMAL(38,38), with 38 digits to the right of the decimal point. The value closest to 0 would be .0000...1 (37
zeros and the final 1). The value closest to 1 would be .999... (9 repeated 38 times).
For a given precision and scale, the range of DECIMAL values is the same in the positive and negative directions.
For example, DECIMAL(4,2) can represent from -99.99 to 99.99. This is different from other integral numeric
types where the positive and negative bounds differ slightly.
When you use DECIMAL values in arithmetic expressions, the precision and scale of the result value are determined
as follows:
For addition and subtraction, the precision and scale are based on the maximum possible result, that is, if
all the digits of the input values were 9s and the absolute values were added together.
For multiplication, the precision is the sum of the precisions of the input values. The scale is the sum of the
scales of the input values.
For division, Impala sets the precision and scale to values large enough to represent the whole and fractional
parts of the result.
For UNION, the scale is the larger of the scales of the input values, and the precision is increased if necessary
to accommodate any additional fractional digits. If the same input value has the largest precision and the
largest scale, the result value has the same precision and scale. If one value has a larger precision but smaller
scale, the scale of the result value is increased. For example, DECIMAL(20,2) UNION DECIMAL(8,6) produces
a result of type DECIMAL(24,6). The extra 4 fractional digits of scale (6-2) are accommodated by extending
the precision by the same amount (20+4).
To double-check, you can always call the PRECISION() and SCALE() functions on the results of an arithmetic
expression to see the relevant values, or use a CREATE TABLE AS SELECT statement to define a column
based on the return type of the expression.
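For example, the following sketch confirms the rule for multiplication, where a DECIMAL(5,2) value multiplied by a DECIMAL(6,3) value produces a DECIMAL(11,5) result:
select precision(cast(99.44 as decimal(5,2)) * cast(9.876 as decimal(6,3))) as result_precision,
  scale(cast(99.44 as decimal(5,2)) * cast(9.876 as decimal(6,3))) as result_scale;
-- Returns 11 and 5: the precisions (5+6) and the scales (2+3) are added together.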
Compatibility:
Using the DECIMAL type is only supported under CDH 5.1.0 and higher.
Use the DECIMAL data type in Impala for applications where you used the NUMBER data type in Oracle. The
Impala DECIMAL type does not support the Oracle idioms of * for scale or negative values for precision.
Conversions and casting:
To avoid potential conversion errors, you can use CAST() to convert DECIMAL values to FLOAT, TINYINT, SMALLINT,
INT, BIGINT, STRING, TIMESTAMP, or BOOLEAN. You can use exponential notation in DECIMAL literals or when
casting from STRING, for example 1.0e6 to represent one million.
If you cast a value with more fractional digits than the scale of the destination type, any extra fractional digits
are truncated (not rounded). Casting a value to a target type with not enough precision produces a result of NULL
and displays a runtime warning.
[localhost:21000] > select cast(1.239 as decimal(3,2));
+-----------------------------+
| cast(1.239 as decimal(3,2)) |
+-----------------------------+
| 1.23                        |
+-----------------------------+
[localhost:21000] > select cast(1234 as decimal(3));
+----------------------------+
| cast(1234 as decimal(3,0)) |
+----------------------------+
| NULL                       |
+----------------------------+
WARNINGS: Expression overflowed, returning NULL
When you specify integer literals, for example in INSERT ... VALUES statements or arithmetic expressions,
those numbers are interpreted as the smallest applicable integer type. You must use CAST() calls for some
combinations of integer literals and DECIMAL precision. For example, INT has a maximum value that is 10 digits
long, TINYINT has a maximum value that is 3 digits long, and so on. If you specify a value such as 123456 to go
into a DECIMAL column, Impala checks if the column has enough precision to represent the largest value of that
integer type, and raises an error if not. Therefore, use an expression like CAST(123456 AS DECIMAL(9,0)) for
DECIMAL columns with precision 9 or less, CAST(50 AS DECIMAL(2,0)) for DECIMAL columns with precision
2 or less, and so on. For DECIMAL columns with precision 10 or greater, Impala automatically interprets the value
as the appropriate DECIMAL type without requiring a CAST().
Be aware that in memory and for binary file formats such as Parquet or Avro, DECIMAL(10) or higher consumes
8 bytes while DECIMAL(9) (the default for DECIMAL) or lower consumes 4 bytes. Therefore, to conserve space
in large tables, use the smallest-precision DECIMAL type that is appropriate and CAST() literal values where
necessary, rather than declaring DECIMAL columns with high precision for convenience.
To represent a very large or precise DECIMAL value as a literal, for example one that contains more digits than
can be represented by a BIGINT literal, use a quoted string or a floating-point value for the number, and CAST()
to the desired DECIMAL type:
insert into decimals_38_5 values (1), (2), (4), (8), (16), (1024), (32768), (65536),
(1000000),
(cast("999999999999999999999999999999" as decimal(38,5))),
(cast(999999999999999999999999999999. as decimal(38,5)));
The result of an aggregate function such as MAX(), SUM(), or AVG() on DECIMAL values is promoted to a
precision of 38, with the same scale as the underlying column. Thus, the result can represent the largest
possible value at that particular scale.
STRING columns, literals, or expressions can be converted to DECIMAL as long as the overall number of digits
and digits to the right of the decimal point fit within the specified precision and scale for the declared DECIMAL
type. By default, a DECIMAL value with no specified scale or precision can hold a maximum of 9 digits of an
integer value. If there are more digits in the string value than are allowed by the DECIMAL scale and precision,
the result is NULL.
The following examples demonstrate how STRING values with integer and fractional parts are represented
when converted to DECIMAL. If the scale is 0, the number is treated as an integer value with a maximum
of precision digits. If the scale is greater than 0, the precision must be increased to account for the digits both
to the left and right of the decimal point. As the scale increases, output values are printed with additional
trailing zeros after the decimal point if needed. Any trailing zeros after the decimal point in the STRING value
must fit within the number of digits specified by the precision.
[localhost:21000] > select cast('100' as decimal); -- Small integer value fits within 9 digits of precision.
+-----------------------------+
| cast('100' as decimal(9,0)) |
+-----------------------------+
| 100                         |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(3,0)); -- Small integer value fits within 3 digits of precision.
+-----------------------------+
| cast('100' as decimal(3,0)) |
+-----------------------------+
| 100                         |
+-----------------------------+
[localhost:21000] > select cast('100' as decimal(2,0)); -- 2 digits of precision is not enough to hold this value; the result is NULL.
Most built-in arithmetic functions such as SIN() and COS() continue to accept only DOUBLE values because
they are so commonly used in scientific contexts for calculations of IEEE 754-compliant values. The built-in
functions that accept and return DECIMAL are:
ABS()
CEIL()
COALESCE()
FLOOR()
FNV_HASH()
GREATEST()
IF()
ISNULL()
LEAST()
NEGATIVE()
NULLIF()
POSITIVE()
PRECISION()
ROUND()
SCALE()
TRUNCATE()
ZEROIFNULL()
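For example, a quick sketch of calling a few of these functions with DECIMAL arguments:
select abs(cast(-1.25 as decimal(5,2))) as absolute_value,
  round(cast(8.567 as decimal(9,3)), 2) as rounded,
  zeroifnull(cast(null as decimal(9,3))) as null_replaced;
-- Returns 1.25, 8.57, and 0; each result is itself a DECIMAL value.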
When a DECIMAL value is converted to any of the integer types, any fractional part is truncated (that is,
rounded towards zero):
[localhost:21000] > create table num_dec_days (x decimal(4,1));
[localhost:21000] > insert into num_dec_days values (1), (2), (cast(4.5 as
decimal(4,1)));
[localhost:21000] > insert into num_dec_days values (cast(0.1 as decimal(4,1))),
(cast(.9 as decimal(4,1))), (cast(9.1 as decimal(4,1))), (cast(9.9 as
decimal(4,1)));
[localhost:21000] > select cast(x as int) from num_dec_days;
+----------------+
| cast(x as int) |
+----------------+
| 1              |
| 2              |
You cannot directly cast TIMESTAMP or BOOLEAN values to or from DECIMAL values. You can turn a DECIMAL
value into a time-related representation using a two-step process, by converting it to an integer value and
then using that result in a call to a date and time function such as from_unixtime().
[localhost:21000] > select from_unixtime(cast(cast(1000.0 as decimal) as bigint));
+-------------------------------------------------------------+
| from_unixtime(cast(cast(1000.0 as decimal(9,0)) as bigint)) |
+-------------------------------------------------------------+
| 1970-01-01 00:16:40                                         |
+-------------------------------------------------------------+
[localhost:21000] > select now() + interval cast(x as int) days from num_dec_days;
-- x is a DECIMAL column.
[localhost:21000] > create table num_dec_days (x decimal(4,1));
[localhost:21000] > insert into num_dec_days values (1), (2), (cast(4.5 as
decimal(4,1)));
[localhost:21000] > select now() + interval cast(x as int) days from num_dec_days;
-- The 4.5 value is truncated to 4 and becomes '4 days'.
+--------------------------------------+
| now() + interval cast(x as int) days |
+--------------------------------------+
| 2014-05-13 23:11:55.163284000        |
| 2014-05-14 23:11:55.163284000        |
| 2014-05-16 23:11:55.163284000        |
+--------------------------------------+
Because values in INSERT statements are checked rigorously for type compatibility, be prepared to use
CAST() function calls around literals, column references, or other expressions that you are inserting into a
DECIMAL column.
DECIMAL differences from integer and floating-point types:
With the DECIMAL type, you are concerned with the number of overall digits of a number rather than powers of
2 (as in TINYINT, SMALLINT, and so on). Therefore, the limits with integral values of DECIMAL types fall around
99, 999, 9999, and so on rather than 32767, 65535, 2^32-1, and so on. For fractional values, you do not need to
account for imprecise representation of the fractional part according to the IEEE 754 standard (as in FLOAT and
DOUBLE). Therefore, when you insert a fractional value into a DECIMAL column, you can compare, sum, query,
GROUP BY, and so on that column and get back the original values rather than some close but not identical
value.
FLOAT and DOUBLE can cause problems or unexpected behavior due to inability to precisely represent certain
fractional values, for example dollar and cents values for currency. You might find output values slightly different
than you inserted, equality tests that do not match precisely, or unexpected values for GROUP BY columns.
DECIMAL can help reduce unexpected behavior and rounding errors, at the expense of some performance overhead
for assignments and comparisons.
Literals and expressions:
When you use an integer literal such as 1 or 999 in a SQL statement, depending on the context, Impala will
treat it as either the smallest appropriate DECIMAL type, or the smallest integer type (TINYINT, SMALLINT,
INT, or BIGINT). To minimize memory usage, Impala prefers to treat the literal as the smallest appropriate
integer type.
When you use a floating-point literal such as 1.1 or 999.44 in a SQL statement, depending on the context,
Impala will treat it as either the smallest appropriate DECIMAL type, or the smallest floating-point type (FLOAT
or DOUBLE). To avoid loss of accuracy, Impala prefers to treat the literal as a DECIMAL.
Parquet and Avro tables use binary formats. In these tables, Impala stores each value in as few bytes as
possible depending on the precision specified for the DECIMAL column.
In memory, DECIMAL values with precision of 9 or less are stored in 4 bytes.
In memory, DECIMAL values with precision of 10 through 18 are stored in 8 bytes.
In memory, DECIMAL values with precision greater than 18 are stored in 16 bytes.
File format considerations:
The DECIMAL data type can be stored in any of the file formats supported by Impala, as described in How
Impala Works with Hadoop File Formats on page 239. Impala only writes to tables that use the Parquet and
text formats, so those formats are the focus for file format compatibility.
Impala can query Avro, RCFile, or SequenceFile tables containing DECIMAL columns, created by other Hadoop
components, on CDH 5.1 or higher only.
You can use DECIMAL columns in Impala tables that are mapped to HBase tables. Impala can query and
insert into such tables.
Text, RCFile, and SequenceFile tables all use ASCII-based formats. In these tables, each DECIMAL value takes
up as many bytes as there are digits in the value, plus an extra byte if the decimal point is present. The binary
format of Parquet or Avro files offers more compact storage for DECIMAL columns.
Parquet and Avro tables use binary formats. In these tables, Impala stores each value in 4, 8, or 16 bytes
depending on the precision specified for the DECIMAL column.
Parquet files containing DECIMAL columns are not expected to be readable under CDH 4. See the Compatibility
section for details.
UDF considerations: When writing a C++ UDF, use the DecimalVal data type defined in
/usr/include/impala_udf/udf.h.
Partitioning:
You can use a DECIMAL column as a partition key. Doing so provides a better match between the partition key
values and the HDFS directory names than using a DOUBLE or FLOAT partitioning column.
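For example, a minimal sketch (the table and column names here are hypothetical):
create table taxed_items (item string, price decimal(9,2))
  partitioned by (tax_rate decimal(5,3));
alter table taxed_items add partition (tax_rate=7.25);
-- Each partition is a separate HDFS directory whose name embeds the exact DECIMAL value.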
Schema evolution:
For text-based formats (text, RCFile, and SequenceFile tables), you can issue an ALTER TABLE ... REPLACE
COLUMNS statement to change the precision and scale of an existing DECIMAL column. As long as the values
in the column fit within the new precision and scale, they are returned correctly by a query. Any values that
do not fit within the new precision and scale are returned as NULL, and Impala reports the conversion error.
Leading zeros do not count against the precision value, but trailing zeros after the decimal point do.
[localhost:21000] > create table text_decimals (x string);
[localhost:21000] > insert into text_decimals values ("1"), ("2"), ("99.99"),
("1.234"), ("000001"), ("1.000000000");
[localhost:21000] > select * from text_decimals;
+-------------+
| x           |
+-------------+
| 1           |
| 2           |
| 99.99       |
| 1.234       |
For binary formats (Parquet and Avro tables), although an ALTER TABLE ... REPLACE COLUMNS statement
that changes the precision or scale of a DECIMAL column succeeds, any subsequent attempt to query the
changed column results in a fatal error. (The other columns can still be queried successfully.) This is because
the metadata about the columns is stored in the data files themselves, and ALTER TABLE does not actually
make any updates to the data files. If the metadata in the data files disagrees with the metadata in the
metastore database, Impala cancels the query.
Examples:
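A minimal sketch of creating, loading, and querying a table with DECIMAL columns (the table and column names here are hypothetical):
create table decimal_demo (x decimal, y decimal(5,2), z decimal(25,0));
insert into decimal_demo values (5, 99.44, 123456), (300, 6.7, 999999999);
select x + y, y * 2, z / 98.6 from decimal_demo;
select cast(1000.5 as decimal);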
Restrictions:
Currently, the COMPUTE STATS statement under CDH 4 does not store any statistics for DECIMAL columns. When
Impala runs under CDH 5, which has better support for DECIMAL in the metastore database, COMPUTE STATS
does collect statistics for DECIMAL columns and Impala uses the statistics to optimize query performance.
Related information: Literals on page 63 for how numeric literals are sometimes interpreted as DECIMAL values;
Mathematical Functions on page 139 for the PRECISION() and SCALE() functions, and other functions whose
signatures now include DECIMAL arguments or return values
Related information: Literals on page 63, TINYINT Data Type on page 62, BIGINT Data Type on page 50, SMALLINT
Data Type on page 60, Mathematical Functions on page 139
Parquet considerations:
Physically, Parquet files represent TINYINT and SMALLINT values as 32-bit integers. Although Impala rejects
attempts to insert out-of-range values into such columns, if you create a new table with the CREATE TABLE
... LIKE PARQUET syntax, any TINYINT or SMALLINT columns in the original table turn into INT columns in
the new table.
Related information: Literals on page 63, TINYINT Data Type on page 62, BIGINT Data Type on page 50, INT Data
Type on page 59, Mathematical Functions on page 139
Time zones: Impala does not store timestamps using the local timezone to avoid undesired results from
unexpected time zone issues. Timestamps are stored relative to GMT.
Conversions: Impala automatically converts STRING literals of the correct format into TIMESTAMP values.
Timestamp values are accepted in the format YYYY-MM-DD HH:MM:SS.sssssssss, and can consist of just the
date, or just the time, with or without the fractional second portion. For example, you can specify TIMESTAMP
values such as '1966-07-30', '08:30:00', or '1985-09-25 17:45:30.005'. You can cast an integer or
floating-point value N to TIMESTAMP, producing a value that is N seconds past the start of the epoch date (January
1, 1970).
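For example:
select cast('1966-07-30' as timestamp);
select cast('08:30:00' as timestamp);
select cast(100000 as timestamp);  -- 100,000 seconds past the epoch: 1970-01-02 03:46:40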
Note: In Impala 1.3 and higher, the FROM_UNIXTIME() and UNIX_TIMESTAMP() functions allow a
wider range of format strings, with more flexibility in element order, repetition of letter placeholders,
and separator characters. See Date and Time Functions on page 146 for details.
Partitioning:
Although you cannot use a TIMESTAMP column as a partition key, you can extract the individual years, months,
days, hours, and so on and partition based on those columns. Because the partition key column values are
represented in HDFS directory names, rather than as fields in the data files themselves, you can also keep the
original TIMESTAMP values if desired, without duplicating data or wasting storage space. See Partition Key
Columns on page 236 for more details on partitioning with date and time values.
[localhost:21000] > create table timeline (event string) partitioned by (happened
timestamp);
ERROR: AnalysisException: Type 'TIMESTAMP' is not supported as partition-column type
in column: happened
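Instead, you might extract the date parts into separate partition key columns, as in this sketch (the staged_events source table is hypothetical); the original TIMESTAMP column is kept alongside the extracted partition key columns:
create table timeline (event string, happened timestamp)
  partitioned by (year int, month int, day int);
insert into timeline partition (year, month, day)
  select event, happened, year(happened), month(happened), day(happened)
  from staged_events;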
Restrictions:
Currently, Avro tables cannot contain TIMESTAMP columns. If you need to store date and time values in Avro
tables, as a workaround you can use a STRING representation of the values, convert the values to BIGINT with
the UNIX_TIMESTAMP() function, or create separate numeric columns for individual date and time fields using
the EXTRACT() function.
Related information: Literals on page 63; to convert to or from different date formats, or perform date arithmetic,
use the date and time functions described in Date and Time Functions on page 146. In particular, the
from_unixtime() function requires a case-sensitive format string such as "yyyy-MM-dd HH:mm:ss.SSSS",
matching one of the allowed variations of TIMESTAMP value (date plus time, only date, only time, optional fractional
seconds).
Related information:
See SQL Differences Between Impala and Hive on page 177 for details about differences in TIMESTAMP handling
between Impala and Hive.
Parquet considerations:
Physically, Parquet files represent TINYINT and SMALLINT values as 32-bit integers. Although Impala rejects
attempts to insert out-of-range values into such columns, if you create a new table with the CREATE TABLE
... LIKE PARQUET syntax, any TINYINT or SMALLINT columns in the original table turn into INT columns in
the new table.
Related information: Literals on page 63, INT Data Type on page 59, BIGINT Data Type on page 50, SMALLINT
Data Type on page 60, Mathematical Functions on page 139
Literals
Each of the Impala data types has corresponding notation for literal values of that type. You specify literal values
in SQL statements, such as in the SELECT list or WHERE clause of a query, or as an argument to a function call.
See Data Types on page 49 for a complete list of types, ranges, and conversion rules.
Numeric Literals
To write literals for the integer types (TINYINT, SMALLINT, INT, and BIGINT), use a sequence of digits with
optional leading zeros.
To write literals for the floating-point types (DECIMAL, FLOAT, and DOUBLE), use a sequence of digits with an
optional decimal point (. character). To preserve accuracy during arithmetic expressions, Impala interprets
floating-point literals as the DECIMAL type with the smallest appropriate precision and scale, until required by
the context to convert the result to FLOAT or DOUBLE.
Integer values are promoted to floating-point when necessary, based on the context.
You can also use exponential notation by including an e character. For example, 1e6 is 1 times 10 to the power
of 6 (1 million). A number in exponential notation is always interpreted as floating-point.
String Literals
String literals are quoted using either single or double quotation marks. You can use either kind of quotes for
string literals, even both kinds for different literals within the same statement.
Escaping special characters:
To encode special characters within a string literal, precede them with the backslash (\) escape character:
\t represents a tab.
\n represents a newline. This might cause extra line breaks in impala-shell output.
\r represents a linefeed. This might cause unusual formatting (making it appear that some content is
overwritten) in impala-shell output.
\b represents a backspace. This might cause unusual formatting (making it appear that some content is
overwritten) in impala-shell output.
\0 represents an ASCII nul character (not the same as a SQL NULL). This might not be visible in impala-shell
output.
\Z represents a DOS end-of-file character. This might not be visible in impala-shell output.
\% and \_ can be used to escape wildcard characters within the string passed to the LIKE operator.
\ followed by 3 octal digits represents the ASCII code of a single character; for example, \101 is ASCII 65, the
character A.
Use two consecutive backslashes (\\) to prevent the backslash from being interpreted as an escape character.
Use the backslash to escape single or double quotation mark characters within a string literal, if the literal
is enclosed by the same type of quotation mark.
If the character following the \ does not represent the start of a recognized escape sequence, the character
is passed through unchanged.
Quotes within quotes:
To include a single quotation character within a string value, enclose the literal with either single or double
quotation marks, and escape the single quote as a \' sequence. (The requirement to escape a single quote
inside double quotes might be lifted in later releases; if so, the escape character will be optional in that case.)
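For example, a few sketches of escaping within string literals:
select 'abc\t123';      -- \t represents a tab inside the value.
select 'It\'s a test';  -- Escape a single quote inside a single-quoted literal.
select "It\'s a test";  -- Currently, the single quote is escaped even inside double quotes.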
Boolean Literals
For BOOLEAN values, the literals are TRUE and FALSE, with no quotation marks and case-insensitive.
Examples:
select true;
select * from t1 where assertion = false;
select case bool_col when true then 'yes' when false then 'no' else 'null' end from t1;
Timestamp Literals
For TIMESTAMP values, Impala automatically converts STRING literals of the correct format into TIMESTAMP
values. Timestamp values are accepted in the format YYYY-MM-DD HH:MM:SS.sssssssss, and can consist of
just the date, or just the time, with or without the fractional second portion. For example, you can specify
TIMESTAMP values such as '1966-07-30', '08:30:00', or '1985-09-25 17:45:30.005'. You can cast an
integer or floating-point value N to TIMESTAMP, producing a value that is N seconds past the start of the epoch
date (January 1, 1970).
You can also use INTERVAL expressions to add or subtract from timestamp literal values, such as '1966-07-30'
+ INTERVAL 5 YEARS + INTERVAL 3 DAYS. See TIMESTAMP Data Type on page 61 for details.
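For example, a minimal query using that expression, with an explicit cast of the literal:
select cast('1966-07-30' as timestamp) + interval 5 years + interval 3 days;
-- Returns 1971-08-02 00:00:00.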
NULL
The notion of NULL values is familiar from all kinds of database systems, but each SQL dialect can have its own
behavior and restrictions on NULL values. For Big Data processing, the precise semantics of NULL values are
significant. Impala treats NULL values as follows:
There is no NOT NULL clause when defining a column to prevent NULL values in that column.
There is no DEFAULT clause to specify a non-NULL default value.
If an INSERT operation mentions some columns but not others, the unmentioned columns contain NULL for
all inserted rows.
In Impala 1.2.1 and higher, all NULL values come at the end of the result set for ORDER BY ... ASC queries,
and at the beginning of the result set for ORDER BY ... DESC queries. In effect, NULL is considered greater
than all other values for sorting purposes. The original Impala behavior always put NULL values at the end,
even for ORDER BY ... DESC queries. The new behavior in Impala 1.2.1 makes Impala more compatible
with other popular database systems. In Impala 1.2.1 and higher, you can override or specify the sorting
behavior for NULL by adding the clause NULLS FIRST or NULLS LAST at the end of the ORDER BY clause.
Note: Because the NULLS FIRST and NULLS LAST keywords are not currently available in Hive
queries, any views you create using those keywords will not be available through Hive.
In all other contexts besides sorting with ORDER BY, comparing a NULL to anything else returns NULL, making
the comparison meaningless. For example, 10 > NULL produces NULL, 10 < NULL also produces NULL, 5
BETWEEN 1 AND NULL produces NULL, and so on.
Several built-in functions serve as shorthand for evaluating expressions and returning NULL, 0, or some other
substitution value depending on the expression result: ifnull(), isnull(), nvl(), nullif(), nullifzero(),
and zeroifnull(). See Conditional Functions on page 151 for details.
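For example, sketches of controlling NULL placement during sorting and of substituting a value of 0 (the table and column names are hypothetical):
select product, price from inventory order by price desc nulls last;
select product, zeroifnull(price) as price_or_zero from inventory;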
SQL Operators
SQL operators are a class of comparison functions that are widely used within the WHERE clauses of SELECT
statements.
Arithmetic Operators
The arithmetic operators use expressions with a left-hand argument, the operator, and then (in most cases) a
right-hand argument.
+ and -: Can be used either as unary or binary operators.
BETWEEN Operator
In a WHERE clause, compares an expression to both a lower and upper bound. The comparison is successful if
the expression is greater than or equal to the lower bound, and less than or equal to the upper bound. If the
bound values are switched, so that the lower bound is greater than the upper bound, the comparison does not match any values.
Syntax: expression BETWEEN lower_bound AND upper_bound
Data types: Typically used with numeric data types. Works with any data type, although not very practical for
BOOLEAN values. (BETWEEN false AND true will match all BOOLEAN values.) Use CAST() if necessary to ensure
the lower and upper bound values are compatible types. Call string or date/time functions if necessary to extract
or transform the relevant portion to compare, especially if the value can be transformed into a number.
Usage notes: Be careful when using short string operands. A longer string that starts with the upper bound
value will not be included, because it is considered greater than the upper bound. For example, BETWEEN 'A'
and 'M' would not match the string value 'Midway'. Use functions such as upper(), lower(), substr(),
trim(), and so on if necessary to ensure the comparison works as expected.
Examples:
-- Retrieve data for January through June, inclusive.
select c1 from t1 where month between 1 and 6;
-- Retrieve data for names beginning with 'A' through 'M' inclusive.
select c1 from t1 where substr(c1,1,1) between 'A' and 'M';
Comparison Operators
Impala supports the familiar comparison operators for checking equality and sort order for the column data
types:
=, !=, <>: apply to all types.
<, <=, >, >=: apply to all types; for BOOLEAN, TRUE is considered greater than FALSE.
Alternatives:
The IN and BETWEEN operators provide shorthand notation for expressing combinations of equality, less than,
and greater than comparisons with a single operator.
Because comparing any value to NULL produces NULL rather than TRUE or FALSE, use the IS NULL and IS NOT
NULL operators to check if a value is NULL or not.
IN Operator
The IN operator compares an argument value to a set of values, and returns TRUE if the argument matches any
value in the set. The argument and the set of comparison values must be of compatible types.
Any expression using the IN operator could be rewritten as a series of equality tests connected with OR, but the
IN syntax is often clearer, more concise, and easier for Impala to optimize. For example, with partitioned tables,
queries frequently use IN clauses to filter data by comparing the partition key columns to specific values.
Examples:
-- Using IN is concise and self-documenting.
SELECT * FROM t1 WHERE c1 IN (1,2,10);
-- Equivalent to series of = comparisons ORed together.
SELECT * FROM t1 WHERE c1 = 1 OR c1 = 2 OR c1 = 10;
SELECT c1 AS "starts with vowel" FROM t2 WHERE upper(substr(c1,1,1)) IN
('A','E','I','O','U');
SELECT COUNT(DISTINCT(visitor_id)) FROM web_traffic WHERE month IN
('January','June','July');
IS NULL Operator
The IS NULL operator, and its converse the IS NOT NULL operator, test whether a specified value is NULL.
Because using NULL with any of the other comparison operators such as = or != also returns NULL rather than
TRUE or FALSE, you use a special-purpose comparison operator to check for this special condition.
Usage notes:
In many cases, NULL values indicate some incorrect or incomplete processing during data ingestion or conversion.
You might check whether any values in a column are NULL, and if so take some followup action to fill them in.
With sparse data, often represented in wide tables, it is common for most values to be NULL with only an
occasional non-NULL value. In those cases, you can use the IS NOT NULL operator to identify the rows containing
any data at all for a particular column, regardless of the actual value.
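For example (the table and column names are hypothetical):
-- Count how many rows are missing a value for a column.
select count(*) from employees where email is null;
-- With sparse data, retrieve only the rows that contain any value at all.
select id, optional_notes from wide_table where optional_notes is not null;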
LIKE Operator
A comparison operator for STRING data, with basic wildcard capability using _ to match a single character and
% to match zero or more characters. The argument expression must match the entire string value. Typically, it is
more efficient to put any % wildcard match at the end of the string.
Examples:
select distinct c_last_name from customer where c_last_name like 'Mc%' or c_last_name
like 'Mac%';
select count(c_last_name) from customer where c_last_name like 'M%';
select c_email_address from customer where c_email_address like '%.edu';
-- We can find 4-letter names beginning with 'M' by calling functions...
select distinct c_last_name from customer where length(c_last_name) = 4 and
substr(c_last_name,1,1) = 'M';
-- ...or in a more readable way by matching M followed by exactly 3 characters.
select distinct c_last_name from customer where c_last_name like 'M___';
For a more general kind of search operator using regular expressions, see REGEXP Operator on page 70.
Logical Operators
Logical operators return a BOOLEAN value, based on a binary or unary logical operation between arguments that
are also Booleans. Typically, the argument expressions use comparison operators.
The Impala logical operators are:
AND: A binary operator that returns true if its left-hand and right-hand arguments both evaluate to true,
NULL if either argument is NULL, and false otherwise.
OR: A binary operator that returns true if either of its left-hand and right-hand arguments evaluate to true,
NULL if one argument is NULL and the other is either NULL or false, and false otherwise.
NOT: A unary operator that flips the state of a Boolean expression from true to false, or false to true. If
the argument expression is NULL, the result remains NULL. (When NOT is used this way as a unary logical
operator, it works differently than the IS NOT NULL comparison operator, which returns true when applied
to a NULL.)
Examples:
These examples demonstrate the AND operator:
[localhost:21000] > select true and true;
+---------------+
| true and true |
+---------------+
| true          |
+---------------+
[localhost:21000] > select true and false;
REGEXP Operator
Tests whether a value matches a regular expression. Uses the POSIX regular expression syntax where ^ and $
match the beginning and end of the string, . represents any single character, * represents a sequence of zero
or more items, + represents a sequence of one or more items, ? makes the preceding item optional (zero or one occurrence), and so on.
The regular expression must match the entire value, not just occur somewhere inside it. Use .* at the beginning
and/or the end if you only need to match characters anywhere in the middle. Thus, the ^ and $ atoms are often
redundant, although you might already have them in your expression strings that you reuse from elsewhere.
The RLIKE operator is a synonym for REGEXP.
The | symbol is the alternation operator, typically used within () to match different sequences. The () groups
do not allow backreferences. To retrieve the part of a value matched within a () section, use the
regexp_extract() built-in function.
Note:
In Impala 1.3.1 and higher, the REGEXP and RLIKE operators now match a regular expression string
that occurs anywhere inside the target string, the same as if the regular expression was enclosed on
each side by .*. See REGEXP Operator on page 70 for examples. Previously, these operators only
succeeded when the regular expression matched the entire target string. This change improves
compatibility with the regular expression support for popular database systems. There is no change
to the behavior of the regexp_extract() and regexp_replace() built-in functions.
Examples:
-- Find all customers whose first name starts with 'J', followed by 0 or more of any
character.
select c_first_name, c_last_name from customer where c_first_name regexp '^J.*';
-- Find 'Macdonald', where the first 'a' is optional and the 'D' can be upper- or
lowercase.
-- The ^...$ are required, to match the start and end of the value.
select c_first_name, c_last_name from customer where c_last_name regexp '^Ma?c[Dd]onald$';
RLIKE Operator
Synonym for the REGEXP operator.
Aliases
When you write the names of tables, columns, or column expressions in a query, you can assign an alias at the
same time. Then you can specify the alias rather than the original name when making other references to the
table or column in the same statement. You typically specify aliases that are shorter, easier to remember, or
both than the original names. The aliases are printed in the query header, making them useful for
self-documenting output.
Aliases follow the same rules as identifiers when it comes to case insensitivity. Aliases can be longer than
identifiers (up to the maximum length of a Java string) and can include additional characters such as spaces
and dashes when they are quoted using backtick characters.
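For example, a sketch using both a table alias and column aliases, including a quoted alias that contains a space:
select c.c_first_name as first_name, c.c_last_name as `last name`
  from customer as c
  where c.c_last_name like 'M%';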
Alternatives:
Another way to define different names for the same tables or columns is to create views. See Views on page
74 for details.
Databases
In Impala, a database is a logical container for a group of tables. Each database defines a separate namespace.
Within a database, you can refer to the tables inside it using their unqualified names. Different databases can
contain tables with identical names.
Creating a database is a lightweight operation. There are no database-specific properties to configure. Therefore,
there is no ALTER DATABASE statement.
Typically, you create a separate database for each project or application, to avoid naming conflicts between
tables and to make clear which tables are related to each other.
Each database is physically represented by a directory in HDFS.
There is a special database, named default, where you begin when you connect to Impala. Tables created in
default are physically located one level higher in HDFS than all the user-created databases.
Impala includes another predefined database, _impala_builtins, that serves as the location for the built-in
functions. To see the built-in functions, use a statement like the following:
show functions in _impala_builtins;
show functions in _impala_builtins like '*substring*';
Related statements: CREATE DATABASE Statement on page 87, DROP DATABASE Statement on page 100, USE
Statement on page 138
Functions
Functions let you apply arithmetic, string, or other computations and transformations to Impala data. You
typically use them in SELECT lists and WHERE clauses to filter and format query results so that the result set is
exactly what you want, with no further processing needed on the application side.
Scalar functions return a single result for each input row. See Built-in Functions on page 138.
Aggregate functions combine the results from multiple rows. See Aggregate Functions on page 158.
User-defined functions let you code your own logic. They can be either scalar or aggregate functions. See
User-Defined Functions (UDFs) on page 163.
Related statements: CREATE FUNCTION Statement on page 88, DROP FUNCTION Statement on page 101
Tables
Tables are the primary containers for data in Impala. They have the familiar row and column layout similar to
other database systems, plus some features such as partitioning often associated with higher-end data
warehouse systems.
Logically, each table has a structure based on the definition of its columns, partitions, and other properties.
Physically, each table is associated with a directory in HDFS. The table data consists of all the data files underneath
that directory:
Internal tables, managed by Impala, use directories inside the designated Impala work area.
External tables use arbitrary HDFS directories, where the data files are typically shared between different
Hadoop components.
Large-scale data is usually handled by partitioned tables, where the data files are divided among different
HDFS subdirectories.
Related statements: CREATE TABLE Statement on page 90, DROP TABLE Statement on page 101, ALTER TABLE
Statement on page 79, INSERT Statement on page 105, LOAD DATA Statement on page 114, SELECT Statement
on page 118
Internal Tables
The default kind of table produced by the CREATE TABLE statement is known as an internal table. (Its counterpart
is the external table, produced by the CREATE EXTERNAL TABLE syntax.)
Impala creates a directory in HDFS to hold the data files.
You load data by issuing INSERT statements in impala-shell or by using the LOAD DATA statement in Hive.
When you issue a DROP TABLE statement, Impala physically removes all the data files from the directory.
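For example, a minimal sketch contrasting the two kinds of tables (the table names and HDFS path are hypothetical):
-- Internal table: Impala manages the data directory and removes the files on DROP TABLE.
create table managed_logs (msg string);
-- External table: the data files remain in place when the table is dropped.
create external table raw_logs (msg string)
  location '/user/etl/raw_logs';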
Views
Views are lightweight logical constructs that act as aliases for queries. You can specify a view name in a query
(a SELECT statement or the SELECT portion of an INSERT statement) where you would usually specify a table
name.
A view lets you:
Set up fine-grained security where a user can query some columns from a table but not other columns. See
Controlling Access at the Column Level through Views for details.
Issue complicated queries with compact and simple syntax:
-- Take a complicated reporting query, plug it into a CREATE VIEW statement...
create view v1 as select c1, c2, avg(c3) from t1 group by c1, c2 order by c1 desc limit 10;
-- ... and now you can produce the report with 1 line of code.
select * from v1;
Reduce maintenance, by avoiding the duplication of complicated queries across multiple applications in
multiple languages:
create view v2 as select t1.c1, t1.c2, t2.c3 from t1 join t2 on (t1.id = t2.id);
-- This simple query is safer to embed in reporting applications than the longer
query above.
-- The view definition can remain stable even if the structure of the underlying
tables changes.
select c1, c2, c3 from v2;
Build a new, more refined query on top of the original query by adding new clauses, select-list expressions,
function calls, and so on:
create view average_price_by_category as select category, avg(price) as avg_price
from products group by category;
create view expensive_categories as select category, avg_price from
average_price_by_category order by avg_price desc limit 10000;
create view top_10_expensive_categories as select category, avg_price from
expensive_categories limit 10;
This technique lets you build up several more or less granular variations of the same query, and switch
between them when appropriate.
Set up aliases with intuitive names for tables, columns, result sets from joins, and so on:
-- The original tables might have cryptic names inherited from a legacy system.
create view action_items as select rrptsk as assignee, treq as due_date, dmisc as
notes from vxy_t1_br;
-- You can leave original names for compatibility, build new applications using
more intuitive ones.
select assignee, due_date, notes from action_items;
Avoid coding lengthy subqueries and repeating the same subquery text in many other queries.
The SQL statements that configure views are CREATE VIEW Statement on page 95, ALTER VIEW Statement on
page 83, and DROP VIEW Statement on page 102. You can specify view names when querying data (SELECT
Statement on page 118) and copying data from one table to another (INSERT Statement on page 105). The WITH
clause creates an inline view that exists only for the duration of a single query.
[localhost:21000] > create view trivial as select * from customer;
[localhost:21000] > create view some_columns as select c_first_name, c_last_name,
c_login from customer;
[localhost:21000] > select * from some_columns limit 5;
Query finished, fetching results ...
+--------------+-------------+---------+
| c_first_name | c_last_name | c_login |
+--------------+-------------+---------+
| Javier       | Lewis       |         |
| Amy          | Moses       |         |
| Latisha      | Hamilton    |         |
| Michael      | White       |         |
| Robert       | Moran       |         |
+--------------+-------------+---------+
[localhost:21000] > create view ordered_results as select * from some_columns order by
c_last_name desc, c_first_name desc limit 1000;
[localhost:21000] > select * from ordered_results limit 5;
Query: select * from ordered_results limit 5
Query finished, fetching results ...
+--------------+-------------+---------+
| c_first_name | c_last_name | c_login |
+--------------+-------------+---------+
| Thomas       | Zuniga      |         |
| Sarah        | Zuniga      |         |
| Norma        | Zuniga      |         |
| Lloyd        | Zuniga      |         |
| Lisa         | Zuniga      |         |
+--------------+-------------+---------+
Returned 5 row(s) in 0.48s
The previous example uses descending order for ORDERED_RESULTS because in the sample TPC-DS data, there
are some rows with empty strings for both C_FIRST_NAME and C_LAST_NAME, making the lowest-ordered names
not useful in a sample query.
create view visitors_by_day as select day, count(distinct visitors) as howmany from
web_traffic group by day;
create view top_10_days as select day, howmany from visitors_by_day order by howmany desc limit 10;
select * from top_10_days;
Usage notes:
Prior to Impala 1.4.0, it was not possible to use the CREATE TABLE LIKE view_name syntax. In Impala 1.4.0
and higher, you can create a table with the same column definitions as a view using the CREATE TABLE LIKE view_name syntax.
Related statements: CREATE VIEW Statement on page 95, ALTER VIEW Statement on page 83, DROP VIEW
Statement on page 102
SQL Statements
The Impala SQL dialect supports a range of standard elements, plus some extensions for Big Data use cases
related to data loading and data warehousing.
Note:
In the impala-shell interpreter, a semicolon at the end of each statement is required. Since the
semicolon is not actually part of the SQL syntax, we do not include it in the syntax definition of each
statement, but we do show it in examples intended to be run in impala-shell.
DDL Statements
DDL refers to Data Definition Language, a subset of SQL statements that change the structure of the database
schema in some way, typically by creating, deleting, or modifying schema objects such as databases, tables, and
views. Most Impala DDL statements start with the keywords CREATE, DROP, or ALTER.
The Impala DDL statements are:
After Impala executes a DDL command, information about available tables, columns, views, partitions, and so
on is automatically synchronized between all the Impala nodes in a cluster. (Prior to Impala 1.2, you had to issue
a REFRESH or INVALIDATE METADATA statement manually on the other nodes to make them aware of the
changes.)
If the timing of metadata updates is significant, for example if you use round-robin scheduling where each query
could be issued through a different Impala node, you can enable the SYNC_DDL query option to make the DDL
statement wait until all nodes have been notified about the metadata changes.
Although the INSERT statement is officially classified as a DML (data manipulation language) statement, it also
involves metadata changes that must be broadcast to all Impala nodes, and so is also affected by the SYNC_DDL
query option.
Because the SYNC_DDL query option makes each DDL operation take longer than normal, you might only enable
it before the last DDL operation in a sequence. For example, if you are running a script that issues multiple
DDL operations to set up an entire new schema, add several new partitions, and so on, you might minimize the
performance overhead by enabling the query option only before the last CREATE, DROP, ALTER, or INSERT
statement. The script only finishes when all the relevant metadata changes are recognized by all the Impala
nodes, so you could connect to any node and issue queries through it.
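For example, a sketch of an impala-shell script that enables the option only before the final statement, so that only that statement waits for the metadata changes to propagate:
create table t1 (x int);
create table t2 (s string, n int);
alter table t2 add columns (flag boolean);
set SYNC_DDL=1;
create view v1 as select s, n from t2;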
The classification of DDL, DML, and other statements is not necessarily the same between Impala and Hive.
Impala organizes these statements in a way intended to be familiar to people familiar with relational databases
or data warehouse products. Statements that modify the metastore database, such as COMPUTE STATS, are
classified as DDL. Statements that only query the metastore database, such as SHOW or DESCRIBE, are put into
a separate category of utility statements.
DML Statements
DML refers to Data Manipulation Language, a subset of SQL statements that modify the data stored in tables.
Because Impala focuses on query performance and leverages the append-only nature of HDFS storage, currently
Impala only supports a small set of DML statements:
INSERT Statement on page 105
LOAD DATA Statement on page 114
INSERT in Impala is primarily optimized for inserting large volumes of data in a single statement, to make
effective use of the multi-megabyte HDFS blocks. This is the way in Impala to create new data files. If you intend
to insert one or a few rows at a time, such as using the INSERT ... VALUES syntax, that technique is much
more efficient for Impala tables stored in HBase. See Using Impala to Query HBase Tables on page 265 for details.
LOAD DATA moves existing data files into the directory for an Impala table, making them immediately available
for Impala queries. This is one way in Impala to work with data files produced by other Hadoop components.
(CREATE EXTERNAL TABLE is the other alternative; with external tables, you can query existing data files, while
the files remain in their original location.)
To simulate the effects of an UPDATE or DELETE statement in other database systems, typically you use INSERT
or CREATE TABLE AS SELECT to copy data from one table to another, filtering out or changing the appropriate
rows during the copy operation.
Although Impala currently does not have an UPDATE statement, you can achieve a similar result by using Impala
tables stored in HBase. When you insert a row into an HBase table, and the table already contains a row with
the same value for the key column, the older row is hidden, effectively the same as a single-row UPDATE.
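For example, sketches of both copy techniques using HDFS-backed tables (the table and column names are hypothetical):
-- Simulate DELETE: copy only the rows you want to keep into a new table.
create table active_orders as select * from orders where status != 'cancelled';
-- Simulate UPDATE: change values during the copy into another table.
insert into orders_normalized select id, upper(customer_name), amount from orders;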
Related information:
The other major classifications of SQL statements are data definition language (see DDL Statements on page
78) and queries (see SELECT Statement on page 118).
To rename a table:
ALTER TABLE old_name RENAME TO new_name;
For internal tables, this operation physically renames the directory within HDFS that contains the data files; the
original directory name no longer exists. By qualifying the table names with database names, you can use this
technique to move an internal table (and its associated data directory) from one database to another. For
example:
create database d1;
create database d2;
create database d3;
use d1;
create table mobile (x int);
use d2;
-- Move table from another database to the current one.
alter table d1.mobile rename to mobile;
use d1;
-- Move table from one database to another.
alter table d2.mobile rename to d3.mobile;
To change the physical location where Impala looks for data files associated with a table or partition:
ALTER TABLE table_name [PARTITION (partition_spec)] SET LOCATION
'hdfs_path_of_directory';
The TBLPROPERTIES clause is primarily a way to associate arbitrary user-specified data items with a particular
table.
The SERDEPROPERTIES clause sets up metadata defining how tables are read or written, needed in some cases
by Hive but not used extensively by Impala. You would use this clause primarily to change the delimiter in an
existing text table or partition, by setting the 'serialization.format' and 'field.delim' property values
to the new delimiter character:
-- This table begins life as pipe-separated text format.
create table change_to_csv (s1 string, s2 string) row format delimited fields terminated
by '|';
-- Then we change it to a CSV table.
alter table change_to_csv set SERDEPROPERTIES ('serialization.format'=',',
'field.delim'=',');
insert overwrite change_to_csv values ('stop','go'), ('yes','no');
!hdfs dfs -cat 'hdfs://hostname:8020/data_directory/dbname.db/change_to_csv/data_file';
stop,go
yes,no
Use the DESCRIBE FORMATTED statement to see the current values of these properties for an existing table.
See CREATE TABLE Statement on page 90 for more details about these clauses. See Setting Statistics Manually
through ALTER TABLE on page 213 for an example of using table properties to fine-tune the performance-related
table statistics.
To reorganize columns for a table:
ALTER TABLE table_name ADD COLUMNS (column_spec [, column_spec ...]);
ALTER TABLE table_name REPLACE COLUMNS (column_spec [, column_spec ...]);
ALTER TABLE table_name CHANGE column_name new_name new_type;
ALTER TABLE table_name DROP column_name;
The column_spec is the same as in the CREATE TABLE statement: the column name, then its data type, then
an optional comment. You can add multiple columns at a time. The parentheses are required whether you add
a single column or multiple columns. When you replace columns, all the original column definitions are discarded.
You might use this technique if you receive a new set of data files with different data types or columns in a
different order. (The data files are retained, so if the new columns are incompatible with the old ones, use INSERT
OVERWRITE or LOAD DATA OVERWRITE to replace all the data before issuing any further queries.)
You might use the CHANGE clause to rename a single column, or to treat an existing column as a different type
than before, such as to switch between treating a column as STRING and TIMESTAMP, or between INT and BIGINT.
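For example, the following sketch assumes a hypothetical table T1 whose original definition is a single INT column:
-- Add new columns, keeping the existing one.
alter table t1 add columns (s string, t timestamp);
-- Discard the original column definitions and substitute a new set.
alter table t1 replace columns (x bigint, s string comment 'name field');
-- Rename a column and/or reinterpret its type.
alter table t1 change x id bigint;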
To change the file format that Impala expects data to be in, for a table or partition:
ALTER TABLE table_name [PARTITION (partition_spec)] SET FILEFORMAT { PARQUET | TEXTFILE
| RCFILE | SEQUENCEFILE }
Because this operation only changes the table metadata, you must do any conversion of existing data using
regular Hadoop techniques outside of Impala. Any new data created by the Impala INSERT statement will be in
the new format. You cannot specify the delimiter for Text files; the data files must be comma-delimited.
To set the file format for a single partition, include the PARTITION clause. Specify all the same partitioning
columns for the table, with a constant value for each, to precisely identify the single partition affected by the
statement:
create table p1 (s string) partitioned by (month int, day int);
-- Each ADD PARTITION clause creates a subdirectory in HDFS.
alter table p1 add partition (month=1, day=1);
alter table p1 add partition (month=1, day=2);
alter table p1 add partition (month=2, day=1);
alter table p1 add partition (month=2, day=2);
-- Queries and INSERT statements will read and write files
-- in this format for this specific partition.
alter table p1 partition (month=2, day=2) set fileformat parquet;
To add or drop partitions for a table, the table must already be partitioned (that is, created with a PARTITIONED
BY clause). The partition is a physical directory in HDFS, with a name that encodes a particular column value (the
partition key). The Impala INSERT statement already creates the partition if necessary, so the ALTER TABLE
... ADD PARTITION is primarily useful for importing data by moving or copying existing data files into the
HDFS directory corresponding to a partition. (You can use the LOAD DATA statement to move files into the
partition directory, or ALTER TABLE ... PARTITION (...) SET LOCATION to point a partition at a directory
that already contains data files.)
The DROP PARTITION clause is used to remove the HDFS directory and associated data files for a particular set
of partition key values; for example, if you always analyze the last 3 months worth of data, at the beginning of
each month you might drop the oldest partition that is no longer needed. Removing partitions reduces the
amount of metadata associated with the table and the complexity of calculating the optimal query plan, which
can simplify and speed up queries on partitioned tables, particularly join queries. Here is an example showing
the ADD PARTITION and DROP PARTITION clauses.
-- Create an empty table and define the partitioning scheme.
create table part_t (x int) partitioned by (month int);
-- Create an empty partition into which you could copy data files from some other
source.
alter table part_t add partition (month=1);
-- After changing the underlying data, issue a REFRESH statement to make the data
visible in Impala.
refresh part_t;
-- Later, do the same for the next month.
alter table part_t add partition (month=2);
-- Now you no longer need the older data.
alter table part_t drop partition (month=1);
-- If the table was partitioned by month and year, you would issue a statement like:
-- alter table part_t drop partition (year=2003,month=1);
-- which would require 12 ALTER TABLE statements to remove a year's worth of data.
-- If the data files for subsequent months were in a different file format,
-- you could set a different file format for the new partition as you create it.
alter table part_t add partition (month=3) set fileformat=parquet;
Note:
An alternative way to reorganize a table and its associated data files is to use CREATE TABLE to create
a variation of the original table, then use INSERT to copy the transformed or reordered data to the
new table. The advantage of ALTER TABLE is that it avoids making a duplicate copy of the data files,
allowing you to reorganize huge volumes of data in a space-efficient way using familiar Hadoop
techniques.
Cancellation: Cannot be cancelled.
To see the definition of a view, issue a DESCRIBE FORMATTED statement, which shows the query from the original
CREATE VIEW statement:
[localhost:21000] > create view v1 as select * from t1;
[localhost:21000] > describe formatted v1;
Query finished, fetching results ...
+------------------------------+------------------------------+----------------------+
| name                         | type                         | comment              |
+------------------------------+------------------------------+----------------------+
| # col_name                   | data_type                    | comment              |
|                              | NULL                         | NULL                 |
| x                            | int                          | None                 |
| y                            | int                          | None                 |
...
than for HDFS-backed tables, but that metadata is still used for optimization when HBase tables are involved
in join queries.
Performance considerations: The statistics collected by COMPUTE STATS are used to optimize join queries and
resource-intensive INSERT operations.
Examples:
This example shows two tables, T1 and T2, with a small number of distinct values, linked by a parent-child
relationship between T1.ID and T2.PARENT. T1 is tiny, while T2 has approximately 100K rows. Initially, the
statistics include physical measurements such as the number of files, the total size, and size measurements
for fixed-length columns such as those with the INT type. Unknown values are represented by -1. After running
COMPUTE STATS for each table, much more information is available through the SHOW STATS statements. If you
were running a join query involving both of these tables, you would need statistics for both tables to get the
most effective optimization for the query.
[localhost:21000] > show table stats t1;
Query: show table stats t1
+-------+--------+------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+------+--------+
| -1    | 1      | 33B  | TEXT   |
+-------+--------+------+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] > show table stats t2;
Query: show table stats t2
+-------+--------+----------+--------+
| #Rows | #Files | Size     | Format |
+-------+--------+----------+--------+
| -1    | 28     | 960.00KB | TEXT   |
+-------+--------+----------+--------+
Returned 1 row(s) in 0.01s
[localhost:21000] > show column stats t1;
Query: show column stats t1
+--------+--------+------------------+--------+----------+----------+
After creating a database, your impala-shell session or another impala-shell connected to the same node
can immediately access that database. To access the database through the Impala daemon on a different node,
issue the INVALIDATE METADATA statement first while connected to that other node.
If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can
enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed
metadata has been received by all the Impala nodes. See SYNC_DDL on page 202 for details.
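For example, here is a minimal sketch of enabling that option in impala-shell before issuing a DDL statement; the database name is made up, and the CREATE DATABASE statement does not return until all the Impala nodes have received the new metadata:
[localhost:21000] > set sync_ddl=1;
[localhost:21000] > create database shared_reporting;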
Examples:
create database first;
use first;
create table t1 (x int);
create database second;
use second;
-- Each database has its own namespace for tables.
-- You can reuse the same table names in each database.
create table t1 (s string);
create database temp;
-- You do not have to USE a database after creating it.
-- Just qualify the table name with the name of the database.
create table temp.t2 (x int, y int);
use temp;
create table t3 (s string);
-- You cannot drop a database while it is selected by the USE statement.
drop database temp;
ERROR: AnalysisException: Cannot drop current default database: temp
-- The always-available database 'default' is a convenient one to USE.
use default;
-- Dropping the database is a fast way to drop all the tables within it.
drop database temp;
result when passed the same argument values. Impala might or might not skip some invocations of a UDF
if the result value is already known from a previous call. Therefore, do not rely on the UDF being called a
specific number of times, and do not return different result values based on some external factor such as
the current time, a random number function, or an external data source that could be updated while an Impala
query is in progress.
The names of the function arguments in the UDF are not significant, only their number, positions, and data
types.
You can overload the same function name by creating multiple versions of the function, each with a different
argument signature. For security reasons, you cannot make a UDF with the same name as any built-in
function.
In the UDF code, you represent the function return result as a struct. This struct contains 2 fields. The
first field is a boolean representing whether the value is NULL or not. (When this field is true, the return
value is interpreted as NULL.) The second field is the same type as the specified function return type, and
holds the return value when the function returns something other than NULL.
In the UDF code, you represent the function arguments as an initial pointer to a UDF context structure,
followed by references to zero or more structs, corresponding to each of the arguments. Each struct has
the same 2 fields as with the return value, a boolean field representing whether the argument is NULL, and
a field of the appropriate type holding any non-NULL argument value.
For sample code and build instructions for UDFs, see the sample directory supplied with Impala.
Because the file representing the body of the UDF is stored in HDFS, it is automatically available to all the
Impala nodes. You do not need to manually copy any UDF-related files between servers.
Because Impala currently does not have any ALTER FUNCTION statement, if you need to rename a function,
move it to a different database, or change its signature or other properties, issue a DROP FUNCTION statement
for the original function followed by a CREATE FUNCTION with the desired properties.
Because each UDF is associated with a particular database, either issue a USE statement before doing any
CREATE FUNCTION statements, or specify the name of the function as db_name.function_name.
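For example, here is a minimal sketch of effectively renaming a UDF by dropping it and re-creating it; the database, function, library, and symbol names are all hypothetical:
use udf_demo_db;
drop function my_lower(string);
create function better_lower(string) returns string
  location '/user/cloudera/libudfsample.so' symbol='MyLower';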
If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can
enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed
metadata has been received by all the Impala nodes. See SYNC_DDL on page 202 for details.
Compatibility:
Impala can run UDFs that were created through Hive, as long as they refer to Impala-compatible data types (not
composite or nested column types). Hive can run Java-based UDFs that were created through Impala, but not
Impala UDFs written in C++.
Cancellation: Cannot be cancelled.
Related information:
See User-Defined Functions (UDFs) on page 163 for more background information, usage instructions, and
examples for Impala UDFs.
Note: To clone the structure of a table and transfer data into it in a single operation, use the CREATE
TABLE AS SELECT syntax described in the next subsection.
When you clone the structure of an existing table using the CREATE TABLE ... LIKE syntax, the new table
keeps the same file format as the original one, so you only need to specify the STORED AS clause if you want to
use a different file format, or when specifying a view as the original table. (Creating a table like a view produces
a text table by default.)
Although normally Impala cannot create an HBase table directly, Impala can clone the structure of an existing
HBase table with the CREATE TABLE ... LIKE syntax, preserving the file format and metadata from the original
table.
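For example, the following sketch uses hypothetical table and view names:
-- New empty table with the same columns and file format as the original.
create table census_copy like census;
-- Clone the column definitions but store the new table as Parquet.
create table census_parquet like census stored as parquet;
-- Creating a table like a view produces a text table unless you specify another format.
create table from_view like v1 stored as parquet;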
There are some exceptions to the ability to use CREATE TABLE ... LIKE with an Avro table. For example, you
cannot use this technique for an Avro table that is specified with an Avro schema but no columns. When in
See SELECT Statement on page 118 for details about query syntax for the SELECT portion of a CREATE TABLE
AS SELECT statement.
The newly created table inherits the column names that you select from the original table, which you can override
by specifying column aliases in the query. Any column or table comments from the original table are not carried
over to the new table.
Sorting considerations: Although you can specify an ORDER BY clause in an INSERT ... SELECT statement,
any ORDER BY clause is ignored and the results are not necessarily sorted. An INSERT ... SELECT operation
potentially creates many different data files, prepared on different data nodes, and therefore the notion of the
data being stored in sorted order is impractical.
For example, the following statements show how you can clone all the data in a table, or a subset of the columns
and/or rows, or reorder columns, rename them, or construct them out of expressions:
-- Create new table and copy all data.
CREATE TABLE clone_of_t1 AS SELECT * FROM t1;
-- Same idea as CREATE TABLE LIKE, don't copy any data.
CREATE TABLE empty_clone_of_t1 AS SELECT * FROM t1 WHERE 1=0;
-- Copy some data.
CREATE TABLE subset_of_t1 AS SELECT * FROM t1 WHERE x > 100 AND y LIKE 'A%';
CREATE TABLE summary_of_t1 AS SELECT c1, sum(c2) AS total, avg(c2) AS average FROM t1
GROUP BY c1;
-- Switch file format.
CREATE TABLE parquet_version_of_t1 STORED AS PARQUET AS SELECT * FROM t1;
-- Create tables with different column order, names, or types than the original.
CREATE TABLE some_columns_from_t1 AS SELECT c1, c3, c5 FROM t1;
CREATE TABLE reordered_columns_from_t1 AS SELECT c4, c3, c1, c2 FROM t1;
CREATE TABLE synthesized_columns AS SELECT upper(c1) AS all_caps, c2+c3 AS total,
"California" AS state FROM t1;
The more complicated and hard-to-read the original query, the more benefit there is to simplifying the query
using a view.
To hide the underlying table and column names, to minimize maintenance problems if those names change.
In that case, you re-create the view using the new names, and all queries that use the view rather than the
underlying tables keep running with no changes.
To experiment with optimization techniques and make the optimized queries available to all applications.
For example, if you find a combination of WHERE conditions, join order, join hints, and so on that works the
best for a class of queries, you can establish a view that incorporates the best-performing techniques.
Applications can then make relatively simple queries against the view, without repeating the complicated
and optimized logic over and over. If you later find a better way to optimize the original query, when you
re-create the view, all the applications immediately take advantage of the optimized base query.
To simplify a whole class of related queries, especially complicated queries involving joins between multiple
tables, complicated expressions in the column list, and other SQL syntax that makes the query difficult to
understand and debug. For example, you might create a view that joins several tables, filters using several
WHERE conditions, and selects several columns from the result set. Applications might issue queries against
this view that only vary in their LIMIT, ORDER BY, and similar simple clauses.
For queries that require repeating complicated clauses over and over again, for example in the select list, ORDER
BY, and GROUP BY clauses, you can use the WITH clause as an alternative to creating a view.
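For example, this sketch factors a repeated aggregation out of a query with a WITH clause rather than a view; the table and column names are hypothetical:
with t1_totals as (select c1, sum(c2) as total from t1 group by c1)
select c1, total from t1_totals where total > 1000 order by total desc limit 10;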
If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can
enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed
metadata has been received by all the Impala nodes. See SYNC_DDL on page 202 for details.
Examples:
create view v1 as select * from t1;
create view v2 as select c1, c3, c7 from t1;
create view v3 as select c1, cast(c3 as string) c3, concat(c4,c5) c5, trim(c6) c6,
"Constant" c8 from t1;
create view v4 as select t1.c1, t2.c2 from t1 join t2 on t1.id = t2.id;
create view some_db.v5 as select * from some_other_db.t1;
DESCRIBE Statement
The DESCRIBE statement displays metadata about a table, such as the column names and their data types. Its
syntax is:
DESCRIBE [FORMATTED] table
You can use the abbreviation DESC for the DESCRIBE statement.
The DESCRIBE FORMATTED variation displays additional information, in a format familiar to users of Apache
Hive. The extra information includes low-level details such as whether the table is internal or external, when it
was created, the file format, the location of the data in HDFS, whether the object is a table or a view, and (for
views) the text of the query from the view definition.
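For example, assuming a hypothetical table CENSUS in the current database:
-- Short form, listing just the columns and their types.
desc census;
-- Long form, including the HDFS location, file format, and other low-level details.
describe formatted census;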
Related information:
For other tips about managing and reclaiming Impala disk space, see Managing Disk Space for Impala Data on
page 47.
Cancellation: Cannot be cancelled.
The select_query is a SELECT statement, optionally prefixed by a WITH clause. See SELECT Statement on page
118 for details.
The insert_stmt is an INSERT statement that inserts into or overwrites an existing table. It can use either the
INSERT ... SELECT or INSERT ... VALUES syntax. See INSERT Statement on page 105 for details.
The ctas_stmt is a CREATE TABLE statement using the AS SELECT clause, typically abbreviated as a CTAS
operation. See CREATE TABLE Statement on page 90 for details.
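For example, here is a minimal sketch showing EXPLAIN applied to each kind of statement; the table names are hypothetical:
explain select count(*) from big_table where year = 2014;
explain insert into summary_table select year, count(*) from big_table group by year;
explain create table summary_copy as select * from summary_table;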
Usage notes:
You can interpret the output to judge whether the query is performing efficiently, and adjust the query and/or
the schema if not. For example, you might change the tests in the WHERE clause, add hints to make join operations
more efficient, introduce subqueries, change the order of tables in a join, add or change partitioning for a table,
collect column statistics and/or table statistics in Hive, or any other performance tuning steps.
The EXPLAIN output reminds you if table or column statistics are missing from any table involved in the query.
These statistics are important for optimizing queries involving large tables or multi-table joins. See COMPUTE
STATS Statement on page 84 for how to gather statistics, and How Impala Uses Statistics for Query Optimization
on page 212 for how to use this information for query tuning.
Read the EXPLAIN plan from bottom to top:
The last part of the plan shows the low-level details such as the expected amount of data that will be read,
where you can judge the effectiveness of your partitioning strategy and estimate how long it will take to
scan a table based on total data size and the size of the cluster.
As you work your way up, next you see the operations that will be parallelized and performed on each Impala
node.
At the higher levels, you see how data flows when intermediate result sets are combined and transmitted
from one node to another.
See EXPLAIN_LEVEL on page 194 for details about the EXPLAIN_LEVEL query option, which lets you customize
how much detail to show in the EXPLAIN plan depending on whether you are doing high-level or low-level
tuning, dealing with logical or physical aspects of the query.
If you come from a traditional database background and are not familiar with data warehousing, keep in mind
that Impala is optimized for full table scans across very large tables. The structure and distribution of this data
is typically not suitable for the kind of indexing and single-row lookups that are common in OLTP environments.
Seeing a query scan entirely through a large table is common, not necessarily an indication of an inefficient
query. Of course, if you can reduce the volume of scanned data by orders of magnitude, for example by using a
query that affects only certain partitions within a partitioned table, then you might be able to optimize a query
so that it executes in seconds rather than minutes.
For more information and examples to help you interpret EXPLAIN output, see Using the EXPLAIN Plan for
Performance Tuning on page 224.
Extended EXPLAIN output:
For performance tuning of complex queries, and capacity planning (such as using the admission control and
resource management features), you can enable more detailed and informative output for the EXPLAIN statement.
In the impala-shell interpreter, issue the command SET EXPLAIN_LEVEL=level, where level is an integer
from 0 to 3 or corresponding mnemonic values minimal, standard, extended, or verbose.
These examples show how the extended EXPLAIN output becomes more accurate and informative as statistics
are gathered by the COMPUTE STATS statement. Initially, much of the information about data size and distribution
is marked unavailable. Impala can determine the raw data size, but not the number of rows or number of
distinct values for each column without additional analysis. The COMPUTE STATS statement performs this
analysis, so a subsequent EXPLAIN statement has additional information to use in deciding how to optimize
the distributed query.
[localhost:21000] > set explain_level=extended;
EXPLAIN_LEVEL set to extended
[localhost:21000] > explain select x from t1;
+----------------------------------------------------------+
| Explain String                                            |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=32.00MB VCores=1  |
|                                                           |
| 01:EXCHANGE [PARTITION=UNPARTITIONED]                     |
| |  hosts=1 per-host-mem=unavailable                       |
| |  tuple-ids=0 row-size=4B cardinality=unavailable        |
| |                                                         |
| 00:SCAN HDFS [default.t2, PARTITION=RANDOM]               |
|    partitions=1/1 size=36B                                |
|    table stats: unavailable                               |
|    column stats: unavailable                              |
|    hosts=1 per-host-mem=32.00MB                           |
|    tuple-ids=0 row-size=4B cardinality=unavailable        |
+----------------------------------------------------------+
[localhost:21000] > compute stats t1;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
INSERT Statement
Impala supports inserting into tables and partitions that you create with the Impala CREATE TABLE statement,
or pre-defined tables and partitions created through Hive.
Syntax:
[with_clause]
INSERT { INTO | OVERWRITE } [TABLE] table_name
[(column_list)]
[ PARTITION (partition_clause)]
{
[hint_clause] select_statement
| VALUES (value [, value ...]) [, (value [, value ...]) ...]
}
partition_clause ::= col_name [= constant] [, col_name [= constant] ...]
hint_clause ::= [SHUFFLE] | [NOSHUFFLE]
(Note: the square brackets are part of the syntax.)
Usage notes:
Impala currently supports:
INSERT INTO to append data to a table.
INSERT OVERWRITE to replace the data in a table.
Copy data from another table using a SELECT query. In Impala 1.2.1 and higher, you can combine CREATE
TABLE and INSERT operations into a single step with the CREATE TABLE AS SELECT syntax, which bypasses
the actual INSERT keyword.
An optional WITH clause before the INSERT keyword, to define a subquery referenced in the SELECT portion.
Create one or more new rows using constant expressions through the VALUES clause. (The VALUES clause was
added in Impala 1.0.1.)
Specify the names or order of columns to be inserted, different than the columns of the table being queried
by the INSERT statement. (This feature was added in Impala 1.1.)
An optional hint clause immediately before the SELECT keyword, to fine-tune the behavior when doing an
INSERT ... SELECT operation into partitioned Parquet tables. The hint keywords are [SHUFFLE] and
[NOSHUFFLE], including the square brackets. Inserting into partitioned Parquet tables can be a
resource-intensive operation because it potentially involves many files being written to HDFS simultaneously,
and separate 1 GB memory buffers being allocated to buffer the data for each partition. For usage details,
see Loading Data into Parquet Tables on page 247.
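For example, here is a minimal sketch of supplying a hint for an INSERT ... SELECT into a partitioned Parquet table; the table names are hypothetical:
insert into sales_parquet partition (year, month)
  [SHUFFLE]
  select amount, region, year, month from sales_staging;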
With the INSERT INTO TABLE syntax, each new set of inserted rows is appended to any existing data in the
table. This is how you would record small amounts of data that arrive continuously, or ingest new batches of
data alongside the existing data. For example, after running 2 INSERT INTO TABLE statements with 5 rows
each, the table contains 10 rows total:
[localhost:21000] > insert into table text_table select * from default.tab1;
Inserted 5 rows in 0.41s
[localhost:21000] > insert into table text_table select * from default.tab1;
Inserted 5 rows in 0.46s
With the INSERT OVERWRITE TABLE syntax, each new set of inserted rows replaces any existing data in the
table. This is how you load data to query in a data warehousing scenario where you analyze just the data for a
particular day, quarter, and so on, discarding the previous data each time. You might keep the entire set of data
in one raw table, and transfer and transform certain rows into a more compact and efficient form to perform
intensive analysis on that subset.
For example, here we insert 5 rows into a table using the INSERT INTO clause, then replace the data by inserting
3 rows with the INSERT OVERWRITE clause. Afterward, the table only contains the 3 rows from the final INSERT
statement.
[localhost:21000] > insert into table parquet_table select * from default.tab1;
Inserted 5 rows in 0.35s
[localhost:21000] > insert overwrite table parquet_table select * from default.tab1
limit 3;
Inserted 3 rows in 0.43s
[localhost:21000] > select count(*) from parquet_table;
+----------+
| count(*) |
+----------+
| 3        |
+----------+
Returned 1 row(s) in 0.43s
The VALUES clause lets you insert one or more rows by specifying constant values for all the columns. The
number, types, and order of the expressions must match the table definition.
Note: The INSERT ... VALUES technique is not suitable for loading large quantities of data into
HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a
separate data file. Use it for setting up small dimension tables or tiny amounts of data for
experimenting with SQL syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark
tests for load operations. Do not run scripts with thousands of INSERT ... VALUES statements that
insert a single row each time. If you do run INSERT ... VALUES operations to load data into a staging
table as one stage in an ETL pipeline, include multiple row values if possible within each VALUES
clause, and use a separate database to make cleanup easier if the operation does produce many tiny
files.
The following example shows how to insert one row or multiple rows, with expressions of different types, using
literal values, expressions, and function return values:
create table val_test_1 (c1 int, c2 float, c3 string, c4 boolean, c5 timestamp);
insert into val_test_1 values (100, 99.9/10, 'abc', true, now());
create table val_test_2 (id int, token string);
insert overwrite val_test_2 values (1, 'a'), (2, 'b'), (-1, 'xyzzy');
These examples show the type of not implemented error that you see when attempting to insert data into a
table with a file format that Impala currently does not write to:
DROP TABLE IF EXISTS sequence_table;
CREATE TABLE sequence_table
( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP )
STORED AS SEQUENCEFILE;
DROP TABLE IF EXISTS rc_table;
CREATE TABLE rc_table
Inserting data into partitioned tables requires slightly different syntax that divides the partitioning columns
from the others:
create table t1 (i int) partitioned by (x int, y string);
-- Select an INT column from another table.
-- All inserted rows will have the same x and y values, as specified in the INSERT
statement.
-- This technique of specifying all the partition key values is known as static
partitioning.
insert into t1 partition(x=10, y='a') select c1 from some_other_table;
-- Select two INT columns from another table.
-- All inserted rows will have the same y value, as specified in the INSERT statement.
-- Values from c2 go into t1.x.
-- Any partitioning columns whose value is not specified are filled in
-- from the columns specified last in the SELECT list.
-- This technique of omitting some partition key values is known as dynamic partitioning.
insert into t1 partition(x, y='b') select c1, c2 from some_other_table;
-- Select an INT and a STRING column from another table.
-- All inserted rows will have the same x value, as specified in the INSERT statement.
-- Values from c3 go into t1.y.
insert into t1 partition(x=20, y) select c1, c3 from some_other_table;
The following example shows how you can copy the data in all the columns from one table to another, copy the
data from only some columns, or specify the columns in the select list in a different order than they actually
appear in the table:
-- Start with 2 identical tables.
create table t1 (c1 int, c2 int);
create table t2 like t1;
-- If there is no () part after the destination table name,
-- all columns must be specified, either as * or by name.
insert into t2 select * from t1;
insert into t2 select c1, c2 from t1;
Sorting considerations: Although you can specify an ORDER BY clause in an INSERT ... SELECT statement,
any ORDER BY clause is ignored and the results are not necessarily sorted. An INSERT ... SELECT operation
potentially creates many different data files, prepared on different data nodes, and therefore the notion of the
data being stored in sorted order is impractical.
Concurrency considerations: Each INSERT operation creates new data files with unique names, so you can run
multiple INSERT INTO statements simultaneously without filename conflicts. While data is being inserted into
an Impala table, the data is staged temporarily in a subdirectory inside the data directory; during this period,
you cannot issue queries against that table in Hive. If an INSERT operation fails, the temporary data file and the
subdirectory could be left behind in the data directory. If so, remove the relevant subdirectory and any data files
it contains manually, by issuing an hdfs dfs -rm -r command, specifying the full path of the work subdirectory,
whose name ends in _dir.
VALUES Clause
The VALUES clause is a general-purpose way to specify the columns of one or more rows, typically within an
INSERT statement.
Note: The INSERT ... VALUES technique is not suitable for loading large quantities of data into
HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a
separate data file. Use it for setting up small dimension tables or tiny amounts of data for
experimenting with SQL syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark
tests for load operations. Do not run scripts with thousands of INSERT ... VALUES statements that
insert a single row each time. If you do run INSERT ... VALUES operations to load data into a staging
table as one stage in an ETL pipeline, include multiple row values if possible within each VALUES
clause, and use a separate database to make cleanup easier if the operation does produce many tiny
files.
The following examples illustrate:
How to insert a single row using a VALUES clause.
How to insert multiple rows using a VALUES clause.
How the row or rows from a VALUES clause can be appended to a table through INSERT INTO, or replace the
contents of the table through INSERT OVERWRITE.
How the entries in a VALUES clause can be literals, function results, or any other kind of expression. See
Literals on page 63 for the notation to use for literal values, especially String Literals on page 63 for quoting
and escaping conventions for strings. See SQL Operators on page 65 and Built-in Functions on page 138 for
other things you can include in expressions with the VALUES clause.
[localhost:21000] > describe val_example;
Query: describe val_example
Query finished, fetching results ...
+-------+---------+---------+
| name  | type    | comment |
+-------+---------+---------+
| id    | int     |         |
| col_1 | boolean |         |
| col_2 | double  |         |
+-------+---------+---------+
[localhost:21000] > insert into val_example values (1,true,100.0);
Inserted 1 rows in 0.30s
[localhost:21000] > select * from val_example;
+----+-------+-------+
| id | col_1 | col_2 |
+----+-------+-------+
| 1  | true  | 100   |
+----+-------+-------+
When used in an INSERT statement, the Impala VALUES clause can specify some or all of the columns in the
destination table, and the columns can be specified in a different order than they actually appear in the table.
To specify a different set or order of columns than in the table, use the syntax:
INSERT INTO destination
(col_x, col_y, col_z)
VALUES
(val_x, val_y, val_z);
Any columns in the table that are not listed in the INSERT statement are set to NULL.
To use a VALUES clause like a table in other statements, wrap it in parentheses and use AS clauses to specify
aliases for the entire object and any columns you need to refer to:
[localhost:21000] > select * from (values(4,5,6),(7,8,9)) as t;
+---+---+---+
| 4 | 5 | 6 |
+---+---+---+
| 4 | 5 | 6 |
| 7 | 8 | 9 |
+---+---+---+
[localhost:21000] > select * from (values(1 as c1, true as c2, 'abc' as
c3),(100,false,'xyz')) as t;
+-----+-------+-----+
| c1  | c2    | c3  |
+-----+-------+-----+
| 1   | true  | abc |
| 100 | false | xyz |
+-----+-------+-----+
For example, you might use a tiny table constructed like this from constant literals or function return values as
part of a longer statement involving joins or UNION ALL.
HBase considerations:
You can use the INSERT statement with HBase tables as follows:
You can insert a single row or a small set of rows into an HBase table with the INSERT ... VALUES syntax.
This is a good use case for HBase tables with Impala, because HBase tables are not subject to the same kind
of fragmentation from many small insert operations as HDFS tables are.
You can insert any number of rows at once into an HBase table using the INSERT ... SELECT syntax.
If more than one inserted row has the same value for the HBase key column, only the last inserted row with
that value is visible to Impala queries. You can take advantage of this fact with INSERT ... VALUES
statements to effectively update rows one at a time, by inserting new rows with the same key values as
existing rows. Be aware that after an INSERT ... SELECT operation copying from an HDFS table, the HBase
table might contain fewer rows than were inserted, if the key column in the source table contained duplicate
values.
You cannot INSERT OVERWRITE into an HBase table. New rows are always appended.
When you create an Impala or Hive table that maps to an HBase table, the column order you specify with
the INSERT statement might be different than the order you declare with the CREATE TABLE statement.
when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive
operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only
loads the block location data for newly added data files, making it a less expensive operation overall. If data was
altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA
to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE
METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is
optimized for the common use case of adding new data files to an existing table, thus the table name argument
is now required.
The syntax for the INVALIDATE METADATA command is:
INVALIDATE METADATA [table_name]
By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for that
one table is flushed. Even for a single table, INVALIDATE METADATA is more expensive than REFRESH, so prefer
REFRESH in the common case where you add new data files for an existing table.
A metadata update for an impalad instance is required if:
A metadata change occurs.
and the change is made from another impalad instance in your cluster, or through Hive.
and the change is made to a database to which clients such as the Impala shell or ODBC directly connect.
A metadata update for an Impala node is not required when you issue queries from the same Impala node where
you ran ALTER TABLE, INSERT, or other table-modifying statement.
Database and table metadata is typically modified by:
Hive - through ALTER, CREATE, DROP or INSERT operations.
Impalad - through CREATE TABLE, ALTER TABLE, and INSERT operations.
INVALIDATE METADATA causes the metadata for that table to be marked as stale, and reloaded the next time
the table is referenced. For a huge table, that process could take a noticeable amount of time; thus you might
prefer to use REFRESH where practical, to avoid an unpredictable delay later, for example if the next reference
to the table is during a benchmark test.
The following example shows how you might use the INVALIDATE METADATA statement after creating new
tables (such as SequenceFile or HBase tables) through the Hive shell. Before the INVALIDATE METADATA
statement was issued, Impala would give a table not found error if you tried to refer to those table names.
The DESCRIBE statements cause the latest metadata to be immediately loaded for the tables, avoiding a delay
the next time those tables are queried.
[impalad-host:21000] > invalidate metadata;
[impalad-host:21000] > describe t1;
...
[impalad-host:21000] > describe t2;
...
For more examples of using REFRESH and INVALIDATE METADATA with a combination of Impala and Hive
operations, see Switching Back and Forth Between Impala and Hive on page 30.
If you need to ensure that the metadata is up-to-date when you start an impala-shell session, run
impala-shell with the -r or --refresh_after_connect command-line option. Because this operation adds
a delay to the next query against each table, potentially expensive for large tables with many partitions, try to
avoid using this option for day-to-day operations in a production environment.
HDFS considerations:
By default, the INVALIDATE METADATA command checks HDFS permissions of the underlying data files and
directories, caching this information so that a statement can be cancelled immediately if for example the impala
user does not have permission to write to the data directory for the table. (This checking does not apply if you
have set the catalogd configuration option --load_catalog_in_background=false.) Impala reports any
lack of write permissions as an INFO message in the log file, in case that represents an oversight. If you change
HDFS permissions to make data readable or writeable by the Impala user, issue another INVALIDATE METADATA
to make Impala aware of the change.
Examples:
This example illustrates creating a new database and new table in Hive, then doing an INVALIDATE METADATA
statement in Impala using the fully qualified table name, after which both the new table and the new database
are visible to Impala. The ability to specify INVALIDATE METADATA table_name for a table created in Hive is a
new capability in Impala 1.2.4. In earlier releases, that statement would have returned an error indicating an
unknown table, requiring you to do INVALIDATE METADATA with no table name, a more expensive operation
that reloaded metadata for all tables and databases.
$ hive
hive> create database new_db_from_hive;
OK
Time taken: 4.118 seconds
hive> create table new_db_from_hive.new_table_from_hive (x int);
OK
Time taken: 0.618 seconds
hive> quit;
$ impala-shell
[localhost:21000] > show databases like 'new*';
[localhost:21000] > refresh new_db_from_hive.new_table_from_hive;
ERROR: AnalysisException: Database does not exist: new_db_from_hive
[localhost:21000] > invalidate metadata new_db_from_hive.new_table_from_hive;
[localhost:21000] > show databases like 'new*';
+--------------------+
| name               |
+--------------------+
| new_db_from_hive   |
+--------------------+
Next, we create a table and load an initial set of data into it. Remember, unless you specify a STORED AS clause,
Impala tables default to TEXTFILE format with Ctrl-A (hex 01) as the field delimiter. This example uses a
single-column table, so the delimiter is not significant. For large-scale ETL jobs, you would typically use binary
format data files such as Parquet or Avro, and load them into Impala tables that use the corresponding file
format.
[localhost:21000] > create table t1 (s string);
[localhost:21000] > load data inpath '/user/cloudera/thousand_strings.txt' into table
t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary
|
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.61s
[kilo2-202-961.cs1cloud.internal:21000] > select count(*) from t1;
Query finished, fetching results ...
+------+
| _c0 |
+------+
| 1000 |
+------+
Returned 1 row(s) in 0.67s
[localhost:21000] > load data inpath '/user/cloudera/thousand_strings.txt' into table
t1;
ERROR: AnalysisException: INPATH location '/user/cloudera/thousand_strings.txt' does
not exist.
As indicated by the message at the end of the previous example, the data file was moved from its original
location. The following example illustrates how the data file was moved into the Impala data directory for the
destination table, keeping its original filename:
$ hdfs dfs -ls /user/hive/warehouse/load_data_testing.db/t1
Found 1 items
-rw-r--r--   1 cloudera cloudera      13926 2013-06-26 15:40 /user/hive/warehouse/load_data_testing.db/t1/thousand_strings.txt
The following example demonstrates the difference between the INTO TABLE and OVERWRITE TABLE clauses.
The table already contains 1000 rows. After issuing the LOAD DATA statement with the INTO TABLE clause, the
table contains 100 more rows, for a total of 1100. After issuing the LOAD DATA statement with the OVERWRITE
INTO TABLE clause, the former contents are gone, and now the table only contains the 10 rows from the
just-loaded data file.
[localhost:21000] > load data inpath '/user/cloudera/hundred_strings.txt' into table
t1;
Query finished, fetching results ...
REFRESH Statement
To accurately respond to queries, the Impala node that acts as the coordinator (the node to which you are
connected through impala-shell, JDBC, or ODBC) must have current metadata about those databases and
tables that are referenced in Impala queries. If you are not familiar with the way Impala uses metadata and how
it shares the same metastore database as Hive, see Overview of Impala Metadata and the Metastore on page
16 for background information.
Use the REFRESH statement to load the latest metastore metadata and block location data for a particular table
in these scenarios:
After loading new data files into the HDFS data directory for the table. (Once you have set up an ETL pipeline
to bring data into Impala on a regular basis, this is typically the most frequent reason why metadata needs
to be refreshed.)
After issuing ALTER TABLE, INSERT, LOAD DATA, or other table-modifying SQL statement in Hive.
You only need to issue the REFRESH statement on the node to which you connect to issue queries. The coordinator
node divides the work among all the Impala nodes in a cluster, and sends read requests for the correct HDFS
blocks without relying on the metadata on the other nodes.
REFRESH reloads the metadata for the table from the metastore database, and does an incremental reload of
the low-level block location data to account for any new data files added to the HDFS data directory for the
table. It is a low-overhead, single-table operation, specifically tuned for the common scenario where new data
files are added to HDFS.
The syntax for the REFRESH command is:
REFRESH table_name
Only the metadata for the specified table is flushed. The table must already exist and be known to Impala, either
because the CREATE TABLE statement was run in Impala rather than Hive, or because a previous INVALIDATE
METADATA statement caused Impala to reload its entire metadata catalog.
when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive
operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only
loads the block location data for newly added data files, making it a less expensive operation overall. If data was
altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA
to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE
METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is
optimized for the common use case of adding new data files to an existing table, thus the table name argument
is now required.
A metadata update for an impalad instance is required if:
A metadata change occurs.
and the change is made through Hive.
and the change is made to a database to which clients such as the Impala shell or ODBC directly connect.
A metadata update for an Impala node is not required after you run ALTER TABLE, INSERT, or other
table-modifying statement in Impala rather than Hive. Impala handles the metadata synchronization automatically
through the catalog service.
Database and table metadata is typically modified by:
Hive - through ALTER, CREATE, DROP or INSERT operations.
Impalad - through CREATE TABLE, ALTER TABLE, and INSERT operations. In Impala 1.2 and higher, such
changes are propagated to all Impala nodes by the Impala catalog service.
REFRESH causes the metadata for that table to be immediately reloaded. For a huge table, that process could
take a noticeable amount of time; but doing the refresh up front avoids an unpredictable delay later, for example
if the next reference to the table is during a benchmark test.
If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can
enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed
metadata has been received by all the Impala nodes. See SYNC_DDL on page 202 for details.
Examples:
The following example shows how you might use the REFRESH statement after manually adding new HDFS data
files to the Impala data directory for a table:
[impalad-host:21000] > refresh t1;
[impalad-host:21000] > refresh t2;
[impalad-host:21000] > select * from t1;
...
For more examples of using REFRESH and INVALIDATE METADATA with a combination of Impala and Hive
operations, see Switching Back and Forth Between Impala and Hive on page 30.
Related impalad options:
In Impala 1.0, the -r option of impala-shell issued REFRESH to reload metadata for all tables.
In Impala 1.1 and higher, this option issues INVALIDATE METADATA because REFRESH now requires a table name
parameter. Due to the expense of reloading the metadata for all tables, the impala-shell -r option is not
recommended for day-to-day use in a production environment.
In Impala 1.2 and higher, the -r option is needed even less frequently, because metadata changes caused by
SQL statements in Impala are automatically broadcast to all Impala nodes.
HDFS considerations:
The REFRESH command checks HDFS permissions of the underlying data files and directories, caching this
information so that a statement can be cancelled immediately if for example the impala user does not have
permission to write to the data directory for the table. Impala reports any lack of write permissions as an INFO
message in the log file, in case that represents an oversight. If you change HDFS permissions to make data
readable or writeable by the Impala user, issue another REFRESH to make Impala aware of the change.
Important: After adding or replacing data in a table used in performance-critical queries, issue a
COMPUTE STATS statement to make sure all statistics are up-to-date. Consider updating statistics
for a table after any INSERT, LOAD DATA, or CREATE TABLE AS SELECT statement in Impala, or after
loading data through Hive and doing a REFRESH table_name in Impala. This technique is especially
important for tables that are very large, used in join queries, or both.
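For example, after loading new data files through Hive, a typical follow-up sequence in Impala might look like this sketch (the table name is hypothetical):
refresh sales_fact;
compute stats sales_fact;
-- Confirm that the statistics are now populated.
show table stats sales_fact;
show column stats sales_fact;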
SELECT Statement
The SELECT statement performs queries, retrieving data from one or more tables and producing result sets
consisting of rows and columns.
The Impala INSERT statement also typically ends with a SELECT statement, to define data to copy from one
table to another.
Impala SELECT queries support:
SQL data types: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, TIMESTAMP, STRING.
An optional WITH clause before the SELECT keyword, to define a subquery whose name or column names
can be referenced from later in the main query. This clause lets you abstract repeated clauses, such as
aggregation functions, that are referenced multiple times in the same query.
DISTINCT clause per query. See DISTINCT Operator on page 134 for details.
Subqueries in a FROM clause.
WHERE, GROUP BY, HAVING clauses.
ORDER BY. Prior to Impala 1.4.0, Impala required that queries using an ORDER BY clause also include a LIMIT
clause. In Impala 1.4.0 and higher, this restriction is lifted; sort operations that would exceed the Impala
memory limit automatically use a temporary disk work area to perform the sort.
Impala supports a wide variety of JOIN clauses. Left, right, semi, full, and outer joins are supported in all
Impala versions. The CROSS JOIN operator is available in Impala 1.2.2 and higher. During performance tuning,
you can override the reordering of join clauses that Impala does internally by including the keyword
STRAIGHT_JOIN immediately after the SELECT keyword.
See Joins on page 119 for details and examples of join queries.
UNION ALL.
LIMIT.
External tables.
Relational operators such as greater than, less than, or equal to.
Arithmetic operators such as addition or subtraction.
Logical/Boolean operators AND, OR, and NOT. Impala does not support the corresponding symbols &&, ||, and
!.
Common SQL built-in functions such as COUNT, SUM, CAST, LIKE, IN, BETWEEN, and COALESCE. Impala specifically
supports built-ins described in Built-in Functions on page 138.
Cancellation: Can be cancelled. To cancel this statement, use Ctrl-C from the impala-shell interpreter, the
Cancel button from the Watch page in Hue, Actions > Cancel from the Queries list in Cloudera Manager, or Cancel
from the list of in-flight queries (for a particular node) on the Queries tab in the Impala web UI (port 25000).
Joins
A join query is one that combines data from two or more tables, and returns a result set containing items from
some or all of those tables.
Syntax:
Impala supports a wide variety of JOIN clauses. Left, right, semi, full, and outer joins are supported in all Impala
versions. The CROSS JOIN operator is available in Impala 1.2.2 and higher. During performance tuning, you can
override the reordering of join clauses that Impala does internally by including the keyword STRAIGHT_JOIN
immediately after the SELECT keyword.
SELECT select_list FROM
table_or_subquery1 [INNER] JOIN table_or_subquery2 |
table_or_subquery1 {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN
table_or_subquery2 |
table_or_subquery1 LEFT SEMI JOIN table_or_subquery2
[ ON col1 = col2 [AND col3 = col4 ...] |
USING (col1 [, col2 ...]) ]
[other_join_clause ...]
[ WHERE where_clauses ]
SELECT select_list FROM
table_or_subquery1, table_or_subquery2 [, table_or_subquery3 ...]
[other_join_clause ...]
WHERE
col1 = col2 [AND col3 = col4 ...]
SELECT select_list FROM
table_or_subquery1 CROSS JOIN table_or_subquery2
[other_join_clause ...]
[ WHERE where_clauses ]
The ON clause is a general way to compare columns across the two tables, even if the column names are different.
The USING clause is a shorthand notation for specifying the join columns, when the column names are the same
in both tables. You can code equivalent WHERE clauses that compare the columns, instead of ON or USING clauses,
but that practice is not recommended because mixing the join comparisons with other filtering clauses is typically
less readable and harder to maintain.
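For example, the following two queries are equivalent when the join column has the same name in both tables; the table and column names are hypothetical:
select t1.c1, t2.c2 from t1 join t2 on (t1.id = t2.id);
select t1.c1, t2.c2 from t1 join t2 using (id);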
Self-joins:
Impala can do self-joins, for example to join on two different columns in the same table to represent parent-child
relationships or other tree-structured data. There is no explicit syntax for this; just use the same table name
for both the left-hand and right-hand table, and assign different table aliases to use when referring to the fully
qualified column names:
-- Combine fields from both parent and child rows.
SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id
= rhs.parent;
Cartesian joins:
To avoid producing huge result sets by mistake, Impala does not allow Cartesian joins of the form:
SELECT ... FROM t1 JOIN t2;
SELECT ... FROM t1, t2;
If you intend to join the tables based on common values, add ON or WHERE clauses to compare columns across
the tables. If you truly intend to do a Cartesian join, use the CROSS JOIN keyword as the join operator. The CROSS
JOIN form does not use any ON clause, because it produces a result set with all combinations of rows from the
left-hand and right-hand tables. The result set can still be filtered by subsequent WHERE clauses. For example:
SELECT ... FROM t1 CROSS JOIN t2;
SELECT ... FROM t1 CROSS JOIN t2 WHERE tests_on_non_join_columns;
An outer join retrieves all rows from the left-hand table, or the right-hand table, or both; wherever there is no
matching data in the table on the other side of the join, the corresponding columns in the result set are set to
NULL. To perform an outer join, include the OUTER keyword in the join operator, along with either LEFT, RIGHT,
or FULL:
SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;
For outer joins, Impala requires SQL-92 syntax; that is, the JOIN keyword instead of comma-separated table
names. Impala does not support vendor extensions such as (+) or *= notation for doing outer joins with SQL-89
query syntax.
Semi-joins:
Semi-joins are a relatively rarely used variation. With the left semi-join (the only kind of semi-join available with
Impala), only data from the left-hand table is returned, for rows where there is matching data in the right-hand
table, based on comparisons between join columns in ON or WHERE clauses. Only one instance of each row from
the left-hand table is returned, regardless of how many matching rows exist in the right-hand table.
SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;
Note:
Performance for join queries is a crucial aspect for Impala, because complex join queries are
resource-intensive operations. An efficient join query produces much less network traffic and CPU
overhead than an inefficient one. For best results:
Make sure that both table and column statistics are available for all the tables involved in a join
query, and especially for the columns referenced in any join conditions. Use SHOW TABLE STATS
table_name and SHOW COLUMN STATS table_name to check.
If table or column statistics are not available, join the largest table first. You can check the existence
of statistics with the SHOW TABLE STATS table_name and SHOW COLUMN STATS table_name
statements. In Impala 1.2.2 and higher, use the Impala COMPUTE STATS statement to collect
statistics at both the table and column levels, and keep the statistics up to date after any
substantial INSERT or LOAD DATA operation.
If table or column statistics are not available, join subsequent tables according to which table has
the most selective filter, based on overall size and WHERE clauses. Joining the table with the most
selective filter results in the fewest number of rows being returned.
For more information and examples of performance for join queries, see Performance Considerations
for Join Queries on page 206.
To control the result set from a join query, name the corresponding columns from both tables in an ON or USING
clause, or code equality comparisons for those columns in the WHERE clause.
[localhost:21000] > select c_last_name, ca_city from customer join customer_address
where c_customer_sk = ca_address_sk;
+-------------+-----------------+
| c_last_name | ca_city         |
+-------------+-----------------+
| Lewis       | Fairfield       |
| Moses       | Fairview        |
...
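The same kind of inner join can be written several equivalent ways. A brief sketch, using hypothetical tables T1
and T2 that share an ID column:
-- All three forms produce the same result set.
SELECT t1.c1, t2.c2 FROM t1 JOIN t2 ON t1.id = t2.id;
SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id);
SELECT t1.c1, t2.c2 FROM t1, t2 WHERE t1.id = t2.id;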
One potential downside of joins is the possibility of excess resource usage in poorly constructed queries. Impala
imposes restrictions on join queries to guard against such issues. To minimize the chance of runaway queries
on large data sets, Impala requires every join query to contain at least one equality predicate between the
columns of the various tables. For example, if T1 contains 1000 rows and T2 contains 1,000,000 rows, a query
SELECT columns FROM t1 JOIN t2 could return up to 1 billion rows (1000 * 1,000,000); Impala requires that
the query include a clause such as ON t1.c1 = t2.c2 or WHERE t1.c1 = t2.c2.
Because the result set can still be large even with equality clauses, as the previous example shows, you might
use a LIMIT clause to return a subset of the results:
[localhost:21000] > select c_last_name, ca_city from customer, customer_address where
c_customer_sk = ca_address_sk limit 10;
+-------------+-----------------+
| c_last_name | ca_city         |
+-------------+-----------------+
| Lewis       | Fairfield       |
| Moses       | Fairview        |
| Hamilton    | Pleasant Valley |
| White       | Oak Ridge       |
| Moran       | Glendale        |
| Sharp       | Lakeview        |
| Wiles       | Farmington      |
| Shipman     | Union           |
| Gilbert     | New Hope        |
| Brunson     | Martinsville    |
+-------------+-----------------+
Returned 10 row(s) in 0.63s
Or you might use additional comparison operators or aggregation functions to condense a large result set into
a smaller set of values:
[localhost:21000] > -- Find the names of customers who live in one particular town.
[localhost:21000] > select distinct c_last_name from customer, customer_address where
c_customer_sk = ca_address_sk
and ca_city = "Green Acres";
+---------------+
| c_last_name   |
+---------------+
| Hensley       |
| Pearson       |
| Mayer         |
| Montgomery    |
| Ricks         |
...
| Barrett       |
| Price         |
| Hill          |
| Hansen        |
| Meeks         |
+---------------+
Returned 332 row(s) in 0.97s
[localhost:21000] > -- See how many different customers in this town have names starting
with "A".
[localhost:21000] > select count(distinct c_last_name) from customer, customer_address
where
c_customer_sk = ca_address_sk
and ca_city = "Green Acres" and c_last_name like "A%";
Because a join query can involve reading large amounts of data from disk, sending large amounts of data across
the network, and loading large amounts of data into memory to do the comparisons and filtering, you might do
benchmarking, performance analysis, and query tuning to find the most efficient join queries for your data set,
hardware capacity, network configuration, and cluster workload.
The two categories of joins in Impala are known as partitioned joins and broadcast joins. If inaccurate table or
column statistics, or some quirk of the data distribution, causes Impala to choose the wrong mechanism for a
particular join, consider using query hints as a temporary workaround. For details, see Hints on page 133.
See these tutorials for examples of different kinds of joins:
Cross Joins and Cartesian Products with the CROSS JOIN Operator on page 31
ORDER BY Clause
The familiar ORDER BY clause of a SELECT statement sorts the result set based on the values from one or more
columns.
For distributed queries, this is a relatively expensive operation, because the entire result set must be produced
and transferred to one node before the sorting can happen. This can require more memory capacity than a query
without ORDER BY. Even if the query takes approximately the same time to finish with or without the ORDER
BY clause, subjectively it can appear slower because no results are available until all processing is finished, rather
than results coming back gradually as rows matching the WHERE clause are found. Therefore, if you only need
the first N results from the sorted result set, also include the LIMIT clause, which reduces network overhead
and the memory requirement on the coordinator node.
Note:
In Impala 1.4.0 and higher, the LIMIT clause is now optional (rather than required) for queries that
use the ORDER BY clause. Impala automatically uses a temporary disk work area to perform the sort
if the sort operation would otherwise exceed the Impala memory limit for a particular data node.
Syntax:
The full syntax for the ORDER BY clause is:
ORDER BY col1 [, col2 ...] [ASC | DESC] [NULLS FIRST | NULLS LAST]
The default sort order (the same as using the ASC keyword) puts the smallest values at the start of the result
set, and the largest values at the end. Specifying the DESC keyword reverses that order.
See NULL on page 64 for details about how NULL values are positioned in the sorted result set, and how to use
the NULLS FIRST and NULLS LAST clauses. (The sort position for NULL values in ORDER BY ... DESC queries
is changed in Impala 1.2.1 and higher to be more standards-compliant, and the NULLS FIRST and NULLS LAST
keywords are new in Impala 1.2.1.)
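For example, a brief sketch using a hypothetical GAMES table whose SCORE column can contain NULL:
-- Highest scores first; in Impala 1.2.1 and higher, NULL values sort first for DESC queries by default.
SELECT name, score FROM games ORDER BY score DESC;
-- Push the NULL scores to the end of the result set instead.
SELECT name, score FROM games ORDER BY score DESC NULLS LAST;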
Prior to Impala 1.4.0, Impala required any query including an ORDER BY clause to also use a LIMIT clause. In
Impala 1.4.0 and higher, the LIMIT clause is optional for ORDER BY queries. In cases where sorting a huge result
set requires enough memory to exceed the Impala memory limit for a particular node, Impala automatically uses
a temporary disk work area to perform the sort operation.
Usage notes:
Although you can combine ORDER BY with LIMIT and OFFSET to simulate paged results, it is relatively
inefficient to issue multiple queries like this against the large tables typically used with Impala:
SELECT page_title as "Page 1 of search results", page_url FROM search_content
WHERE LOWER(page_title) LIKE '%game%'
ORDER BY page_title LIMIT 10 OFFSET 0;
SELECT page_title as "Page 2 of search results", page_url FROM search_content
WHERE LOWER(page_title) LIKE '%game%'
ORDER BY page_title LIMIT 10 OFFSET 10;
SELECT page_title as "Page 3 of search results", page_url FROM search_content
WHERE LOWER(page_title) LIKE '%game%'
ORDER BY page_title LIMIT 10 OFFSET 20;
Internal details:
Impala sorts the intermediate results of an ORDER BY clause in memory whenever practical. In a cluster of N
data nodes, each node sorts roughly 1/Nth of the result set, the exact proportion varying depending on how the
data matching the query is distributed in HDFS.
If the size of the sorted intermediate result set on any data node would cause the query to exceed the Impala
memory limit, Impala sorts as much as practical in memory, then writes partially sorted data to disk. (This
technique is known in industry terminology as external sorting and spilling to disk.) As each 8 MB batch of
data is written to disk, Impala frees the corresponding memory to sort a new 8 MB batch of data. When all the
data has been processed, a final merge sort operation is performed to correctly order the in-memory and on-disk
results as the result set is transmitted back to the coordinator node. When external sorting becomes necessary,
Impala requires approximately 60 MB of RAM at a minimum for the buffers needed to read, write, and sort the
intermediate results. If more RAM is available on the data node, Impala will use the additional RAM to minimize
the amount of disk I/O for sorting.
This external sort technique is used as appropriate on each data node (possibly including the coordinator node)
to sort the portion of the result set that is processed on that node. When the sorted intermediate results are
sent back to the coordinator node to produce the final result set, the coordinator node uses a merge sort technique
to produce a final sorted result set without using any extra resources on the coordinator node.
Configuration for disk usage:
By default, intermediate files used during large sort operations are stored in the directory /tmp/impala-scratch.
These files are removed when the sort operation finishes. (Multiple concurrent queries can perform ORDER BY
queries that use the external sort technique, without any name conflicts for these temporary files.) You can
specify a different location by starting the impalad daemon with the --scratch_dirs="path_to_directory"
configuration option. The scratch directory must be on the local filesystem, not in HDFS. You might specify
different directory paths for different hosts, depending on the capacity and speed of the available storage devices.
Impala will not start if it cannot create or read and write files in the scratch directory. If there is less than 1
GB free on the filesystem where that directory resides, Impala still runs, but writes a warning message to its
log.
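For example, a minimal sketch of pointing the daemon at a different scratch location; the path is hypothetical,
and in practice you would include your other usual impalad startup options:
impalad --scratch_dirs="/data0/impala-scratch"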
Restrictions:
With the lifting of the requirement to include a LIMIT clause in every ORDER BY query (in Impala 1.4 and higher):
Now the use of scratch disk space raises the possibility of an out of disk space error on a particular data
node, as opposed to the previous possibility of an out of memory error. Make sure to keep at least 1 GB
free on the filesystem used for temporary sorting work.
The query options DEFAULT_ORDER_BY_LIMIT and ABORT_ON_DEFAULT_LIMIT_EXCEEDED, which formerly
controlled the behavior of ORDER BY queries with no limit specified, are now ignored.
In Impala 1.2.1 and higher, all NULL values come at the end of the result set for ORDER BY ... ASC queries, and
at the beginning of the result set for ORDER BY ... DESC queries. In effect, NULL is considered greater than all
other values for sorting purposes. The original Impala behavior always put NULL values at the end, even for
ORDER BY ... DESC queries. The new behavior in Impala 1.2.1 makes Impala more compatible with other
popular database systems.
GROUP BY Clause
Specify the GROUP BY clause in queries that use aggregation functions, such as COUNT(), SUM(), AVG(), MIN(),
and MAX(). Specify in the GROUP BY clause the names of all the columns that do not participate in the aggregation
operation.
For example, the following query finds the 5 items that sold the highest total quantity (using the SUM()
function), and also counts the number of sales transactions for those items (using the COUNT() function).
Because the column representing the item IDs is not used in any aggregation functions, we specify that column
in the GROUP BY clause.
select
ss_item_sk as Item,
count(ss_item_sk) as Times_Purchased,
sum(ss_quantity) as Total_Quantity_Purchased
from store_sales
group by ss_item_sk
order by sum(ss_quantity) desc
limit 5;
+-------+-----------------+--------------------------+
| item  | times_purchased | total_quantity_purchased |
+-------+-----------------+--------------------------+
| 9325  | 372             | 19072                    |
| 4279  | 357             | 18501                    |
| 7507  | 371             | 18475                    |
| 5953  | 369             | 18451                    |
| 16753 | 375             | 18446                    |
+-------+-----------------+--------------------------+
The HAVING clause lets you filter the results of aggregate functions, because you cannot refer to those expressions
in the WHERE clause. For example, to find the 5 lowest-selling items that were included in at least 100 sales
transactions, we could use this query:
select
ss_item_sk as Item,
count(ss_item_sk) as Times_Purchased,
sum(ss_quantity) as Total_Quantity_Purchased
from store_sales
group by ss_item_sk
having times_purchased >= 100
order by sum(ss_quantity)
limit 5;
+-------+-----------------+--------------------------+
| item  | times_purchased | total_quantity_purchased |
+-------+-----------------+--------------------------+
| 13943 | 105             | 4087                     |
| 2992  | 101             | 4176                     |
| 4773  | 107             | 4204                     |
| 14350 | 103             | 4260                     |
| 11956 | 102             | 4275                     |
+-------+-----------------+--------------------------+
When performing calculations involving scientific or financial data, remember that columns with type FLOAT or
DOUBLE are stored as true floating-point numbers, which cannot precisely represent every possible fractional
value. Thus, if you include a FLOAT or DOUBLE column in a GROUP BY clause, the results might not precisely match
literal values in your query or from an original Text data file. Use rounding operations, the BETWEEN operator, or
another arithmetic technique to match floating-point values that are near literal values you expect. For example,
wholesale cost values originally entered as decimal fractions such as 96.94 and 98.38 can come back slightly
larger or smaller in a result set, due to precision limitations in the hardware floating-point types. The imprecise
representation of FLOAT and DOUBLE values is why financial data processing systems often store currency using
data types that are less space-efficient but avoid these types of rounding errors.
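As an illustration of these matching techniques, a sketch assuming a hypothetical PRICE_LIST table with a
DOUBLE column WHOLESALE_COST:
-- Group on a rounded value rather than the raw DOUBLE, so nearly equal values fall into the same group.
SELECT round(wholesale_cost, 2) AS cost, count(*) AS how_many
FROM price_list
GROUP BY round(wholesale_cost, 2);
-- Use BETWEEN to match values that are near a literal such as 96.94.
SELECT * FROM price_list WHERE wholesale_cost BETWEEN 96.93 AND 96.95;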
Zero-length strings: For purposes of clauses such as DISTINCT and GROUP BY, Impala considers zero-length
strings (""), NULL, and space to all be different values.
HAVING Clause
The HAVING clause performs a filter operation on a SELECT query, by examining the results of aggregation
functions rather than testing each individual table row. Therefore, it is always used in conjunction with a function
such as COUNT(), SUM(), AVG(), MIN(), or MAX(), and typically with the GROUP BY clause also.
LIMIT Clause
The LIMIT clause in a SELECT query sets a maximum number of rows for the result set. Pre-selecting the
maximum size of the result set helps Impala to optimize memory usage while processing a distributed query.
Syntax:
LIMIT constant_integer_expression
The argument to the LIMIT clause must evaluate to a constant value. It can be a numeric literal, or another kind
of numeric expression involving operators, casts, and function return values. You cannot refer to a column or
use a subquery.
Usage notes:
This clause is useful in contexts such as:
To return exactly N items from a top-N query, such as the 10 highest-rated items in a shopping category or
the 50 hostnames that refer the most traffic to a web site.
To demonstrate some sample values from a table or a particular query. (To display some arbitrary items, use
a query with no ORDER BY clause. An ORDER BY clause causes additional memory and/or disk usage during
the query.)
To keep queries from returning huge result sets by accident if a table is larger than expected, or a WHERE
clause matches more rows than expected.
Originally, the value for the LIMIT clause had to be a numeric literal. In Impala 1.2.1 and higher, it can be a
numeric expression.
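For example, a brief sketch using the NUMBERS table that appears in the examples later in this section:
-- Any constant expression works, including operators and function calls, but not column references or subqueries.
SELECT x FROM numbers ORDER BY x LIMIT 2 * 5;
SELECT x FROM numbers ORDER BY x LIMIT length('hello') + 5;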
Prior to Impala 1.4.0, Impala required any query including an ORDER BY clause to also use a LIMIT clause. In
Impala 1.4.0 and higher, the LIMIT clause is optional for ORDER BY queries. In cases where sorting a huge result
set requires enough memory to exceed the Impala memory limit for a particular node, Impala automatically uses
a temporary disk work area to perform the sort operation.
For top-N and bottom-N queries, you use the ORDER BY and LIMIT clauses together:
[localhost:21000] > select x as "Top 3" from numbers order by x desc limit 3;
+-------+
| top 3 |
+-------+
| 5     |
| 4     |
| 3     |
+-------+
[localhost:21000] > select x as "Bottom 3" from numbers order by x limit 3;
+----------+
| bottom 3 |
+----------+
| 1        |
| 2        |
| 3        |
+----------+
OFFSET Clause
The OFFSET clause in a SELECT query causes the result set to start some number of rows after the logical first
item. The result set is numbered starting from zero, so OFFSET 0 produces the same result as leaving out the
OFFSET clause. Always use this clause in combination with ORDER BY (so that it is clear which item should be
first, second, and so on) and LIMIT (so that the result set covers a bounded range, such as items 0-9, 100-199,
and so on).
In Impala 1.2.1 and higher, you can combine a LIMIT clause with an OFFSET clause to produce a small result set
that is different from a top-N query, for example, to return items 11 through 20. This technique can be used to
simulate paged results. Because Impala queries typically involve substantial amounts of I/O, use this technique
only for compatibility in cases where you cannot rewrite the application logic. For best performance and scalability,
wherever practical, query as many items as you expect to need, cache them on the application side, and display
small groups of results to users using application logic.
Examples:
The following example shows how you could run a paging query originally written for a traditional database
application. Because typical Impala queries process megabytes or gigabytes of data and read large data files
from disk each time, it is inefficient to run a separate query to retrieve each small group of items. Use this
technique only for compatibility while porting older applications, then rewrite the application code to use a single
query with a large result set, and display pages of results from the cached result set.
[localhost:21000] > create table numbers (x int);
[localhost:21000] > insert into numbers select x from very_long_sequence;
Inserted 1000000 rows in 1.34s
[localhost:21000] > select x from numbers order by x limit 5 offset 0;
+----+
| x |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+----+
Returned 5 row(s) in 0.26s
[localhost:21000] > select x from numbers order by x limit 5 offset 5;
+----+
| x |
+----+
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
+----+
Returned 5 row(s) in 0.23s
UNION Clause
The UNION clause lets you combine the result sets of multiple queries. By default, the result sets are combined
as if the DISTINCT operator was applied.
Usage notes:
The UNION keyword by itself is the same as UNION DISTINCT. Because eliminating duplicates can be a
memory-intensive process for a large result set, prefer UNION ALL where practical. (That is, when you know the
different queries in the union will not produce any duplicates, or where the duplicate values are acceptable.)
When an ORDER BY clause applies to a UNION ALL or UNION query, in Impala 1.4 and higher, the LIMIT clause
is no longer required. To make the ORDER BY and LIMIT clauses apply to the entire result set, turn the UNION
query into a subquery, SELECT from the subquery, and put the ORDER BY clause at the end, outside the subquery.
Examples:
First, we set up some sample data, including duplicate 1 values.
[localhost:21000] > create table few_ints (x int);
[localhost:21000] > insert into few_ints values (1), (1), (2), (3);
[localhost:21000] > set default_order_by_limit=1000;
This example shows how UNION ALL returns all rows from both queries, without any additional filtering to
eliminate duplicates. For the large result sets common with Impala queries, this is the most memory-efficient
technique.
[localhost:21000] > select x from few_ints;
+---+
| x |
+---+
| 1 |
| 1 |
| 2 |
| 3 |
+---+
Returned 4 row(s) in 0.41s
[localhost:21000] > select x from few_ints union all select x from few_ints;
+---+
| x |
+---+
| 1 |
| 1 |
| 2 |
| 3 |
| 1 |
| 1 |
| 2 |
| 3 |
+---+
Returned 8 row(s) in 0.42s
[localhost:21000] > select * from (select x from few_ints union all select x from
few_ints) as t1 order by x;
+---+
| x |
+---+
| 1 |
| 1 |
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
+---+
Returned 8 row(s) in 0.53s
[localhost:21000] > select x from few_ints union all select 10;
+----+
| x  |
+----+
| 1  |
| 1  |
| 2  |
| 3  |
| 10 |
+----+
This example shows how the UNION clause without the ALL keyword condenses the result set to eliminate all
duplicate values, making the query take more time and potentially more memory. The extra processing typically
makes this technique not recommended for queries that return result sets with millions or billions of values.
[localhost:21000] > select x from few_ints union select x+1 from few_ints;
+---+
| x |
+---+
| 3 |
| 4 |
| 1 |
| 2 |
+---+
Returned 4 row(s) in 0.51s
[localhost:21000] > select x from few_ints union select 10;
+----+
| x |
+----+
| 2 |
| 10 |
| 1 |
| 3 |
+----+
Returned 4 row(s) in 0.49s
[localhost:21000] > select * from (select x from few_ints union select x from few_ints)
as t1 order by x;
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.53s
WITH Clause
A clause that can be added before a SELECT statement, to define aliases for complicated expressions that are
referenced multiple times within the body of the SELECT. Similar to CREATE VIEW, except that the table and
column names defined in the WITH clause do not persist after the query finishes, and do not conflict with names
used in actual tables or views. Also known as subquery factoring.
You can rewrite a query using subqueries to work the same as with the WITH clause. The purposes of the WITH
clause are:
Convenience and ease of maintenance from less repetition within the body of the query. Typically used with
queries involving UNION, joins, or aggregation functions where the similar complicated expressions are
referenced multiple times.
SQL code that is easier to read and understand by abstracting the most complex part of the query into a
separate block.
Improved compatibility with SQL from other database systems that support the same clause (primarily Oracle
Database).
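For example, a minimal sketch of the WITH clause, assuming a hypothetical table T1 with columns C1 and C2:
-- Define the aggregation once, then reference it twice in the body of the query.
WITH totals AS (SELECT c1, SUM(c2) AS total FROM t1 GROUP BY c1)
SELECT c1, total FROM totals WHERE total > 1000
UNION ALL
SELECT c1, total FROM totals WHERE total < 10;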
Note:
The Impala WITH clause does not support recursive queries, which some other database
systems support.
Hints
The Impala SQL dialect supports query hints, for fine-tuning the inner workings of queries. Specify hints as a
temporary workaround for expensive queries, where missing statistics or other factors cause inefficient
performance. The hints are represented as keywords surrounded by [] square brackets; include the brackets
in the text of the SQL statement.
The [BROADCAST] and [SHUFFLE] hints control the execution strategy for join queries. Specify one of the
following constructs immediately after the JOIN keyword in a query:
[SHUFFLE] - Makes that join operation use the partitioned technique, which divides up corresponding rows
from both tables using a hashing algorithm, sending subsets of the rows to other nodes for processing. (The
keyword SHUFFLE is used to indicate a partitioned join, because that type of join is not related to partitioned
tables.) Since the alternative broadcast join mechanism is the default when table and index statistics are
unavailable, you might use this hint for queries where broadcast joins are unsuitable; typically, partitioned
joins are more efficient for joins between large tables of similar size.
[BROADCAST] - Makes that join operation use the broadcast technique that sends the entire contents of
the right-hand table to all nodes involved in processing the join. This is the default mode of operation when
table and index statistics are unavailable, so you would typically only need it if stale metadata caused Impala
to mistakenly choose a partitioned join operation. Typically, broadcast joins are more efficient in cases where
one table is much smaller than the other. (Put the smaller table on the right side of the JOIN operator.)
To see which join strategy is used for a particular query, examine the EXPLAIN output for that query.
Note:
Because hints can prevent queries from taking advantage of new metadata or improvements in query
planning, use them only when required to work around performance issues, and be prepared to remove
them when they are no longer required, such as after a new Impala release or bug fix.
In particular, the [BROADCAST] and [SHUFFLE] hints are expected to be needed much less frequently
in Impala 1.2.2 and higher, because the join order optimization feature in combination with the
COMPUTE STATS statement now automatically chooses the join order and join mechanism without the
need to rewrite the query and add hints. See Performance Considerations for Join Queries on page
206 for details.
For example, this query joins a large customer table with a small lookup table of less than 100 rows. The
right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the
[broadcast] hint to force a broadcast join strategy:
select customer.address, state_lookup.state_name
from customer join [broadcast] state_lookup
on customer.state_id = state_lookup.state_id;
For joins involving three or more tables, the hint applies to the tables on either side of that specific JOIN keyword.
The joins are processed from left to right. For example, this query joins t1 and t2 using a partitioned join, then
joins that result set to t3 using a broadcast join:
select t1.name, t2.id, t3.price
from t1 join [shuffle] t2 join [broadcast] t3
on t1.id = t2.id and t2.id = t3.id;
For more background information and performance considerations for join queries, see Joins on page 119.
When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in the
INSERT statement to fine-tune the overall performance of the operation and its resource usage:
These hints are available in Impala 1.2.2 and higher.
You would only use these hints if an INSERT into a partitioned Parquet table was failing due to capacity
limits, or if such an INSERT was succeeding but with less-than-optimal performance.
To use these hints, put the hint keyword [SHUFFLE] or [NOSHUFFLE] (including the square brackets) after
the PARTITION clause, immediately before the SELECT keyword.
[SHUFFLE] selects an execution plan that minimizes the number of files being written simultaneously to
HDFS, and the number of 1 GB memory buffers holding data for individual partitions. Thus it reduces overall
resource usage for the INSERT operation, allowing some INSERT operations to succeed that otherwise would
fail. It does involve some data transfer between the nodes so that the data files for a particular partition are
all constructed on the same node.
[NOSHUFFLE] selects an execution plan that might be faster overall, but might also produce a larger number
of small data files or exceed capacity limits, causing the INSERT operation to fail. Use [SHUFFLE] in cases
where an INSERT statement fails or runs inefficiently due to all nodes attempting to construct data for all
partitions.
Impala automatically uses the [SHUFFLE] method if any partition key column in the source table, mentioned
in the INSERT ... SELECT query, does not have column statistics. In this case, only the [NOSHUFFLE] hint
would have any effect.
If column statistics are available for all partition key columns in the source table mentioned in the INSERT
... SELECT query, Impala chooses whether to use the [SHUFFLE] or [NOSHUFFLE] technique based on the
estimated number of distinct values in those columns and the number of nodes involved in the INSERT
operation. In this case, you might need the [SHUFFLE] or the [NOSHUFFLE] hint to override the execution
plan selected by Impala.
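For example, a hedged sketch of forcing the [SHUFFLE] plan for an INSERT into a partitioned Parquet table; the
table and column names are hypothetical:
insert into sales_parquet partition (year, month) [shuffle]
select id, amount, year, month from sales_staging;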
DISTINCT Operator
The DISTINCT operator in a SELECT statement filters the result set to remove duplicates:
-- Returns the unique values from one column.
-- NULL is included in the set of values if any rows have a NULL in this column.
select distinct c_birth_country from customer;
-- Returns the unique combinations of values from multiple columns.
select distinct c_salutation, c_last_name from customer;
You can use DISTINCT in combination with an aggregation function, typically COUNT(), to find how many different
values a column contains:
-- Counts the unique values from one column.
-- NULL is not included as a distinct value in the count.
select count(distinct c_birth_country) from customer;
One construct that Impala SQL does not support is using DISTINCT in more than one aggregation function in
the same query. For example, you could not have a single query with both COUNT(DISTINCT c_first_name)
and COUNT(DISTINCT c_last_name) in the SELECT list.
Zero-length strings: For purposes of clauses such as DISTINCT and GROUP BY, Impala considers zero-length
strings (""), NULL, and space to all be different values.
Note:
Impala only allows a single COUNT(DISTINCT columns) expression in each query.
If you do not need precise accuracy, you can produce an estimate of the distinct values for a column
by specifying NDV(column); a query can contain multiple instances of NDV(column).
To produce the same result as multiple COUNT(DISTINCT) expressions, you can use the following
technique for queries involving a single table:
select v1.c1 result1, v2.c1 result2 from
(select count(distinct col1) as c1 from t1) v1
cross join
(select count(distinct col2) as c1 from t1) v2;
Because CROSS JOIN is an expensive operation, prefer to use the NDV() technique wherever practical.
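For example, a sketch of the NDV() alternative for the same hypothetical table T1:
-- Approximate number of distinct values in each column, computed in a single pass over the table.
select ndv(col1) as approx_distinct_col1, ndv(col2) as approx_distinct_col2 from t1;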
Note:
In contrast with some database systems that always return DISTINCT values in sorted order, Impala
does not do any ordering of DISTINCT values. Always include an ORDER BY clause if you need the
values in alphabetical or numeric sorted order.
SHOW Statement
The SHOW statement is a flexible way to get information about different types of Impala objects. You can issue
a SHOW object_type statement to see the appropriate objects in the current database, or SHOW object_type
IN database_name to see objects in a specific database.
Syntax:
To display a list of available objects of a particular kind, issue these statements:
SHOW DATABASES [[LIKE] 'pattern']
SHOW TABLES [IN database_name] [[LIKE] 'pattern']
SHOW CREATE TABLE [database_name.]table_name
SHOW TABLE STATS [database_name.]table_name
SHOW COLUMN STATS [database_name.]table_name
SHOW PARTITIONS [database_name.]table_name
SHOW FUNCTIONS [IN database_name] [[LIKE] 'pattern']
SHOW AGGREGATE FUNCTIONS [IN database_name] [[LIKE] 'pattern']
The optional pattern argument is a quoted string literal, using Unix-style * wildcards and allowing | for alternation.
The preceding LIKE keyword is also optional. All object names are stored in lowercase, so use all lowercase
letters in the pattern string. For example:
show databases 'a*';
show databases like 'a*';
show tables in some_db like '*fact*';
Usage notes:
When authorization is enabled, the output of the SHOW statement is limited to those objects for which you have
some privilege. There might be other databases, tables, and so on, but their names are concealed. If you believe
an object exists but you cannot see it in the SHOW output, check with the system administrator if you need to
be granted a new privilege for that object. See Enabling Sentry Authorization for Impala for how to set up
authorization and add privileges for specific kinds of objects.
SHOW DATABASES:
The SHOW DATABASES statement is often the first one you issue when connecting to an instance for the first
time. You typically issue SHOW DATABASES to see the names you can specify in a USE db_name statement, then
after switching to a database you issue SHOW TABLES to see the names you can specify in SELECT and INSERT
statements.
The output of SHOW DATABASES includes the special _impala_builtins database, which lets you view definitions
of built-in functions, as described under SHOW FUNCTIONS.
SHOW CREATE TABLE:
As a schema changes over time, you might run a CREATE TABLE statement followed by several ALTER TABLE
statements. To capture the cumulative effect of all those statements, SHOW CREATE TABLE displays a CREATE
TABLE statement that would reproduce the current structure of a table. You can use this output in scripts that
set up or clone a group of tables, rather than trying to reproduce the original sequence of CREATE TABLE and
ALTER TABLE statements. When creating variations on the original table, or cloning the original table on a
different system, you might need to edit the SHOW CREATE TABLE output to change things such as the database
name, LOCATION field, and so on that might be different on the destination system.
SHOW TABLE STATS, SHOW COLUMN STATS:
The SHOW TABLE STATS and SHOW COLUMN STATS variants are important for tuning performance and diagnosing
performance issues, especially with the largest tables and the most complex join queries. See How Impala Uses
Statistics for Query Optimization on page 212 for usage information and examples.
SHOW PARTITIONS:
SHOW PARTITIONS displays information about each partition for a partitioned table. (The output is the same as
the SHOW TABLE STATS statement, but SHOW PARTITIONS only works on a partitioned table.) Because it displays
table statistics for all partitions, the output is more informative if you have run the COMPUTE STATS statement
after creating all the partitions. See COMPUTE STATS Statement on page 84 for details. For example, on a CENSUS
table partitioned on the YEAR column:
[localhost:21000] > show partitions census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 4     | 1      | 22B  | TEXT    |
| 2012  | 4     | 1      | 22B  | TEXT    |
| 2013  | 1     | 1      | 231B | PARQUET |
| Total | 9     | 3      | 275B |         |
+-------+-------+--------+------+---------+
SHOW FUNCTIONS:
By default, SHOW FUNCTIONS displays user-defined functions (UDFs) and SHOW AGGREGATE FUNCTIONS displays
user-defined aggregate functions (UDAFs) associated with a particular database. The output from SHOW
FUNCTIONS includes the argument signature of each function. You specify this argument signature as part of
the DROP FUNCTION statement.
To search for functions that use a particular data type, specify a case-sensitive data type name in all capitals:
show functions in _impala_builtins like '*BIGINT*';
+----------------------------------------+
| name                                   |
+----------------------------------------+
| adddate(TIMESTAMP, BIGINT)             |
| bin(BIGINT)                            |
| coalesce(BIGINT...)                    |
...
Examples:
This example shows how you might locate a particular table on an unfamiliar system. The DEFAULT database
is the one you initially connect to; a database with that name is present on every system. You can issue SHOW
TABLES IN db_name without going into a database, or SHOW TABLES once you are inside a particular database.
[localhost:21000] > show databases;
+--------------------+
| name               |
+--------------------+
| _impala_builtins   |
| analyze_testing    |
| avro               |
| ctas               |
| d1                 |
| d2                 |
| d3                 |
| default            |
| file_formats       |
| hbase              |
| load_data          |
| partitioning       |
| regexp_testing     |
| reports            |
| temporary          |
+--------------------+
Returned 14 row(s) in 0.02s
[localhost:21000] > show tables in file_formats;
+--------------------+
| name               |
+--------------------+
| parquet_table      |
...
USE Statement
By default, when you connect to an Impala instance, you begin in a database named default. Issue the statement
USE db_name to switch to another database within an impala-shell session. The current database is where
any CREATE TABLE, INSERT, SELECT, or other statements act when you specify a table without prefixing it with
a database name.
Usage notes:
Switching the default database is convenient in the following situations:
To avoid qualifying each reference to a table with the database name. For example, SELECT * FROM t1
JOIN t2 rather than SELECT * FROM db.t1 JOIN db.t2.
To do a sequence of operations all within the same database, such as creating a table, inserting data, and
querying the table.
To start the impala-shell interpreter and automatically issue a USE statement for a particular database, specify
the option -d db_name for the impala-shell command. The -d option is useful to run SQL scripts, such as
setup or test scripts, against multiple databases without hardcoding a USE statement into the SQL source.
Examples:
See CREATE DATABASE Statement on page 87 for examples covering CREATE DATABASE, USE, and DROP DATABASE.
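In addition, a brief sketch; the database name and script file are hypothetical:
-- Inside impala-shell, switch databases so that table names do not need to be qualified.
USE reports;
SHOW TABLES;
-- From the operating system shell, start in a particular database and run a SQL script.
impala-shell -i localhost -d reports -f nightly_queries.sql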
Cancellation: Cannot be cancelled.
Built-in Functions
Impala supports several categories of built-in functions. These functions let you perform mathematical
calculations, string manipulation, date calculations, and other kinds of data transformations directly in SELECT
statements. The built-in functions let a SQL query return results with all formatting, calculating, and type
conversions applied, rather than performing time-consuming postprocessing in another application. By applying
function calls where practical, you can make a SQL query that is as convenient as an expression in a procedural
programming language or a formula in a spreadsheet.
The categories of functions supported by Impala are mathematical functions, type conversion functions, date
and time functions, conditional functions, string functions, and aggregate functions.
When you use a FROM clause and specify a column name as a function argument, the function is applied for each
item in the result set:
select concat('Country = ',country_code) from all_countries where population > 100000000;
select round(price) as dollar_value from product_catalog where price between 0.0 and
100.0;
Typically, if any argument to a built-in function is NULL, the result value is also NULL:
select cos(null);
select power(2,null);
select concat('a',null,'b');
Aggregate functions are a special category with different rules. These functions calculate a return value across
all the items in a result set, so they require a FROM clause in the query:
select count(product_id) from product_catalog;
select max(height), avg(height) from census_data where age > 20;
Aggregate functions also ignore NULL values rather than returning a NULL result. For example, if some rows
have NULL for a particular column, those rows are ignored when computing the AVG() for that column. Likewise,
specifying COUNT(col_name) in a query counts only those rows where col_name contains a non-NULL value.
Mathematical Functions
Impala supports the following mathematical functions:
abs(double a), abs(decimal(p,s) a)
Purpose: Returns the absolute value of the argument.
Return type: double or decimal(p,s) depending on the type of the argument
ceil(double a), ceiling(double a), ceil(decimal(p,s) a), ceiling(decimal(p,s) a)
Purpose: Returns the smallest integer that is greater than or equal to the argument.
Return type: int or decimal(p,s) based on the type of the input argument
conv(bigint num, int from_base, int to_base), conv(string num, int from_base, int to_base)
Purpose: Returns a string representation of an integer value in a particular base. The input value can be
a string, for example to convert a hexadecimal number such as fce2 to decimal. To use the return value
as a number (for example, when converting to base 10), use CAST() to convert to the appropriate type.
Return type: string
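For example, a brief sketch of converting a hexadecimal string and then casting the result so it can be used in
arithmetic:
-- String result: '64738'.
select conv('fce2', 16, 10);
-- The cast makes the value usable as a number: 64739.
select cast(conv('fce2', 16, 10) as int) + 1;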
cos(double a)
Purpose: Returns the cosine of the argument.
Return type: double
floor(double a)
Purpose: Returns the largest integer that is less than or equal to the argument.
Return type: int
fmod(double a, double b), fmod(float a, float b)
Purpose: Returns the modulus of a floating-point number.
Return type: float or double
fnv_hash(type v)
Purpose: Returns a consistent 64-bit value derived from the input argument, for convenience of
implementing hashing logic in an application.
Return type: BIGINT
Usage notes:
You might use the return value in an application where you perform load balancing, bucketing, or some
other technique to divide processing or storage.
For short argument values, the high-order bits of the result have relatively low entropy:
[localhost:21000] > create table b (x boolean);
[localhost:21000] > insert into b values (true), (true), (false), (false);
[localhost:21000] > select x, fnv_hash(x) from b;
+-------+---------------------+
| x     | fnv_hash(x)         |
+-------+---------------------+
| true  | 2062020650953872396 |
| true  | 2062020650953872396 |
| false | 2062021750465500607 |
| false | 2062021750465500607 |
+-------+---------------------+
hex(bigint a), hex(string a)
Purpose: Returns the hexadecimal representation of an integer value, or of the characters in a string.
Return type: string
is_inf(double a)
Purpose: Tests whether a value is equal to the special value inf, signifying infinity.
Return type: boolean
Usage notes:
Infinity and NaN can be specified in text data files as inf and nan respectively, and Impala interprets
them as these special values. They can also be produced by certain arithmetic expressions; for example,
pow(-1, 0.5) returns infinity and 1/0 returns NaN. Or you can cast the literal values, such as CAST('nan'
AS DOUBLE) or CAST('inf' AS DOUBLE).
is_nan(double a)
Purpose: Tests whether a value is equal to the special value NaN, signifying not a number.
Return type: boolean
Usage notes:
Infinity and NaN can be specified in text data files as inf and nan respectively, and Impala interprets
them as these special values. They can also be produced by certain arithmetic expressions; for example,
pow(-1, 0.5) returns infinity and 1/0 returns NaN. Or you can cast the literal values, such as CAST('nan'
AS DOUBLE) or CAST('inf' AS DOUBLE).
least(bigint a[, bigint b ...]), least(double a[, double b ...]), least(decimal(p,s) a[,
decimal(p,s) b ...]), least(string a[, string b ...]), least(timestamp a[, timestamp b ...])
Purpose: Returns the smallest value from a list of expressions.
Return type: same as the type of the arguments
log(double base, double a)
Purpose: Returns the logarithm of the second argument to the specified base.
Return type: double
log10(double a)
Purpose: Returns the logarithm of the argument to base 10.
Return type: double
min_tinyint(), min_smallint(), min_int(), min_bigint()
Purpose: Returns the smallest value of the associated integral type (a negative number).
Return type: The same as the integral type being checked.
Usage notes: Use the corresponding min_ and max_ functions to check if all values in a column are within
the allowed range, before copying data or altering column definitions. If not, switch to the next higher
integral type or to a DECIMAL with sufficient precision.
negative(int a), negative(double a), negative(decimal(p,s) a)
Purpose: Returns the argument with the sign reversed; returns a positive value if the argument was
already negative.
Return type: int, double, or decimal(p,s) depending on type of argument
Usage notes: Use -abs(a) instead if you need to ensure all return values are negative.
pi()
Purpose: Returns the constant pi.
Return type: double
positive(int a), positive(double a), positive(decimal(p,s) a)
Purpose: Returns the original argument unchanged (even if the argument is negative).
Return type: int, double, or decimal(p,s) depending on type of argument
Usage notes: Use abs() instead if you need to ensure all return values are positive.
pow(double a, double p), power(double a, double p)
Purpose: Returns the first argument raised to the power of the second argument.
Return type: double
precision(numeric_expression)
Purpose: Computes the precision (number of decimal digits) needed to represent the type of the argument
expression as a DECIMAL value.
Usage notes:
Typically used in combination with the scale() function, to determine the appropriate
DECIMAL(precision,scale) type to declare in a CREATE TABLE statement or CAST() function.
Return type: int
Examples:
The following examples demonstrate how to check the precision and scale of numeric literals or other
numeric expressions. Impala represents numeric literals in the smallest appropriate type. 5 is a TINYINT
value, which ranges from -128 to 127, therefore 3 decimal digits are needed to represent the entire range,
and because it is an integer value there are no fractional digits. 1.333 is interpreted as a DECIMAL value,
with 4 digits total and 3 digits after the decimal point.
[localhost:21000] > select precision(5), scale(5);
+--------------+----------+
| precision(5) | scale(5) |
+--------------+----------+
| 3            | 0        |
+--------------+----------+
[localhost:21000] > select precision(1.333), scale(1.333);
+------------------+--------------+
| precision(1.333) | scale(1.333) |
+------------------+--------------+
| 4                | 3            |
+------------------+--------------+
quotient(int numerator, int denominator)
Purpose: Returns the first argument divided by the second argument, discarding any fractional part.
Avoids promoting arguments to DOUBLE as happens with the / SQL operator.
Return type: int
radians(double a)
Purpose: Converts the argument value from degrees to radians.
Return type: double
rand(), rand(int seed)
Purpose: Returns a random value between 0 and 1. After rand() is called with a seed argument, it
produces a consistent random sequence based on the seed value.
Return type: double
Usage notes: Currently, the random sequence is reset after each query, and multiple calls to rand()
within the same query return the same value each time. For different number sequences that are different
for each query, pass a unique seed value to each call to rand(). For example, select
rand(unix_timestamp()) from ...
round(double a), round(double a, int d), round(decimal a, int_type d)
Purpose: Rounds a floating-point value. By default (with a single argument), rounds to the nearest integer.
Values ending in .5 are rounded up for positive numbers, down for negative numbers (that is, away from
zero). The optional second argument specifies how many digits to leave after the decimal point; values
greater than zero produce a floating-point return value rounded to the requested number of digits to
the right of the decimal point.
Return type: bigint for a single floating-point argument; double for a double argument when the second
argument is greater than zero. For DECIMAL values, the smallest DECIMAL(p,s) type with appropriate
precision and scale.
scale(numeric_expression)
Purpose: Computes the scale (number of decimal digits to the right of the decimal point) needed to
represent the type of the argument expression as a DECIMAL value.
Usage notes:
Typically used in combination with the precision() function, to determine the appropriate
DECIMAL(precision,scale) type to declare in a CREATE TABLE statement or CAST() function.
Return type: int
Examples:
See the precision() function for examples that check the precision and scale of numeric literals or
other numeric expressions; precision() and scale() are typically used together.
unhex(string a)
Purpose: Returns a string of characters with ASCII values corresponding to pairs of hexadecimal digits
in the argument.
Return type: string
Date and Time Functions
Impala supports the following date and time functions:
add_months(timestamp date, int months), add_months(timestamp date, bigint months)
Purpose: Returns the specified date and time plus some number of months.
Return type: timestamp
Usage notes: Same as months_add(). Available in Impala 1.4 and higher. For compatibility when porting
code with vendor extensions.
adddate(timestamp startdate, int days), adddate(timestamp startdate, bigint days),
Purpose: Adds a specified number of days to a TIMESTAMP value. Similar to date_add(), but starts with
an actual TIMESTAMP value instead of a string that is converted to a TIMESTAMP.
Return type: timestamp
current_timestamp()
Purpose: Alias for the now() function.
Return type: timestamp
date_add(timestamp startdate, int days), date_add(timestamp startdate, interval_expression)
Purpose: Adds a specified number of days to a TIMESTAMP value. The first argument can be a string,
which is automatically cast to TIMESTAMP if it uses the recognized format, as described in TIMESTAMP
Data Type on page 61. With an INTERVAL expression as the second argument, you can calculate a delta
value using other units such as weeks, years, hours, seconds, and so on; see TIMESTAMP Data Type on
page 61 for details.
Return type: timestamp
date_sub(timestamp startdate, int days), date_sub(timestamp startdate, interval_expression)
Purpose: Subtracts a specified number of days from a TIMESTAMP value. The first argument can be a
string, which is automatically cast to TIMESTAMP if it uses the recognized format, as described in
TIMESTAMP Data Type on page 61. With an INTERVAL expression as the second argument, you can
calculate a delta value using other units such as weeks, years, hours, seconds, and so on; see TIMESTAMP
Data Type on page 61 for details.
Return type: timestamp
datediff(string enddate, string startdate)
Purpose: Returns the number of days between two dates represented as strings.
Return type: int
day(string date), dayofmonth(string date)
Purpose: Returns the day field from a date represented as a string.
Return type: int
dayname(string date)
Purpose: Returns the day field from a date represented as a string, converted to the string corresponding
to that day name. The range of return values is 'Sunday' to 'Saturday'. Used in report-generating
queries, as an alternative to converting the dayofweek() return value with a CASE expression.
Return type: string
dayofweek(string date)
Purpose: Returns the day field from a date represented as a string, corresponding to the day of the week.
The range of return values is 1 (Sunday) to 7 (Saturday).
Return type: int
dayofyear(timestamp date)
Purpose: Returns the day field from a TIMESTAMP value, corresponding to the day of the year. The range
of return values is 1 (January 1) to 366 (December 31 of a leap year).
Return type: int
days_add(timestamp startdate, int days), days_add(timestamp startdate, bigint days)
Purpose: Adds a specified number of days to a TIMESTAMP value. Similar to date_add(), but starts with
an actual TIMESTAMP value instead of a string that is converted to a TIMESTAMP.
Return type: timestamp
days_sub(timestamp startdate, int days), days_sub(timestamp startdate, bigint days)
Purpose: Subtracts a specified number of days from a TIMESTAMP value. Similar to date_sub(), but
starts with an actual TIMESTAMP value instead of a string that is converted to a TIMESTAMP.
Return type: timestamp
extract(timestamp, string unit)
Purpose: Returns one of the numeric date or time fields from a TIMESTAMP value.
Unit argument: The unit string can be one of year, month, day, hour, minute, second, or millisecond.
This argument value is case-insensitive.
Usage notes: Typically used in GROUP BY queries to arrange results by hour, day, month, and so on. You
can also use this function in an INSERT ... SELECT into a partitioned table to split up TIMESTAMP values
into individual parts, if the partitioned table has separate partition key columns representing year, month,
day, and so on. If you need to divide by more complex units of time, such as by week or by quarter, use
the TRUNC() function instead.
Return type: int
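For example, a sketch of grouping by a unit extracted from a TIMESTAMP column; the table and column names
are hypothetical:
select extract(order_ts, 'month') as month, count(*) as orders
from orders
group by extract(order_ts, 'month');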
from_unixtime(bigint unixtime[, string format])
Purpose: Converts the number of seconds from the Unix epoch to the specified time into a string.
Return type: string
Usage notes: The format string accepts the variations allowed for the TIMESTAMP data type: date plus
time, date by itself, time by itself, and optional fractional seconds for the time. See TIMESTAMP Data
Type on page 61 for details.
Currently, the format string is case-sensitive, especially to distinguish m for minutes and M for months.
In Impala 1.3 and higher, you can switch the order of elements, use alternative separator characters, and
use a different number of placeholders for each unit. Adding more instances of y, d, H, and so on produces
output strings zero-padded to the requested number of characters. The exception is M for months, where
M produces a non-padded value such as 3, MM produces a zero-padded value such as 03, MMM produces
an abbreviated month name such as Mar, and sequences of 4 or more M are not allowed. A date string
including all fields could be "yyyy-MM-dd HH:mm:ss.SSSSSS", "dd/MM/yyyy HH:mm:ss.SSSSSS", "MMM
dd, yyyy HH.mm.ss (SSSSSS)" or other combinations of placeholders and separator characters.
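For example, a brief sketch that formats the current time using two of the allowed patterns:
select from_unixtime(unix_timestamp(), 'yyyy-MM-dd');
select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss');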
from_utc_timestamp(timestamp, string timezone)
Purpose: Converts a specified UTC timestamp value into the appropriate value for a specified time zone.
Return type: timestamp
hour(string date)
Purpose: Returns the hour field from a date represented as a string.
Return type: int
hours_add(timestamp date, int hours), hours_add(timestamp date, bigint hours)
Purpose: Returns the specified date and time plus some number of hours.
Return type: timestamp
hours_sub(timestamp date, int hours), hours_sub(timestamp date, bigint hours)
Purpose: Returns the specified date and time minus some number of hours.
Return type: timestamp
microseconds_add(timestamp date, int microseconds), microseconds_add(timestamp date, bigint
microseconds)
Purpose: Returns the specified date and time plus some number of microseconds.
Return type: timestamp
microseconds_sub(timestamp date, int microseconds), microseconds_sub(timestamp date, bigint
microseconds)
Purpose: Returns the specified date and time minus some number of microseconds.
Return type: timestamp
milliseconds_add(timestamp date, int milliseconds), milliseconds_add(timestamp date, bigint
milliseconds)
Purpose: Returns the specified date and time plus some number of milliseconds.
Return type: timestamp
milliseconds_sub(timestamp date, int milliseconds), milliseconds_sub(timestamp date, bigint
milliseconds)
Purpose: Returns the specified date and time minus some number of milliseconds.
Return type: timestamp
minute(string date)
Purpose: Returns the minute field from a date represented as a string.
Return type: int
minutes_add(timestamp date, int minutes), minutes_add(timestamp date, bigint minutes)
Purpose: Returns the specified date and time plus some number of minutes.
Return type: timestamp
minutes_sub(timestamp date, int minutes), minutes_sub(timestamp date, bigint minutes)
Purpose: Returns the specified date and time minus some number of minutes.
Return type: timestamp
month(string date)
Purpose: Returns the month field from a date represented as a string.
Return type: int
months_add(timestamp date, int months), months_add(timestamp date, bigint months)
Purpose: Returns the specified date and time plus some number of months.
Return type: timestamp
months_sub(timestamp date, int months), months_sub(timestamp date, bigint months)
Purpose: Returns the specified date and time minus some number of months.
Return type: timestamp
nanoseconds_add(timestamp date, int nanoseconds), nanoseconds_add(timestamp date, bigint
nanoseconds)
Purpose: Returns the specified date and time plus some number of nanoseconds.
Return type: timestamp
nanoseconds_sub(timestamp date, int nanoseconds), nanoseconds_sub(timestamp date, bigint
nanoseconds)
Purpose: Returns the specified date and time minus some number of nanoseconds.
Return type: timestamp
now()
Purpose: Returns the current date and time (in the UTC time zone) as a timestamp value.
Return type: timestamp
seconds_add(timestamp date, int seconds), seconds_add(timestamp date, bigint seconds)
Purpose: Returns the specified date and time plus some number of seconds.
Return type: timestamp
seconds_sub(timestamp date, int seconds), seconds_sub(timestamp date, bigint seconds)
Purpose: Returns the specified date and time minus some number of seconds.
Return type: timestamp
subdate(timestamp startdate, int days), subdate(timestamp startdate, bigint days),
Purpose: Subtracts a specified number of days from a TIMESTAMP value. Similar to date_sub(), but
starts with an actual TIMESTAMP value instead of a string that is converted to a TIMESTAMP.
Return type: timestamp
to_date(timestamp)
Purpose: Returns a string representation of the date field from a timestamp value.
Return type: string
to_utc_timestamp(timestamp, string timezone)
Purpose: Converts a specified timestamp value in a specified time zone into the corresponding value for
the UTC time zone.
Return type: timestamp
trunc(timestamp, string unit)
Purpose: Strips off fields from a TIMESTAMP value.
Unit argument: The unit argument value is case-sensitive, and can be one of:
SYYYY, YYYY, YEAR, SYEAR, YYY, YY, Y: Year (rounds up on July 1).
Q: Quarter (rounds up on the sixteenth day of the second month of the quarter).
MONTH, MON, MM, RM: Month (rounds up on the sixteenth day).
WW, W: Same day of the week as the first day of the month.
DDD, DD, J: Day of the month.
DAY, DY, D: Starting day of the week.
HH, HH12, HH24: Hour. A TIMESTAMP value rounded or truncated to the hour is always represented in
24-hour notation, even for the HH12 argument string.
MI: Minute.
Usage notes: Typically used in GROUP BY queries to aggregate results from the same hour, day, week,
month, quarter, and so on. You can also use this function in an INSERT ... SELECT into a partitioned
table to divide TIMESTAMP values into the correct partition.
Because the return value is a TIMESTAMP, if you cast the result of TRUNC() to STRING, you will often see
zeroed-out portions such as 00:00:00 in the time field. If you only need the individual units such as
hour, day, month, or year, use the EXTRACT() function instead. If you need the individual units from a
truncated TIMESTAMP value, run the TRUNC() function on the original value, then run EXTRACT() on
the result.
Return type: timestamp
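For example, a sketch of aggregating by calendar day; the table and column names are hypothetical:
select trunc(event_ts, 'DD') as event_day, count(*) as events
from events
group by trunc(event_ts, 'DD');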
unix_timestamp(), unix_timestamp(string date), unix_timestamp(string date, string pattern),
unix_timestamp(timestamp datetime)
Purpose: Returns an integer value representing the current date and time as a delta from the Unix epoch,
or converts from a specified date and time value represented as a TIMESTAMP or STRING.
Return type: bigint
Usage notes: See from_unixtime() for details about the patterns you can use in the format string to
represent the position of year, month, day, and so on in the date string. In Impala 1.3 and higher, you
have more flexibility to switch the positions of elements and use different separator characters.
unix_timestamp() and from_unixtime() are often used in combination to convert a TIMESTAMP value
into a particular string format.
weekofyear(string date)
Purpose: Returns the corresponding week (1-53) from a date represented as a string.
Return type: int
weeks_add(timestamp date, int weeks), weeks_add(timestamp date, bigint weeks)
Purpose: Returns the specified date and time plus some number of weeks.
Return type: timestamp
weeks_sub(timestamp date, int weeks), weeks_sub(timestamp date, bigint weeks)
Purpose: Returns the specified date and time minus some number of weeks.
Return type: timestamp
year(string date)
Purpose: Returns the year field from a date represented as a string.
Return type: int
years_add(timestamp date, int years), years_add(timestamp date, bigint years)
Purpose: Returns the specified date and time plus some number of years.
Return type: timestamp
years_sub(timestamp date, int years), years_sub(timestamp date, bigint years)
Purpose: Returns the specified date and time minus some number of years.
Return type: timestamp
Conditional Functions
Impala supports the following conditional functions for testing equality, comparison operators, and nullity:
CASE a WHEN b THEN c [WHEN d THEN e]... [ELSE f] END
Purpose: Compares an expression to one or more possible values, and returns a corresponding result
when a match is found.
Return type: same as the initial argument value, except that integer values are promoted to BIGINT and
floating-point values are promoted to DOUBLE; use CAST() when inserting into a smaller numeric column
CASE WHEN a THEN b [WHEN c THEN d]... [ELSE e] END
Purpose: Tests whether any of a sequence of expressions is true, and returns a corresponding result for
the first true expression.
coalesce(type v1, type v2, ...)
Purpose: Returns the first specified argument that is not NULL, or NULL if all arguments are NULL.
Return type: same as the initial argument value, except that integer values are promoted to BIGINT and
floating-point values are promoted to DOUBLE; use CAST() when inserting into a smaller numeric column
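For example, a sketch of both CASE forms against a hypothetical PRODUCTS table:
select name,
case size when 'S' then 'small' when 'L' then 'large' else 'other' end as size_label,
case when price < 10 then 'budget' when price < 100 then 'midrange' else 'premium' end as price_band
from products;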
if(boolean condition, type ifTrue, type ifFalseOrNull)
Purpose: Tests an expression and returns a corresponding result depending on whether the result is
true, false, or NULL.
Return type: same as the ifTrue argument value
ifnull(type a, type ifNotNull)
Purpose: Alias for the isnull() function, with the same behavior. To simplify porting SQL with vendor
extensions to Impala.
Added in: Impala 1.3.0
isnull(type a, type ifNotNull)
Purpose: Tests if an expression is NULL, and returns the expression result value if not. If the first argument
is NULL, returns the second argument.
Compatibility notes: Equivalent to the nvl() function from Oracle Database or ifnull() from MySQL.
The nvl() and ifnull() functions are also available in Impala.
Return type: same as the first argument value
nullif(expr1,expr2)
Purpose: Returns NULL if the two specified arguments are equal. If the specified arguments are not equal,
returns the value of expr1. The data types of the expressions must be compatible, according to the
conversion rules from Data Types on page 49. You cannot use an expression that evaluates to NULL for
expr1; that way, you can distinguish a return value of NULL from an argument value of NULL, which would
never match expr2.
Usage notes: This function is effectively shorthand for a CASE expression of the form:
CASE
WHEN expr1 = expr2 THEN NULL
ELSE expr1
END
It is commonly used in division expressions, to produce a NULL result instead of a divide-by-zero error
when the divisor is equal to zero:
select 1.0 / nullif(c1,0) as reciprocal from t1;
You might also use it for compatibility with other database systems that support the same NULLIF()
function.
Return type: same as the initial argument value, except that integer values are promoted to BIGINT and
floating-point values are promoted to DOUBLE; use CAST() when inserting into a smaller numeric column
Added in: Impala 1.3.0
nullifzero(numeric_expr)
Purpose: Returns NULL if the numeric expression evaluates to 0, otherwise returns the result of the
expression.
nvl(type a, type ifNotNull)
Purpose: Alias for the isnull() function. Tests if an expression is NULL, and returns the expression
result value if not. If the first argument is NULL, returns the second argument. Equivalent to the nvl()
function from Oracle Database or ifnull() from MySQL.
Return type: same as the first argument value
Added in: Impala 1.1
zeroifnull(numeric_expr)
Purpose: Returns 0 if the numeric expression evaluates to NULL, otherwise returns the result of the
expression.
Usage notes: Used to avoid unexpected results due to unexpected propagation of NULL values in numeric
calculations. Serves as shorthand for a more elaborate CASE expression, to simplify porting SQL with
vendor extensions to Impala.
Return type: same as the initial argument value, except that integer values are promoted to BIGINT and
floating-point values are promoted to DOUBLE; use CAST() when inserting into a smaller numeric column
Added in: Impala 1.3.0
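For example, minimal sketches reusing the t1 table and numeric column c1 from other examples in this
document:
-- Produce NULL instead of a divide-by-zero error when c1 is 0.
select 100 / nullifzero(c1) from t1;
-- Treat NULL as 0 so every row contributes to the total.
select sum(zeroifnull(c1)) from t1;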
String Functions
Impala supports the following string functions:
ascii(string str)
Purpose: Returns the numeric ASCII code of the first character of the argument.
Return type: int
char_length(string a), character_length(string a)
Purpose: Returns the length in characters of the argument string. Aliases for the length() function.
Return type: int
concat(string a, string b...)
Purpose: Returns a single string representing all the argument values joined together.
Return type: string
Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple
columns within the same row, while group_concat() joins together values from different rows.
concat_ws(string sep, string a, string b...)
Purpose: Returns a single string representing the second and following argument values joined together,
delimited by a specified separator.
Return type: string
Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple
columns within the same row, while group_concat() joins together values from different rows.
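For example, these calls use only literal values, so the results are fixed:
select concat('www.','cloudera','.com');    -- returns 'www.cloudera.com'
select concat_ws('-','2014','09','08');     -- returns '2014-09-08'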
find_in_set(string str, string strList)
Purpose: Returns the position (starting from 1) of the first occurrence of a specified string within a
comma-separated string. Returns NULL if either argument is NULL, 0 if the search string is not found, or
0 if the search string contains a comma.
Return type: int
group_concat(string s [, string sep])
Purpose: Returns a single string representing the argument value concatenated together for each row
of the result set. If the optional separator string is specified, the separator is added between each pair
of concatenated values.
Return type: string
Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple
columns within the same row, while group_concat() joins together values from different rows.
By default, returns a single string covering the whole result set. To include other columns or values in
the result set, or to produce multiple concatenated strings for subsets of rows, include a GROUP BY clause
in the query.
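For example, a sketch assuming t1 has a string column s and a month column, consistent with other
examples in this document:
-- One concatenated string covering the whole table.
select group_concat(s) from t1;
-- One concatenated string for each month, using a custom separator.
select month, group_concat(s, ' | ') from t1 group by month;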
initcap(string str)
Purpose: Returns the input string with the first letter of each word capitalized and all other letters in lowercase.
Return type: string
instr(string str, string substr)
Purpose: Returns the position (starting from 1) of the first occurrence of a substring within a longer
string.
Return type: int
length(string a)
Purpose: Returns the length in characters of the argument string.
Return type: int
locate(string substr, string str[, int pos])
Purpose: Returns the position (starting from 1) of the first occurrence of a substring within a longer
string, optionally after a particular position.
Return type: int
lower(string a), lcase(string a)
Purpose: Returns the argument string converted to all-lowercase.
Return type: string
lpad(string str, int len, string pad)
Purpose: Returns a string of a specified length, based on the first argument string. If the specified string
is too short, it is padded on the left with a repeating sequence of the characters from the pad string. If
the specified string is too long, it is truncated on the right.
Return type: string
ltrim(string a)
Purpose: Returns the argument string with any leading spaces removed from the left side.
Return type: string
parse_url(string urlString, string partToExtract [, string keyToExtract])
Purpose: Returns the portion of a URL corresponding to a specified part. The part argument can be
'PROTOCOL', 'HOST', 'PATH', 'REF', 'AUTHORITY', 'FILE', 'USERINFO', or 'QUERY'. Uppercase is
required for these literal values.
Return type: string
regexp_extract(string subject, string pattern, int index)
Purpose: Returns the specified () group from a string based on a regular expression pattern. Group 0
refers to the entire extracted string, while group 1, 2, and so on refers to the first, second, and so on (...)
portion.
Return type: string
The Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used
by the Boost library. For details, see the Boost documentation. It has most idioms familiar from regular
expressions in Perl, Python, and so on. It does not support .*? for non-greedy matches.
Because the impala-shell interpreter uses the \ character for escaping, use \\ to represent the regular
expression escape character in any regular expressions that you submit through impala-shell. You
might prefer to use the equivalent character class names, such as [[:digit:]] instead of \d which you
would have to escape as \\d.
Examples:
This example shows how group 0 matches the full pattern string, including the portion outside any ()
group:
[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*(\\d+)',0);
+-----------------------------------------------------+
| regexp_extract('abcdef123ghi456jkl', '.*(\\d+)', 0) |
+-----------------------------------------------------+
| abcdef123ghi456                                     |
+-----------------------------------------------------+
Returned 1 row(s) in 0.11s
This example shows how group 1 matches just the contents inside the first () group in the pattern string:
[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*(\\d+)',1);
+-----------------------------------------------------+
| regexp_extract('abcdef123ghi456jkl', '.*(\\d+)', 1) |
+-----------------------------------------------------+
| 456                                                 |
+-----------------------------------------------------+
Returned 1 row(s) in 0.11s
The Boost regular expression syntax does not support the .*? idiom for non-greedy matches. This
example shows how a pattern string starting with .* matches the longest possible portion of the source
string, effectively serving as a greedy match and returning the rightmost set of lowercase letters. A
pattern string both starting and ending with .* finds two potential matches of equal length, and returns
the first one found (the leftmost set of lowercase letters), effectively serving as a non-greedy match.
[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*([[:lower:]]+)',1);
+-------------------------------------------------------+
| regexp_extract('abcdbcdefghi', '.*([[:lower:]]+)', 1) |
+-------------------------------------------------------+
| def                                                   |
+-------------------------------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*([[:lower:]]+).*',1);
+---------------------------------------------------------+
| regexp_extract('abcdbcdefghi', '.*([[:lower:]]+).*', 1) |
+---------------------------------------------------------+
| bcd                                                     |
+---------------------------------------------------------+
regexp_replace(string initial, string pattern, string replacement)
Purpose: Returns the initial argument with the regular expression pattern replaced by the final argument
string.
Return type: string
The Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used
by the Boost library. For details, see the Boost documentation. It has most idioms familiar from regular
expressions in Perl, Python, and so on. It does not support .*? for non-greedy matches.
Because the impala-shell interpreter uses the \ character for escaping, use \\ to represent the regular
expression escape character in any regular expressions that you submit through impala-shell. You
might prefer to use the equivalent character class names, such as [[:digit:]] instead of \d which you
would have to escape as \\d.
Examples:
These examples show how you can replace parts of a string matching a pattern with replacement text,
which can include backreferences to any () groups in the pattern string. The backreference numbers
start at 1, and any \ characters must be escaped as \\.
Replace a character pattern with new text:
[localhost:21000] > select regexp_replace('aaabbbaaa','b+','xyz');
+------------------------------------------+
| regexp_replace('aaabbbaaa', 'b+', 'xyz') |
+------------------------------------------+
| aaaxyzaaa                                |
+------------------------------------------+
Returned 1 row(s) in 0.11s
Replace a character pattern with substitution text that includes the original matching text:
[localhost:21000] > select regexp_replace('aaabbbaaa','(b+)','<\\1>');
+----------------------------------------------+
| regexp_replace('aaabbbaaa', '(b+)', '<\\1>') |
+----------------------------------------------+
| aaa<bbb>aaa                                  |
+----------------------------------------------+
Returned 1 row(s) in 0.11s
rpad(string str, int len, string pad)
Purpose: Returns a string of a specified length, based on the first argument string. If the specified string
is too short, it is padded on the right with a repeating sequence of the characters from the pad string. If
the specified string is too long, it is truncated on the right.
Return type: string
rtrim(string a)
Purpose: Returns the argument string with any trailing spaces removed from the right side.
Return type: string
space(int n)
Purpose: Returns a concatenated string of the specified number of spaces. Shorthand for repeat(' ', n).
Return type: string
strleft(string a, int num_chars)
Purpose: Returns the leftmost characters of the string. Shorthand for a call to substr() with 2 arguments.
Return type: string
strright(string a, int num_chars)
Purpose: Returns the rightmost characters of the string. Shorthand for a call to substr() with 2
arguments.
Return type: string
substr(string a, int start [, int len]), substring(string a, int start [, int len])
Purpose: Returns the portion of the string starting at a specified point, optionally with a specified maximum
length. The characters in the string are indexed starting at 1.
Return type: string
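For example, these calls use only literal values, so the results are fixed:
select substr('Cloudera Impala', 1, 8);   -- returns 'Cloudera'
select substr('Cloudera Impala', 10);     -- returns 'Impala'
select strleft('Cloudera Impala', 8);     -- returns 'Cloudera'
select strright('Cloudera Impala', 6);    -- returns 'Impala'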
translate(string input, string from, string to)
Purpose: Returns the input string with a set of characters replaced by another set of characters.
Return type: string
trim(string a)
Purpose: Returns the input string with both leading and trailing spaces removed. The same as passing
the string through both ltrim() and rtrim().
Return type: string
upper(string a), ucase(string a)
Purpose: Returns the argument string converted to all-uppercase.
Return type: string
Miscellaneous Functions
Impala supports the following utility functions that do not operate on a particular column or data type:
current_database()
Purpose: Returns the database that the session is currently using, either default if no database has
been selected, or whatever database the session switched to through a USE statement or the impalad
-d option.
Return type: string
pid()
Purpose: Returns the process ID of the impalad daemon that the session is connected to. You can use
it during low-level debugging, to issue Linux commands that trace, show the arguments, and so on, for
the impalad process.
Return type: int
user()
Purpose: Returns the username of the Linux user who is connected to the impalad daemon. Typically
called a single time, in a query without any FROM clause, to understand how authorization settings apply
in a security context; once you know the logged-in user name, you can check which groups that user
belongs to, and from the list of groups you can check which roles are available to those groups through
the authorization policy file.
Return type: string
version()
Purpose: Returns information such as the precise version number and build date for the impalad daemon
that you are currently connected to. Typically used to confirm that you are connected to the expected
level of Impala to use a particular feature, or to connect to several nodes and confirm they are all running
the same level of impalad.
Return type: string (with one or more embedded newlines)
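For example, these functions are typically called by themselves, without a FROM clause; the exact output
depends on your session and cluster:
select current_database();
select user();
select pid();
select version();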
Aggregate Functions
Aggregate functions are a special category with different rules. These functions calculate a return value across
all the items in a result set, so they require a FROM clause in the query:
select count(product_id) from product_catalog;
select max(height), avg(height) from census_data where age > 20;
Aggregate functions also ignore NULL values rather than returning a NULL result. For example, if some rows
have NULL for a particular column, those rows are ignored when computing the AVG() for that column. Likewise,
specifying COUNT(col_name) in a query counts only those rows where col_name contains a non-NULL value.
AVG Function
An aggregate function that returns the average value from a set of numbers. Its single argument can be a numeric
column, or the numeric result of a function or expression applied to the column value. Rows with a NULL value
for the specified column are ignored. If the table is empty, or all the values supplied to AVG are NULL, AVG returns
NULL.
When the query contains a GROUP BY clause, returns one value for each combination of grouping values.
Return type: DOUBLE
Examples:
-- Average all the non-NULL values in a column.
insert overwrite avg_t values (2),(4),(6),(null),(null);
-- The average of the above values is 4: (2+4+6) / 3. The 2 NULL values are ignored.
select avg(x) from avg_t;
-- Average only certain values from the column.
select avg(x) from t1 where month = 'January' and year = '2013';
-- Apply a calculation to the value of the column before averaging.
select avg(x/3) from t1;
-- Apply a function to the value of the column before averaging.
-- Here we are substituting a value of 0 for all NULLs in the column,
-- so that those rows do factor into the return value.
select avg(isnull(x,0)) from t1;
-- Apply some number-returning function to a string column and average the results.
-- If column s contains any NULLs, length(s) also returns NULL and those rows are ignored.
select avg(length(s)) from t1;
COUNT Function
An aggregate function that returns the number of rows, or the number of non-NULL rows, that meet certain
conditions:
The notation COUNT(*) includes NULL values in the total.
The notation COUNT(column_name) only considers rows where the column contains a non-NULL value.
You can also combine COUNT with the DISTINCT operator to eliminate duplicates before counting, and to
count the combinations of values across multiple columns.
When the query contains a GROUP BY clause, returns one value for each combination of grouping values.
Return type: BIGINT
Examples:
-- How many rows total are in the table, regardless of NULL values?
select count(*) from t1;
-- How many rows are in the table with non-NULL values for a column?
select count(c1) from t1;
-- Count the rows that meet certain conditions.
-- Again, * includes NULLs, so COUNT(*) might be greater than COUNT(col).
select count(*) from t1 where x > 10;
select count(c1) from t1 where x > 10;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Combine COUNT and DISTINCT to find the number of unique values.
-- Must use column names rather than * with COUNT(DISTINCT ...) syntax.
-- Rows with NULL values are not counted.
select count(distinct c1) from t1;
-- Rows with a NULL value in _either_ column are not counted.
select count(distinct c1, c2) from t1;
-- Return more than one result.
select month, year, count(distinct visitor_id) from web_stats group by month, year;
Note:
Impala only allows a single COUNT(DISTINCT columns) expression in each query.
If you do not need precise accuracy, you can produce an estimate of the distinct values for a column
by specifying NDV(column); a query can contain multiple instances of NDV(column).
To produce the same result as multiple COUNT(DISTINCT) expressions, you can use the following
technique for queries involving a single table:
select v1.c1 result1, v2.c1 result2 from
(select count(distinct col1) as c1 from t1) v1
cross join
(select count(distinct col2) as c1 from t1) v2;
Because CROSS JOIN is an expensive operation, prefer to use the NDV() technique wherever practical.
MAX Function
An aggregate function that returns the maximum value from a set of numbers. Opposite of the MIN function.
Its single argument can be a numeric column, or the numeric result of a function or expression applied to the
column value. Rows with a NULL value for the specified column are ignored. If the table is empty, or all the values
supplied to MAX are NULL, MAX returns NULL.
When the query contains a GROUP BY clause, returns one value for each combination of grouping values.
Return type: Same as the input argument
Examples:
-- Find the largest value for this column in the table.
select max(c1) from t1;
-- Find the largest value for this column from a subset of the table.
select max(c1) from t1 where month = 'January' and year = '2013';
-- Find the largest value from a set of numeric function results.
select max(length(s)) from t1;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Return more than one result.
select month, year, max(purchase_price) from store_stats group by month, year;
-- Filter the input to eliminate duplicates before performing the calculation.
select max(distinct x) from t1;
MIN Function
An aggregate function that returns the minimum value from a set of numbers. Opposite of the MAX function.
Its single argument can be a numeric column, or the numeric result of a function or expression applied to the
column value. Rows with a NULL value for the specified column are ignored. If the table is empty, or all the values
supplied to MIN are NULL, MIN returns NULL.
When the query contains a GROUP BY clause, returns one value for each combination of grouping values.
Return type: Same as the input argument
Examples:
-- Find the smallest value for this column in the table.
select min(c1) from t1;
-- Find the smallest value for this column from a subset of the table.
select min(c1) from t1 where month = 'January' and year = '2013';
-- Find the smallest value from a set of numeric function results.
select min(length(s)) from t1;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Return more than one result.
select month, year, min(purchase_price) from store_stats group by month, year;
-- Filter the input to eliminate duplicates before performing the calculation.
select min(distinct x) from t1;
SUM Function
An aggregate function that returns the sum of a set of numbers. Its single argument can be a numeric column,
or the numeric result of a function or expression applied to the column value. Rows with a NULL value for the
specified column are ignored. If the table is empty, or all the values supplied to SUM are NULL, SUM returns NULL.
When the query contains a GROUP BY clause, returns one value for each combination of grouping values.
Return type: BIGINT for integer arguments, DOUBLE for floating-point arguments
Examples:
-- Total all the values for this column in the table.
select sum(c1) from t1;
-- Find the total for this column from a subset of the table.
select sum(c1) from t1 where month = 'January' and year = '2013';
-- Find the total from a set of numeric function results.
select sum(length(s)) from t1;
-- Often used with functions that return predefined values to compute a score.
select sum(case when grade = 'A' then 1.0 when grade = 'B' then 0.75 else 0 end) as
class_honors from test_scores;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Return more than one result.
select month, year, sum(purchase_price) from store_stats group by month, year;
-- Filter the input to eliminate duplicates before performing the calculation.
select sum(distinct x) from t1;
This example demonstrates that, because the return value of these aggregate functions is a STRING, you convert
the result with CAST if you need to do further calculations as a numeric value.
[localhost:21000] > create table score_stats as select cast(stddev(score) as
decimal(7,4)) `standard_deviation`, cast(variance(score) as decimal(7,4)) `variance`
from test_scores;
+-------------------+
| summary           |
+-------------------+
| Inserted 1 row(s) |
+-------------------+
[localhost:21000] > desc score_stats;
+--------------------+--------------+---------+
| name               | type         | comment |
+--------------------+--------------+---------+
| standard_deviation | decimal(7,4) |         |
| variance           | decimal(7,4) |         |
+--------------------+--------------+---------+
A user-defined aggregate function (UDAF) accepts a group of values and returns a single value. You use
UDAFs to summarize and condense sets of rows, in the same style as the built-in COUNT, MAX(), SUM(), and
AVG() functions. When called in a query that uses the GROUP BY clause, the function is called once for each
combination of GROUP BY values. For example:
-- Evaluates multiple rows but returns a single value.
select closest_restaurant(latitude, longitude) from places;
-- Evaluates batches of rows and returns a separate value for each batch.
select most_profitable_location(store_id, sales, expenses, tax_rate, depreciation)
from franchise_data group by year;
Currently, Impala does not support other categories of user-defined functions, such as user-defined table
functions (UDTFs) or window functions.
You can find the sample files mentioned here in the Impala github repo.
Installing the UDF Development Package
To develop UDFs for Impala, download and install the impala-udf-devel package containing header files,
sample source, and build configuration files. Start at http://archive.cloudera.com/impala/ and locate the
appropriate .repo or list file for your operating system version, such as the .repo file for RHEL 6. Use the
package manager for your operating system (for example, yum, zypper, or apt-get) to download and install
the package.
For the basic declarations needed to write a scalar UDF, see the header file udf-sample.h within the sample
build environment, which defines a simple function named AddUdf():
#ifndef IMPALA_UDF_SAMPLE_UDF_H
#define IMPALA_UDF_SAMPLE_UDF_H
#include <impala_udf/udf.h>
using namespace impala_udf;
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2);
#endif
For sample C++ code for a simple function named AddUdf(), see the source file udf-sample.cc within the
sample build environment:
#include "udf-sample.h"
// In this sample we are declaring a UDF that adds two ints and returns an int.
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2) {
if (arg1.is_null || arg2.is_null) return IntVal::null();
return IntVal(arg1.val + arg2.val);
}
// Multiple UDFs can be defined in the same file
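Once the shared library is built and copied to HDFS, as shown later in this section, a CREATE FUNCTION
statement along these lines would make AddUdf() callable from SQL; the SQL function name add_udf is an
illustrative choice, and the HDFS path assumes the library was copied as in the later example:
create function add_udf(int, int) returns int
  location '/user/hive/udfs/libudfsample.so' symbol='AddUdf';
select add_udf(1, 2);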
The call from the SQL query must pass at least one argument to the variable-length portion of the argument
list.
When Impala calls the function, it fills in the initial set of required arguments, then passes the number of extra
arguments and a pointer to the first of those optional arguments.
Currently, only THREAD_SCOPE is implemented, not FRAGMENT_SCOPE. See udf.h for details about the scope
values.
For a serious problem that requires cancelling the query, a UDF can set an error flag that prevents the query
from returning any results. The signature for this function is:
void SetError(const char* error_msg);
Then, unpack the sample code in udf_samples.tar.gz and use that as a template to set up your build
environment.
To build the original samples:
# Process CMakeLists.txt and set up appropriate Makefiles.
cmake .
# Generate shared libraries from UDF and UDAF sample code,
# udf_samples/libudfsample.so and udf_samples/libudasample.so
make
The sample code to examine, experiment with, and adapt is in these files:
udf-sample.h: Header file that declares the signature for a scalar UDF (AddUDF).
udf-sample.cc: Sample source for a simple UDF that adds two integers. Because Impala can reference
multiple function entry points from the same shared library, you could add other UDF functions in this file
and add their signatures to the corresponding header file.
udf-sample-test.cc: Basic unit tests for the sample UDF.
uda-sample.h: Header file that declares the signature for sample aggregate functions. The SQL functions
will be called COUNT, AVG, and STRINGCONCAT. Because aggregate functions require more elaborate coding
to handle the processing for multiple phases, there are several underlying C++ functions such as CountInit,
AvgUpdate, and StringConcatFinalize.
uda-sample.cc: Sample source for simple UDAFs that demonstrate how to manage the state transitions
as the underlying functions are called during the different phases of query processing.
The UDAF that imitates the COUNT function keeps track of a single incrementing number; the merge
functions combine the intermediate count values from each Impala node, and the combined number is
returned verbatim by the finalize function.
The UDAF that imitates the AVG function keeps track of two numbers, a count of rows processed and the
sum of values for a column. These numbers are updated and merged as with COUNT, then the finalize
function divides them to produce and return the final average value.
The UDAF that concatenates string values into a comma-separated list demonstrates how to manage
storage for a string that increases in length as the function is called for multiple rows.
uda-sample-test.cc: basic unit tests for the sample UDAFs.
We build a shared library, libudfsample.so, and put the library file into HDFS where Impala can read it:
$ make
[ 0%] Generating udf_samples/uda-sample.ll
[ 16%] Built target uda-sample-ir
[ 33%] Built target udasample
[ 50%] Built target uda-sample-test
[ 50%] Generating udf_samples/udf-sample.ll
[ 66%] Built target udf-sample-ir
Scanning dependencies of target udfsample
[ 83%] Building CXX object CMakeFiles/udfsample.dir/udf-sample.o
Linking CXX shared library udf_samples/libudfsample.so
[ 83%] Built target udfsample
Linking CXX executable udf_samples/udf-sample-test
[100%] Built target udf-sample-test
$ hdfs dfs -put ./udf_samples/libudfsample.so /user/hive/udfs/libudfsample.so
Finally, we go into the impala-shell interpreter where we set up some sample data, issue CREATE FUNCTION
statements to set up the SQL function names, and call the functions in some queries:
[localhost:21000] > create database udf_testing;
[localhost:21000] > use udf_testing;
[localhost:21000] > create function has_vowels (string) returns boolean location
'/user/hive/udfs/libudfsample.so' symbol='HasVowels';
[localhost:21000] > select has_vowels('abc');
+------------------------+
| udfs.has_vowels('abc') |
+------------------------+
| true                   |
+------------------------+
Returned 1 row(s) in 0.13s
[localhost:21000] > select has_vowels('zxcvbnm');
+----------------------------+
| udfs.has_vowels('zxcvbnm') |
+----------------------------+
| false                      |
+----------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select has_vowels(null);
+-----------------------+
| udfs.has_vowels(null) |
+-----------------------+
| NULL                  |
+-----------------------+
We add the function bodies to a C++ source file (in this case, uda-sample.cc):
// Init functions: set the starting value for each aggregation.
void SumOfSquaresInit(FunctionContext* context, BigIntVal* val) {
  val->is_null = false;
  val->val = 0;
}
void SumOfSquaresInit(FunctionContext* context, DoubleVal* val) {
  val->is_null = false;
  val->val = 0.0;
}
// Update functions: add the square of each non-NULL input value to the running total.
void SumOfSquaresUpdate(FunctionContext* context, const BigIntVal& input, BigIntVal* val) {
  if (input.is_null) return;
  val->val += input.val * input.val;
}
void SumOfSquaresUpdate(FunctionContext* context, const DoubleVal& input, DoubleVal* val) {
  if (input.is_null) return;
  val->val += input.val * input.val;
}
// Merge functions: combine the partial totals computed on different nodes.
void SumOfSquaresMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst) {
  dst->val += src.val;
}
void SumOfSquaresMerge(FunctionContext* context, const DoubleVal& src, DoubleVal* dst) {
  dst->val += src.val;
}
// Finalize functions: return the accumulated total as the result.
BigIntVal SumOfSquaresFinalize(FunctionContext* context, const BigIntVal& val) {
  return val;
}
DoubleVal SumOfSquaresFinalize(FunctionContext* context, const DoubleVal& val) {
  return val;
}
To create the SQL function, we issue a CREATE AGGREGATE FUNCTION statement and specify the underlying
C++ function names for the different phases:
[localhost:21000] > use udf_testing;
[localhost:21000] > create table sos (x bigint, y double);
[localhost:21000] > insert into sos values (1, 1.1), (2, 2.2), (3, 3.3), (4, 4.4);
Inserted 4 rows in 1.10s
[localhost:21000] > create aggregate function sum_of_squares(bigint) returns bigint
> location '/user/hive/udfs/libudasample.so'
> init_fn='SumOfSquaresInit'
> update_fn='SumOfSquaresUpdate'
> merge_fn='SumOfSquaresMerge'
> finalize_fn='SumOfSquaresFinalize';
[localhost:21000] > -- Compute the same value using literals or the UDA;
[localhost:21000] > select 1*1 + 2*2 + 3*3 + 4*4;
+-------------------------------+
| 1 * 1 + 2 * 2 + 3 * 3 + 4 * 4 |
+-------------------------------+
| 30                            |
+-------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select sum_of_squares(x) from sos;
+------------------------+
| udfs.sum_of_squares(x) |
+------------------------+
| 30                     |
+------------------------+
Returned 1 row(s) in 0.35s
Until we create the overloaded version of the UDA, it can only handle a single data type. To allow it to handle
DOUBLE as well as BIGINT, we issue another CREATE AGGREGATE FUNCTION statement:
[localhost:21000] > select sum_of_squares(y) from sos;
ERROR: AnalysisException: No matching function with signature:
udfs.sum_of_squares(DOUBLE).
[localhost:21000] > create aggregate function sum_of_squares(double) returns double
> location '/user/hive/udfs/libudasample.so'
> init_fn='SumOfSquaresInit'
> update_fn='SumOfSquaresUpdate'
> merge_fn='SumOfSquaresMerge'
> finalize_fn='SumOfSquaresFinalize';
[localhost:21000] > -- Compute the same value using literals or the UDA;
[localhost:21000] > select 1.1*1.1 + 2.2*2.2 + 3.3*3.3 + 4.4*4.4;
+-----------------------------------------------+
| 1.1 * 1.1 + 2.2 * 2.2 + 3.3 * 3.3 + 4.4 * 4.4 |
+-----------------------------------------------+
Typically, you use a UDA in queries with GROUP BY clauses, to produce a result set with a separate aggregate
value for each combination of values from the GROUP BY clause. Let's change our sample table to use 0 to
indicate rows containing even values, and 1 to flag rows containing odd values. Then the GROUP BY query can
return two values, the sum of the squares for the even values, and the sum of the squares for the odd values:
[localhost:21000] > insert overwrite sos values (1, 1), (2, 0), (3, 1), (4, 0);
Inserted 4 rows in 1.24s
[localhost:21000] > -- Compute 1 squared + 3 squared, and 2 squared + 4 squared;
[localhost:21000] > select y, sum_of_squares(x) from sos group by y;
+---+------------------------+
| y | udfs.sum_of_squares(x) |
+---+------------------------+
| 1 | 10                     |
| 0 | 20                     |
+---+------------------------+
Returned 2 row(s) in 0.43s
User-defined functions (UDFs) are supported starting in Impala 1.2. See User-Defined Functions (UDFs) on page
163 for full details on Impala UDFs.
Impala supports high-performance UDFs written in C++, as well as reusing some Java-based Hive UDFs.
Impala supports scalar UDFs and user-defined aggregate functions (UDAFs). Impala does not currently
support user-defined table generating functions (UDTFs).
Only Impala-supported column types are supported in Java-based UDFs.
Impala does not currently support these HiveQL statements:
For YEAR columns, change to the smallest Impala integer type that has sufficient range. See Data Types on
page 49 for details about ranges, casting, and so on for the various numeric data types.
Impala does not support notation such as b'0101' for bit literals.
For BLOB values, use STRING to represent CLOB or TEXT types (character based large objects) up to 32 KB in
size. Binary large objects such as BLOB, RAW BINARY, and VARBINARY do not currently have an equivalent in
Impala.
For Boolean-like types such as BOOL, use the Impala BOOLEAN type.
Because Impala currently does not support composite or nested types, any spatial data types in other
database systems do not have direct equivalents in Impala. You could represent spatial values in string
format and write UDFs to process them. See User-Defined Functions (UDFs) on page 163 for details. Where
practical, separate spatial types into separate tables so that Impala can still work with the non-spatial data.
Take out any DEFAULT clauses. Impala can use data files produced from many different sources, such as Pig,
Hive, or MapReduce jobs. The fast import mechanisms of LOAD DATA and external tables mean that Impala
is flexible about the format of data files, and Impala does not necessarily validate or cleanse data before
querying it. When copying data through Impala INSERT statements, you can use conditional functions such
as CASE or NVL to substitute some other value for NULL fields; see Conditional Functions on page 151 for
details.
For any other type not supported in Impala, you could represent their values in string format and write UDFs
to process them. See User-Defined Functions (UDFs) on page 163 for details.
To detect the presence of unsupported or unconvertable data types in data files, do initial testing with the
ABORT_ON_ERROR=true query option in effect. This option causes queries to fail immediately if they encounter
disallowed type conversions. See ABORT_ON_ERROR on page 193 for details. For example:
set abort_on_error=true;
select count(*) from (select * from t1);
-- The above query will fail if the data files for T1 contain any
-- values that can't be converted to the expected Impala data types.
-- For example, if T1.C1 is defined as INT but the column contains
-- floating-point values like 1.1, the query will return an error.
When an alias is declared for an expression in a query, that alias cannot be referenced again within the same
query block:
-- Can't reference AVERAGE twice in the SELECT list where it's defined.
select avg(x) as average, average+1 from t1 group by x;
ERROR: AnalysisException: couldn't resolve column reference: 'average'
-- Although it can be referenced again later in the same query.
select avg(x) as average from t1 group by x having average > 3;
For Impala, either repeat the expression again, or abstract the expression into a WITH clause, creating named
columns that can be referenced multiple times anywhere in the base query:
-- The following 2 query forms are equivalent.
select avg(x) as average, avg(x)+1 from t1 group by x;
with avg_t as (select avg(x) average from t1 group by x) select average, average+1
from avg_t;
Impala does not support certain rarely used join types that are less appropriate for high-volume tables used
for data warehousing. In some cases, Impala supports join types but requires explicit syntax to ensure you
do not do inefficient joins of huge tables by accident. For example, Impala does not support natural joins or
anti-joins, and requires the CROSS JOIN operator for Cartesian products. See Joins on page 119 for details
on the syntax for Impala join clauses.
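For example, a minimal sketch of the explicit CROSS JOIN syntax; the second table t2 and its column y are
hypothetical:
-- A Cartesian product requires the explicit CROSS JOIN operator in Impala.
select a.x, b.y from t1 a cross join t2 b;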
Impala has a limited choice of partitioning types. Partitions are defined based on each distinct combination
of values for one or more partition key columns. Impala does not redistribute or check data to create evenly
distributed partitions; you must choose partition key columns based on your knowledge of the data volume
and distribution. Adapt any tables that use range, list, hash, or key partitioning to use the Impala partition
syntax for CREATE TABLE and ALTER TABLE statements. Impala partitioning is similar to range partitioning
where every range has exactly one value, or key partitioning where the hash function produces a separate
bucket for every combination of key values. See Partitioning on page 233 for usage details, and CREATE TABLE
Statement on page 90 and ALTER TABLE Statement on page 79 for syntax.
-B or --delimited
--print_header
-o filename or --output_file
filename
Stores all query results in the specified file. Typically used to store the
results of a single query issued from the command line with the -q option.
Also works for interactive sessions; you see the messages such as number
of rows fetched, but not the actual result set. To suppress these incidental
messages when combining the -q and -o options, redirect stderr to
/dev/null. Added in Impala 1.0.1.
--output_delimiter=character
-p or --show_profiles
-h or --help
-i hostname or
--impalad=hostname
Connects to the impalad daemon on the specified host. The default port
of 21000 is assumed unless you provide another value. You can connect
to any host in your cluster that is running impalad. If you connect to an
instance of impalad that was started with an alternate port specified by
the --fe_port flag, provide that alternative port.
-q query or --query=query
Passes a query or other shell command from the command line. The shell
immediately exits after processing the statement. It is limited to a single
statement, which could be a SELECT, CREATE TABLE, SHOW TABLES, or
any other statement recognized in impala-shell. Because you cannot
pass a USE statement and another query, fully qualify the names for any
tables outside the default database. (Or use the -f option to pass a file
with a USE statement followed by other queries.)
-f query_file or
--query_file=query_file
Passes a SQL query from a file. Files must be semicolon (;) delimited.
-k or --kerberos
-s kerberos_service_name or
--kerberos_service_name=name
-V or --verbose
--quiet
-v or --version
-c
-r or --refresh_after_connect
-d default_db or
--database=default_db
-ssl
--ca_cert
-l
-u
--strict_unicode
Causes the shell to ignore invalid Unicode code points in input strings.
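As an example of combining several of the options above in a single invocation; the database name sales,
the table customers, and the output file name are hypothetical:
$ impala-shell -i impalad-host -d sales -B --output_delimiter=',' -q 'select count(*) from customers' -o count.txt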
Note: Replace impalad-host with the host name you have configured for any DataNode running
Impala in your environment. The changed prompt indicates a successful connection.
help
use
history
insert
version
quit
refresh
select
Note: Commands must be terminated by a semi-colon. A command can span multiple lines.
For example:
[impalad-host:21000] > select * from alltypessmall limit 5
Query: select * from alltypessmall limit 5
Query finished, fetching results ...
2009  3  50  true   0  0  0  0   0                  0     03/01/09  0  2009-03-01 00:00:00
2009  3  51  false  1  1  1  10  1.100000023841858  10.1  03/01/09  1  2009-03-01 00:01:00
2009  3  52  true   2  2  2  20  2.200000047683716  20.2  03/01/09  2  2009-03-01 00:02:00.100000000
2009  3  53  false  3  3  3  30  3.299999952316284  30.3  03/01/09  3  2009-03-01 00:03:00.300000000
2009  3  54  true   4  4  4  40  4.400000095367432  40.4  03/01/09  4  2009-03-01 00:04:00.600000000
[impalad-host:21000] >
alter
Changes the underlying structure or settings of an Impala table, or a table shared between
Impala and Hive. See ALTER TABLE Statement on page 79 and ALTER VIEW Statement
on page 83 for details.
compute stats
connect
Connects to the specified instance of impalad. The default port of 21000 is assumed
unless you provide another value. You can connect to any host in your cluster that is
running impalad. If you connect to an instance of impalad that was started with an
alternate port specified by the --fe_port flag, you must provide that alternate port.
See Connecting to impalad through impala-shell on page 189 for examples.
The SET command has no effect until the impala-shell interpreter is connected to an
Impala server. Once you are connected, any query options you set remain in effect as
you issue subsequent CONNECT commands to connect to different Impala servers.
describe
Shows the columns, column data types, and any column comments for a specified table.
DESCRIBE FORMATTED shows additional information such as the HDFS data directory,
partitions, and internal properties for the table. See DESCRIBE Statement on page 96
for details about the basic DESCRIBE output and the DESCRIBE FORMATTED variant. You
can use DESC as shorthand for the DESCRIBE command.
drop
Removes a schema object, and in some cases its associated data files. See DROP TABLE
Statement on page 101, DROP VIEW Statement on page 102, DROP DATABASE Statement
on page 100, and DROP FUNCTION Statement on page 101 for details.
explain
Provides the execution plan for a query. EXPLAIN represents a query as a series of steps.
For example, these steps might be map/reduce stages, metastore operations, or file
system operations such as move or rename. See EXPLAIN Statement on page 103 and
Using the EXPLAIN Plan for Performance Tuning on page 224 for details.
help
history
insert
Writes the results of a query to a specified table. This either overwrites table data or
appends data to the existing table content. See INSERT Statement on page 105 for details.
invalidate
metadata
Updates impalad metadata for table existence and structure. Use this command after
creating, dropping, or altering databases, tables, or partitions in Hive. See INVALIDATE
METADATA Statement on page 111 for details.
profile
Displays low-level information about the most recent query. Used for performance
diagnosis and tuning. The report starts with the same information as produced by the
EXPLAIN statement and the SUMMARY command. See Using the Query Profile for
Performance Tuning on page 226 for details.
quit
Exits the shell. Remember to include the final semicolon so that the shell recognizes
the end of the command.
refresh
Refreshes impalad metadata for the locations of HDFS blocks corresponding to Impala
data files. Use this command after loading new data files into an Impala table through
Hive or through HDFS commands. See REFRESH Statement on page 116 for details.
select
Specifies the data set on which to complete some action. All information returned from
select can be sent to some output such as the console or a file, or can be used to complete some other
element of the query. See SELECT Statement on page 118 for details.
set
Manages query options for an impala-shell session. The available options are the
ones listed in Query Options for the SET Command on page 192. These options are used
for query tuning and troubleshooting. Issue SET with no arguments to see the current
query options, either based on the impalad defaults, as specified by you at impalad
startup, or based on earlier SET commands in the same session. To modify option values,
issue commands with the syntax set option=value. To restore an option to its default,
use the unset command. Some options take Boolean values of true and false. Others
take numeric arguments, or quoted string values.
The SET command has no effect until the impala-shell interpreter is connected to an
Impala server. Once you are connected, any query options you set remain in effect as
you issue subsequent CONNECT commands to connect to different Impala servers.
shell
Executes the specified command in the operating system shell without exiting
impala-shell. You can use the ! character as shorthand for the shell command.
Note: Quote any instances of the -- or /* tokens to avoid them being
interpreted as the start of a comment. To embed comments within source or
! commands, use the shell comment character # before the comment portion
of the line.
show
Displays metastore data for schema objects created and accessed through Impala, Hive,
or both. show can be used to gather information about databases or tables by following
the show command with one of those choices. See SHOW Statement on page 135 for
details.
summary
unset
Removes any user-specified value for a query option and returns the option to its default
value. See Query Options for the SET Command on page 192 for the available query
options.
use
Indicates the database against which to execute subsequent commands. Lets you avoid
using fully qualified names when referring to tables in databases other than default.
See USE Statement on page 138 for details. Not effective with the -q option, because
that option only allows a single statement in the argument.
version
ABORT_ON_DEFAULT_LIMIT_EXCEEDED
Now that the ORDER BY clause no longer requires an accompanying LIMIT clause in Impala 1.4.0 and higher,
this query option is deprecated and has no effect.
ABORT_ON_ERROR
When this option is enabled, Impala cancels a query immediately when any of the nodes encounters an error,
rather than continuing and possibly returning incomplete results. This option is disabled by default, to help you
gather maximum diagnostic information when an error occurs, for example, whether the same problem occurred
on all nodes or only a single node. Currently, the errors that Impala can skip over involve data corruption, such
as a column that contains a string value when expected to contain an integer value.
To control how much logging Impala does for non-fatal errors when ABORT_ON_ERROR is turned off, use the
MAX_ERRORS option.
Type: BOOLEAN
Default: false (shown as 0 in output of SET command)
ALLOW_UNSUPPORTED_FORMATS
An obsolete query option from early work on support for file formats. Do not use. Might be removed in the future.
Type: BOOLEAN
Default: false (shown as 0 in output of SET command)
BATCH_SIZE
Number of rows evaluated at a time by SQL operators. Unspecified or a size of 0 uses a predefined default size.
Primarily for Cloudera testing.
Default: 0 (meaning 1024)
DEBUG_ACTION
Introduces artificial problem conditions within queries. For internal Cloudera debugging and troubleshooting.
Type: STRING
Default: empty string
DEFAULT_ORDER_BY_LIMIT
Now that the ORDER BY clause no longer requires an accompanying LIMIT clause in Impala 1.4.0 and higher,
this query option is deprecated and has no effect.
Prior to Impala 1.4.0, Impala queries that use the ORDER BY clause must also include a LIMIT clause, to avoid
accidentally producing huge result sets that must be sorted. Sorting a huge result set is a memory-intensive
operation. In Impala 1.4.0 and higher, Impala uses a temporary disk work area to perform the sort if that operation
would otherwise exceed the Impala memory limit on a particular host.
Default: -1 (no default limit)
DISABLE_CODEGEN
This is a debug option, intended for diagnosing and working around issues that cause crashes. If a query fails
with an illegal instruction or other hardware-specific message, try setting DISABLE_CODEGEN=true and running
the query again. If the query succeeds only when the DISABLE_CODEGEN option is turned on, submit the problem
to Cloudera support and include that detail in the problem report. Do not otherwise run with this setting turned
on, because it results in lower overall performance.
EXPLAIN_LEVEL
Controls the amount of detail provided in the output of the EXPLAIN statement. The basic output can help you
identify high-level performance issues such as scanning a higher volume of data or more partitions than you
expect. The higher levels of detail show how intermediate results flow between nodes and how different SQL
operations such as ORDER BY, GROUP BY, joins, and WHERE clauses are implemented within a distributed query.
Type: STRING or INT
Default: 1 (might be incorrectly reported as 0 in output of SET command)
Arguments:
The allowed range of numeric values for this option is 0 to 3:
0 or MINIMAL: A barebones list, one line per operation. Primarily useful for checking the join order in very long
queries where the regular EXPLAIN output is too long to read easily.
1 or STANDARD: The default level of detail, showing the logical way that work is split up for the distributed
query.
2 or EXTENDED: Includes additional detail about how the query planner uses statistics in its decision-making
process, to understand how a query could be tuned by gathering statistics, using query hints, adding or
removing predicates, and so on.
3 or VERBOSE: The maximum level of detail, showing how work is split up within each node into query
fragments that are connected in a pipeline. This extra detail is primarily useful for low-level performance
testing and tuning within Impala itself, rather than for rewriting the SQL code at the user level.
Note: Prior to Impala 1.3, the allowed argument range for EXPLAIN_LEVEL was 0 to 1: level 0 had the
mnemonic NORMAL, and level 1 was VERBOSE. In Impala 1.3 and higher, NORMAL is not a valid mnemonic
value, and VERBOSE still applies to the highest level of detail but now corresponds to level 3. You
might need to adjust the values if you have any older impala-shell script files that set the
EXPLAIN_LEVEL query option.
Changing the value of this option controls the amount of detail in the output of the EXPLAIN statement. The
extended information from level 2 or 3 is especially useful during performance tuning, when you need to confirm
whether the work for the query is distributed the way you expect, particularly for the most resource-intensive
operations such as join queries against large tables, queries against tables with large numbers of partitions,
and insert operations for Parquet tables. The extended information also helps to check estimated resource
usage when you use the admission control or resource management features explained in Impala Administration
on page 33. See EXPLAIN Statement on page 103 for the syntax of the EXPLAIN statement, and Using the EXPLAIN
Plan for Performance Tuning on page 224 for details about how to use the extended information.
Usage notes:
As always, read the EXPLAIN output from bottom to top. The lowest lines represent the initial work of the query
(scanning data files), the lines in the middle represent calculations done on each node and how intermediate
results are transmitted from one node to another, and the topmost lines represent the final results being sent
back to the coordinator node.
The numbers in the left column are generated internally during the initial planning phase and do not represent
the actual order of operations, so it is not significant if they appear out of order in the EXPLAIN output.
As the warning message demonstrates, most of the information needed for Impala to do efficient query planning,
and for you to understand the performance characteristics of the query, requires running the COMPUTE STATS
statement for the table:
[localhost:21000] > compute stats t1;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
[localhost:21000] > explain select * from t1;
+--------------------------------------------------------------------------+
| Explain String                                                           |
+--------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0   |
|                                                                          |
| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED]                              |
|   01:EXCHANGE [PARTITION=UNPARTITIONED]                                  |
|      hosts=0 per-host-mem=unavailable                                    |
|      tuple-ids=0 row-size=20B cardinality=0                              |
|                                                                          |
| F00:PLAN FRAGMENT [PARTITION=RANDOM]                                     |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED]   |
|   00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM]                       |
|      partitions=1/1 size=0B                                              |
|      table stats: 0 rows total                                           |
|      column stats: all                                                   |
|      hosts=0 per-host-mem=0B                                             |
|      tuple-ids=0 row-size=20B cardinality=0                              |
+--------------------------------------------------------------------------+
Joins and other complicated, multi-part queries are the ones where you most commonly need to examine the
EXPLAIN output and customize the amount of detail in the output. This example shows the default EXPLAIN
output for a three-way join query, then the equivalent output with a [SHUFFLE] hint to change the join mechanism
between the first two tables from a broadcast join to a shuffle join.
[localhost:21000] > set explain_level=1;
[localhost:21000] > explain select one.*, two.*, three.* from t1 one, t1 two, t1 three
where one.x = two.x and two.x = three.x;
+--------------------------------------------------------------------------+
| Explain String                                                           |
+--------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3                  |
|                                                                          |
| 07:EXCHANGE [PARTITION=UNPARTITIONED]                                    |
| |                                                                        |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                                     |
| |  hash predicates: two.x = three.x                                      |
| |                                                                        |
| |--06:EXCHANGE [BROADCAST]                                               |
| |  |                                                                     |
| |  02:SCAN HDFS [explain_plan.t1 three]                                  |
| |     partitions=1/1 size=0B                                             |
For a join involving many different tables, the default EXPLAIN output might stretch over several pages, and the
only details you care about might be the join order and the mechanism (broadcast or shuffle) for joining each
pair of tables. In that case, you might set EXPLAIN_LEVEL to its lowest value of 0, to focus on just the join order
and join mechanism for each stage. The following example shows how the rows from the first and second joined
tables are hashed and divided among the nodes of the cluster for further filtering; then the entire contents of
the third table are broadcast to all nodes for the final stage of join processing.
[localhost:21000] > set explain_level=0;
[localhost:21000] > explain select one.*, two.*, three.* from t1 one join [shuffle] t1
two join t1 three where one.x = two.x and two.x = three.x;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3  |
|                                                          |
| 08:EXCHANGE [PARTITION=UNPARTITIONED]                    |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                     |
| |--07:EXCHANGE [BROADCAST]                               |
| |  02:SCAN HDFS [explain_plan.t1 three]                  |
| 03:HASH JOIN [INNER JOIN, PARTITIONED]                   |
| |--06:EXCHANGE [PARTITION=HASH(two.x)]                   |
| |  01:SCAN HDFS [explain_plan.t1 two]                    |
| 05:EXCHANGE [PARTITION=HASH(one.x)]                      |
| 00:SCAN HDFS [explain_plan.t1 one]                       |
+----------------------------------------------------------+
HBASE_CACHE_BLOCKS
Setting this option is equivalent to calling the setCacheBlocks method of the class
org.apache.hadoop.hbase.client.Scan, in an HBase Java application. Helps to control the memory pressure on
the HBase region server, in conjunction with the HBASE_CACHING query option. See HBASE_CACHING on page
199 for details.
Type: BOOLEAN
Default: false (shown as 0 in output of SET command)
HBASE_CACHING
Setting this option is equivalent to calling the setCaching method of the class
org.apache.hadoop.hbase.client.Scan, in an HBase Java application. Helps to control the memory pressure on
the HBase region server, in conjunction with the HBASE_CACHE_BLOCKS query option. See HBASE_CACHE_BLOCKS
on page 199 for details.
Type: INT
Default: 0
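For example, a sketch of adjusting both HBase-related options before scanning an Impala table mapped to
HBase; the table name hbase_table is hypothetical, and a useful caching value depends on your row size
and available memory:
-- For a large scan, avoid filling the region server block cache and fetch rows in bigger batches.
set hbase_cache_blocks=false;
set hbase_caching=1000;
select count(*) from hbase_table;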
MAX_ERRORS
Maximum number of non-fatal errors for any particular query that are recorded in the Impala log file. For example,
if a billion-row table had a non-fatal data error in every row, you could diagnose the problem without all billion
errors being logged. Unspecified or 0 indicates the built-in default value of 1000.
This option only controls how many errors are reported. To specify whether Impala continues or halts when it
encounters such errors, use the ABORT_ON_ERROR option.
Default: 0 (meaning 1000 errors)
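For example, a sketch that logs up to 100 non-fatal errors while letting the query continue, reusing the t1
table from other examples in this document:
set abort_on_error=false;
set max_errors=100;
select count(*) from t1;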
MAX_SCAN_RANGE_LENGTH
Maximum length of the scan range. Interacts with the number of HDFS blocks in the table to determine how
many CPU cores across the cluster are involved with the processing for a query. (Each core processes one scan
range.)
Lowering the value can sometimes increase parallelism if you have unused CPU capacity, but a too-small value
can limit query performance because each scan range involves extra overhead.
Only applicable to HDFS tables. Has no effect on Parquet tables. Unspecified or 0 indicates backend default,
which is the same as the HDFS block size for each table, typically several megabytes for most file formats, or 1
GB for Parquet tables.
Although the scan range can be arbitrarily long, Impala internally uses an 8 MB read buffer so that it can query
tables with huge block sizes without allocating equivalent blocks of memory.
Default: 0
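For example, to experiment with smaller scan ranges of roughly 16 MB on a non-Parquet table, you might try a
setting such as the following. The value is purely illustrative; measure query times before and after to confirm
any benefit:
[localhost:21000] > set max_scan_range_length=16777216;
[localhost:21000] > select count(*) from text_table;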
MEM_LIMIT
When resource management is not enabled, defines the maximum amount of memory a query can allocate on
each node. If query processing exceeds the specified memory limit on any node, Impala cancels the query
automatically. Memory limits are checked periodically during query processing, so the actual memory in use
might briefly exceed the limit without the query being cancelled.
When resource management is enabled in CDH 5, the mechanism for this option changes. If set, it overrides the
automatic memory estimate from Impala. Impala requests this amount of memory from YARN on each node,
and the query does not proceed until that much memory is available. The actual memory used by the query
could be lower, since some queries use much less memory than others. With resource management, the
MEM_LIMIT setting acts both as a hard limit on the amount of memory a query can use on any node (enforced
by YARN) and as a guarantee that that much memory will be available on each node while the query is being
executed. When resource management is enabled but no MEM_LIMIT setting is specified, Impala estimates the
amount of memory needed on each node for each query, requests that much memory from YARN before starting
the query, and then internally sets the MEM_LIMIT on each node to the requested amount of memory during
the query. Thus, if the query takes more memory than was originally estimated, Impala detects that the MEM_LIMIT
is exceeded and cancels the query itself.
Default: 0
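For example, to cap the memory for queries in the current session at roughly 2 GB per node, you might issue a
statement such as the following (the value, specified in bytes, and the table name big are only illustrative):
[localhost:21000] > set mem_limit=2000000000;
[localhost:21000] > select count(distinct val) from big;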
NUM_NODES
Limit the number of nodes that process a query, typically during debugging. Only accepts the values 0 (meaning
all nodes) or 1 (meaning all work is done on the coordinator node). If you are diagnosing a problem that you
suspect is due to a timing issue in distributed query processing, you can set NUM_NODES=1 to verify whether the
problem still occurs when all the work is done on a single node.
You might set the NUM_NODES option to 1 briefly, during INSERT or CREATE TABLE AS SELECT statements.
Normally, those statements produce one or more data files per data node. If the write operation involves small
amounts of data, a Parquet table, and/or a partitioned table, the default behavior could produce many small
files when intuitively you might expect only a single output file. SET NUM_NODES=1 turns off the distributed
aspect of the write operation, making it more likely to produce only one or a few data files.
Default: 0
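For example, here is a sketch of using NUM_NODES around a small INSERT and then restoring the default behavior
(the table and column names are hypothetical):
[localhost:21000] > set num_nodes=1;
[localhost:21000] > insert into sales_parquet partition (year=2014) select c1, c2 from staging_table;
[localhost:21000] > set num_nodes=0;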
PARQUET_COMPRESSION_CODEC
When Impala writes Parquet data files using the INSERT statement, the underlying compression is controlled
by the PARQUET_COMPRESSION_CODEC query option. The allowed values for this query option are SNAPPY (the
default), GZIP, and NONE. The option value is not case-sensitive. See Snappy and GZip Compression for Parquet
Data Files on page 249 for details and examples.
If the option is set to an unrecognized value, all kinds of queries will fail due to the invalid option setting, not
just queries involving Parquet tables.
Default: SNAPPY
Related information:
For information about the Parquet file format, and how compressing the data files affects query performance,
see Using the Parquet File Format with Impala Tables on page 246.
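For example, to write one set of Parquet data files with GZIP compression and then return to the default codec,
you might do something like the following (the table names mirror the PARQUET_FILE_SIZE example below):
set PARQUET_COMPRESSION_CODEC=gzip;
INSERT INTO parquet_table SELECT * FROM text_table;
set PARQUET_COMPRESSION_CODEC=snappy;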
PARQUET_FILE_SIZE
Specifies the maximum size of each Parquet data file produced by Impala INSERT statements. For small or
partitioned tables where the default Parquet block size of 1 GB is much larger than needed for each data file,
you can increase parallelism by specifying a smaller size, resulting in more HDFS blocks that can be processed
by different nodes. Reducing the file size also reduces the memory required to buffer each block before writing
it to disk.
Specify the size in bytes, for example:
set PARQUET_FILE_SIZE=128000000;
INSERT INTO parquet_table SELECT * FROM text_table;
REQUEST_POOL
The pool or queue name that queries should be submitted to. Only applies when you enable the Impala admission
control feature (CDH 4 or CDH 5; see Admission Control and Query Queuing on page 33), or the YARN resource
management feature (CDH 5 only; see Using YARN Resource Management with Impala (CDH 5 Only) on page
41). Specifies the name of the pool used by requests from Impala to the resource manager.
Formerly known as YARN_POOL during the CDH 5 beta period. Renamed to reflect that it can be used both with
YARN and with the lightweight admission control feature introduced in Impala 1.3.
Default: empty (use the user-to-pool mapping defined by an impalad startup option in the Impala configuration
file)
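For example, to route the queries in a session to a particular pool (the pool and table names are placeholders
for whatever your admission control or YARN configuration defines):
[localhost:21000] > set request_pool=production_pool;
[localhost:21000] > select count(*) from sales;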
SUPPORT_START_OVER
Leave this setting false.
Default: false
SYNC_DDL
When enabled, causes any DDL operation such as CREATE TABLE or ALTER TABLE to return only when the
changes have been propagated to all other Impala nodes in the cluster by the Impala catalog service. That way,
if you issue a subsequent CONNECT statement in impala-shell to connect to a different node in the cluster,
you can be sure that the other node will already recognize any added or changed tables. (The catalog service
broadcasts the DDL changes to all nodes automatically, but without this option there could be a
period of inconsistency if you quickly switched to another node.)
Although INSERT is classified as a DML statement, when the SYNC_DDL option is enabled, INSERT statements
also delay their completion until all the underlying data and metadata changes are propagated to all Impala
nodes. Internally, Impala inserts have similarities with DDL statements in traditional database systems, because
they create metadata needed to track HDFS block locations for new files and they potentially add new partitions
to partitioned tables.
Note: Because this option can introduce a delay after each write operation, if you are running a
sequence of CREATE DATABASE, CREATE TABLE, ALTER TABLE, INSERT, and similar statements within
a setup script, to minimize the overall delay you can enable the SYNC_DDL query option only near the
end, before the final DDL statement.
Default: false
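For example, near the end of a setup script you might enable the option only for the final statement (a minimal
sketch; the table name is hypothetical):
[localhost:21000] > set sync_ddl=true;
[localhost:21000] > create table final_lookup_table (id bigint, name string);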
Gathering statistics for all the tables is straightforward, one COMPUTE STATS statement per table:
[localhost:21000] > compute stats small;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 3 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 4.26s
[localhost:21000] > compute stats medium;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 5 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 42.11s
[localhost:21000] > compute stats big;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 5 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 165.44s
With statistics in place, Impala can choose a more effective join order rather than following the left-to-right
sequence of tables in the query, and can choose BROADCAST or PARTITIONED join strategies based on the overall
sizes and number of rows in the table:
[localhost:21000] > explain select count(*) from medium join big where big.id =
medium.id;
Query: explain select count(*) from medium join big where big.id = medium.id
+-----------------------------------------------------------+
| Explain String                                             |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=937.23MB VCores=2  |
|                                                            |
| PLAN FRAGMENT 0                                            |
|   PARTITION: UNPARTITIONED                                 |
|                                                            |
|   6:AGGREGATE (merge finalize)                             |
|   |  output: SUM(COUNT(*))                                 |
|   |  cardinality: 1                                        |
|   |  per-host memory: unavailable                          |
|   |  tuple ids: 2                                          |
|   |                                                        |
|   5:EXCHANGE                                               |
|      cardinality: 1                                        |
|      per-host memory: unavailable                          |
|      tuple ids: 2                                          |
|                                                            |
| PLAN FRAGMENT 1                                            |
|   PARTITION: RANDOM                                        |
|                                                            |
|   STREAM DATA SINK                                         |
|     EXCHANGE ID: 5                                         |
|     UNPARTITIONED                                          |
|                                                            |
|   3:AGGREGATE                                              |
|   |  output: COUNT(*)                                      |
|   |  cardinality: 1                                        |
|   |  per-host memory: 10.00MB                              |
|   |  tuple ids: 2                                          |
When queries like these are actually run, the execution times are relatively consistent regardless of the table
order in the query text. Here are examples using both the unique ID column and the VAL column containing
duplicate values:
[localhost:21000] > select count(*) from big join small on (big.id = small.id);
Query: select count(*) from big join small on (big.id = small.id)
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
Returned 1 row(s) in 21.68s
[localhost:21000] > select count(*) from small join big on (big.id = small.id);
Query: select count(*) from small join big on (big.id = small.id)
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
Returned 1 row(s) in 20.45s
[localhost:21000] > select count(*) from big join small on (big.val = small.val);
+------------+
| count(*)   |
+------------+
| 2000948962 |
+------------+
Returned 1 row(s) in 108.85s
[localhost:21000] > select count(*) from small join big on (big.val = small.val);
+------------+
| count(*)   |
+------------+
| 2000948962 |
+------------+
Note: When examining the performance of join queries and the effectiveness of the join order
optimization, make sure the query involves enough data and cluster resources to see a difference
depending on the query plan. For example, a single data file of just a few megabytes will reside in a
single HDFS block and be processed on a single node. Likewise, if you use a single-node or two-node
cluster, there might not be much difference in efficiency for the broadcast or partitioned join strategies.
Table Statistics
The Impala query planner can make use of statistics about entire tables and partitions when that metadata is
available in the metastore database. This metadata is used on its own for certain optimizations, and used in
combination with column statistics for other optimizations.
To gather table statistics after loading data into a table or partition, use one of the following techniques:
Issue the statement COMPUTE STATS in Impala. This statement, new in Impala 1.2.2, is the preferred method
because:
It gathers table statistics and statistics for all partitions and columns in a single operation.
It does not rely on any special Hive settings, metastore configuration, or separate database to hold the
statistics.
If you need to adjust statistics incrementally for an existing table, such as after adding a partition or
inserting new data, you can use an ALTER TABLE statement such as:
alter table analysis_data set tblproperties('numRows'='new_value');
to update that one property rather than re-processing the whole table.
Load the data through the INSERT OVERWRITE statement in Hive, while the Hive setting hive.stats.autogather
is enabled.
Issue an ANALYZE TABLE statement in Hive, for the entire table or a specific partition.
ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE
STATISTICS [NOSCAN];
To gather statistics for a store table partitioned by state and county, across all of its partitions:
ANALYZE TABLE store PARTITION(s_state, s_county) COMPUTE STATISTICS;
To check that table statistics are available for a table, and see the details of those statistics, use the statement
SHOW TABLE STATS table_name. See SHOW Statement on page 135 for details.
If you use the Hive-based methods of gathering statistics, see the Hive wiki for information about the required
configuration on the Hive side. Cloudera recommends using the Impala COMPUTE STATS statement to avoid
potential configuration and scalability issues with the statistics-gathering process.
Column Statistics
The Impala query planner can make use of statistics about individual columns when that metadata is available
in the metastore database. This technique is most valuable for columns compared across tables in join queries,
to help estimate how many rows the query will retrieve from each table. Currently, Impala does not create this
metadata itself. Use the ANALYZE TABLE statement in the Hive shell to gather these statistics. (This statement
works from Hive whether you create the table in Impala or in Hive.)
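For example, a minimal Hive session might look like the following. The table and column names are hypothetical,
and the exact ANALYZE TABLE syntax accepted depends on your Hive version:
hive> ANALYZE TABLE t1 COMPUTE STATISTICS;
hive> ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS x, s;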
Note:
For column statistics to be effective in Impala, you also need to have table statistics for the applicable
tables, as described in Table Statistics on page 212. If you use the Impala COMPUTE STATS statement,
both table and column statistics are automatically gathered at the same time, for all columns in the
table.
Currently, the COMPUTE STATS statement under CDH 4 does not store any statistics for DECIMAL
columns. When Impala runs under CDH 5, which has better support for DECIMAL in the metastore
database, COMPUTE STATS does collect statistics for DECIMAL columns and Impala uses the statistics
to optimize query performance.
Note: Prior to Impala 1.4.0, COMPUTE STATS counted the number of NULL values in each column and
recorded that figure in the metastore database. Because Impala does not currently make use of the
NULL count during query planning, Impala 1.4.0 and higher speeds up the COMPUTE STATS statement
by skipping this NULL counting.
To check whether column statistics are available for a particular set of columns, use the SHOW COLUMN STATS
table_name statement, or check the extended EXPLAIN output for a query against that table that refers to
those columns. See SHOW Statement on page 135 and EXPLAIN Statement on page 103 for details.
In practice, the COMPUTE STATS statement should be fast enough that this technique is not needed. It is most
useful as a workaround in case of performance issues, where you might adjust the numRows value higher or
lower to produce the ideal join order.
The following example shows how statistics are represented for a partitioned table. In this case, we have set
up a table to hold the world's most trivial census data, a single STRING field, partitioned by a YEAR column. The
table statistics include a separate entry for each partition, plus final totals for the numeric fields. The column
statistics include some easily deducible facts for the partitioning column, such as the number of distinct values
(the number of partition subdirectories).
[localhost:21000] > describe census;
+------+----------+---------+
| name | type     | comment |
+------+----------+---------+
| name | string   |         |
| year | smallint |         |
+------+----------+---------+
Returned 2 row(s) in 0.02s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 0     | 1      | 22B  | TEXT    |
| 2012  | -1    | 1      | 22B  | TEXT    |
| 2013  | -1    | 1      | 231B | PARQUET |
| Total | 0     | 3      | 275B |         |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name   | STRING   | -1               | -1     | -1       | -1       |
| year   | SMALLINT | 7                | -1     | 2        | 2        |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s
The following example shows how the statistics are filled in by a COMPUTE STATS statement in Impala.
[localhost:21000] > compute stats census;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 2.16s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 4     | 1      | 22B  | TEXT    |
| 2012  | 4     | 1      | 22B  | TEXT    |
For examples showing how some queries work differently when statistics are available, see Examples of Join
Order Optimization on page 207. You can see how Impala executes a query differently in each case by observing
the EXPLAIN output before and after collecting statistics. Measure the before and after query times, and examine
the throughput numbers in the before and after SUMMARY or PROFILE output, to verify how much the improved plan
speeds up performance.
For details about the hdfs cacheadmin command, see the CDH documentation.
Once HDFS caching is enabled and one or more pools are available, see Enabling HDFS Caching for Impala Tables
and Partitions on page 218 for how to choose which Impala data to load into the HDFS cache. On the Impala side,
you specify the cache pool name defined by the hdfs cacheadmin command in the Impala DDL statements
that enable HDFS caching for a table or partition, such as CREATE TABLE ... CACHED IN pool or ALTER
TABLE ... SET CACHED IN pool.
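For example, assuming a cache pool named four_gig_pool was already set up with hdfs cacheadmin, the Impala
side might look like the following sketch (the table names are illustrative):
[localhost:21000] > create table cached_t (x int) cached in 'four_gig_pool';
[localhost:21000] > alter table census set cached in 'four_gig_pool';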
For queries involving smaller amounts of data, or in single-user workloads, you might not notice a significant
difference in query response time with or without HDFS caching. Even with HDFS caching turned off, the data
for the query might still be in the Linux OS buffer cache. The benefits become clearer as data volume increases,
and especially as the system processes more concurrent queries. HDFS caching improves the scalability of the
overall system. That is, it prevents query performance from declining when the workload outstrips the capacity
of the Linux OS cache.
SELECT considerations:
The Impala HDFS caching feature interacts with the SELECT statement and query performance as follows:
Impala automatically reads from memory any data that has been designated as cached and actually loaded
into the HDFS cache. (It could take some time after the initial request to fully populate the cache for a table
with large size or many partitions.) The speedup comes from two aspects: reading from RAM instead of disk,
and accessing the data straight from the cache area instead of copying from one RAM area to another. This
second aspect yields further performance improvement over the standard OS caching mechanism, which
still results in memory-to-memory copying of cached data.
For small amounts of data, the query speedup might not be noticeable in terms of wall clock time. The
performance might be roughly the same with HDFS caching turned on or off, due to recently used data being
held in the Linux OS cache. The difference is more pronounced with:
Data volumes (for all queries running concurrently) that exceed the size of the Linux OS cache.
A busy cluster running many concurrent queries, where the reduction in memory-to-memory copying
and overall memory usage during queries results in greater scalability and throughput.
Thus, to really exercise and benchmark this feature in a development environment, you might need to
simulate realistic workloads and concurrent queries that match your production environment.
One way to simulate a heavy workload on a lightly loaded system is to flush the OS buffer cache (on each
data node) between iterations of queries against the same tables or partitions:
$ sync
$ echo 1 > /proc/sys/vm/drop_caches
Impala queries take advantage of HDFS cached data regardless of whether the cache directive was issued
by Impala or externally through the hdfs cacheadmin command, for example for an external table where
the cached data files might be accessed by several different Hadoop components.
If your query returns a large result set, the time reported for the query could be dominated by the time needed
to print the results on the screen. To measure the time for the underlying query processing, query the COUNT()
of the big result set, which does all the same processing but only prints a single line to the screen.
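For example, rather than timing a query that prints millions of rows, you might wrap it in a COUNT() as in the
following sketch (the table and column names are hypothetical):
[localhost:21000] > select count(*) from (select c1, c2, c3 from huge_table where c4 > 100) t;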
2. After the query completes, review the contents of the Impala logs. You should find a recent message similar
to the following:
Total remote scan volume = 0
The presence of remote scans may indicate impalad is not running on the correct nodes. This can be because
some DataNodes do not have impalad running or it can be because the impalad instance that is starting the
query is unable to contact one or more of the impalad instances.
To understand the causes of this issue:
1. Connect to the debugging web server. By default, this server runs on port 25000. This page lists all impalad
instances running in your cluster. If there are fewer instances than you expect, this often indicates some
DataNodes are not running impalad. Ensure impalad is started on all DataNodes.
2. If you are using multi-homed hosts, ensure that the Impala daemon's hostname resolves to the interface
on which impalad is running. The hostname Impala is using is displayed when impalad starts. If you need
to explicitly set the hostname, use the --hostname flag.
3. Check that statestored is running as expected. Review the contents of the state store log to ensure all
instances of impalad are listed as having connected to the state store.
Reviewing Impala Logs
You can review the contents of the Impala logs for signs that short-circuit reads or block location tracking are
not functioning. Before checking logs, execute a simple query against a small HDFS dataset. Completing a query
task generates log messages using current settings. Information on starting Impala and executing queries can
be found in Starting Impala and Using the Impala Shell (impala-shell Command) on page 187. Information on
logging can be found in Using Impala Logging on page 275. Log messages and their interpretations are as follows:
Log Message                             Interpretation
Native checksumming is not enabled.
Notice how the longest initial phase of the query is measured in seconds (s), while later phases working on
smaller intermediate results are measured in milliseconds (ms) or even nanoseconds (ns).
Here is an example from a more complicated query, as it would appear in the PROFILE output:
Operator              #Hosts   Avg Time   Max Time    #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
---------------------------------------------------------------------------------------------------------------------
09:MERGING-EXCHANGE        1   79.738us   79.738us        5           5         0        -1.00 B  UNPARTITIONED
05:TOP-N                   3   84.693us   88.810us        5           5  12.00 KB       120.00 B
04:AGGREGATE               3    5.263ms    6.432ms        5           5  44.00 KB       10.00 MB  MERGE FINALIZE
08:AGGREGATE               3   16.659ms   27.444ms   52.52K     600.12K   3.20 MB       15.11 MB  MERGE
07:EXCHANGE                3    2.644ms      5.1ms   52.52K     600.12K         0              0  HASH(o_orderpriority)
03:AGGREGATE               3  342.913ms  966.291ms   52.52K     600.12K  10.80 MB
...                                        2s171ms  144.87K     600.12K  13.63 MB
...                                        8.692ms   57.22K      15.00K
...                                        1s978ms   57.22K      15.00K  24.21 MB
...                                        8s558ms    3.79M     600.12K  32.29 MB
Partitioning
By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically
dividing the data during loading, based on values from one or more columns, to speed up queries that test those
columns. For example, with a school_records table partitioned on a year column, there is a separate data
directory for each different year value, and all the data for that year is stored in a data file in that directory. A
query that includes a WHERE condition such as YEAR=1966, YEAR IN (1989,1999), or YEAR BETWEEN 1984
AND 1989 can examine only the data files from the appropriate directory or directories, greatly reducing the
amount of data to read and test.
See Attaching an External Partitioned Table to an HDFS Directory Structure on page 28 for an example that
illustrates the syntax for creating partitioned tables, the underlying directory structure in HDFS, and how to
attach a partitioned Impala external table to data files stored elsewhere in HDFS.
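For example, a table like the school_records table mentioned above could be declared with a statement similar
to the following sketch (the non-partition columns are hypothetical):
create table school_records (student_name string, grade tinyint) partitioned by (year int);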
Parquet is a popular format for partitioned Impala tables because it is well suited to handle huge data volumes.
See Query Performance for Impala Parquet Tables on page 248 for performance considerations for partitioned
Parquet tables.
See NULL on page 64 for details about how NULL values are represented in partitioned tables.
statements; you can replace the contents of a specific partition but you cannot append data to a specific
partition.
By default, if an INSERT statement creates any new subdirectories underneath a partitioned table, those
subdirectories are assigned default HDFS permissions for the impala user. To make each subdirectory have
the same permissions as its parent directory in HDFS, specify the --insert_inherit_permissions startup
option for the impalad daemon.
Although the syntax of the SELECT statement is the same whether or not the table is partitioned, the way
queries interact with partitioned tables can have a dramatic impact on performance and scalability. The
mechanism that lets queries skip certain partitions during a query is known as partition pruning; see Partition
Pruning for Queries on page 234 for details.
In Impala 1.4 and higher, there is a SHOW PARTITIONS statement that displays information about each
partition in a table. See SHOW Statement on page 135 for details.
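For example, with the partitioned census table used in the examples later in this section, you could list its
partitions like so:
[localhost:21000] > show partitions census;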
When you specify some partition key columns in an INSERT statement, but leave out the values, Impala determines
which partition to insert into. This technique is called dynamic partitioning:
insert into t1 partition(x, y='b') select c1, c2 from some_other_table;
-- Create new partition if necessary based on variable year, month, and day; insert a single value.
insert into weather partition (year, month, day) select 'cloudy',2014,4,21;
-- Create new partition if necessary for specified year and month but variable day; insert a single value.
insert into weather partition (year=2014, month=04, day) select 'sunny',22;
The more key columns you specify in the PARTITION clause, the fewer columns you need in the SELECT list. The
trailing columns in the SELECT list are substituted in order for the partition key columns with no specified value.
To check the effectiveness of partition pruning for a query, check the EXPLAIN output for the query before running
it. For example, this example shows a table with 3 partitions, where the query only reads 1 of them. The notation
#partitions=1/3 in the EXPLAIN plan confirms that Impala can do the appropriate partition pruning.
[localhost:21000] > insert into census partition (year=2010) values ('Smith'),('Jones');
[localhost:21000] > insert into census partition (year=2011) values
('Smith'),('Jones'),('Doe');
[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Doe');
[localhost:21000] > select name from census where year=2010;
+-------+
| name |
+-------+
| Smith |
| Jones |
+-------+
[localhost:21000] > explain select name from census where year=2010;
+------------------------------------------------------------------+
| Explain String                                                    |
+------------------------------------------------------------------+
| PLAN FRAGMENT 0                                                   |
|   PARTITION: UNPARTITIONED                                        |
|                                                                   |
|   1:EXCHANGE                                                      |
|                                                                   |
| PLAN FRAGMENT 1                                                   |
|   PARTITION: RANDOM                                               |
|                                                                   |
|   STREAM DATA SINK                                                |
|     EXCHANGE ID: 1                                                |
|     UNPARTITIONED                                                 |
|                                                                   |
|   0:SCAN HDFS                                                     |
|      table=predicate_propagation.census #partitions=1/3 size=12B  |
+------------------------------------------------------------------+
Impala can even do partition pruning in cases where the partition key column is not directly compared to a
constant, by applying the transitive property to other parts of the WHERE clause. This technique is known as
predicate propagation, and is available in Impala 1.2.2 and higher. In this example, the census table includes
another column indicating when the data was collected, which happens in 10-year intervals. Even though the
query does not compare the partition key column (YEAR) to a constant value, Impala can deduce that only the
partition YEAR=2010 is required, and again only reads 1 out of 3 partitions.
[localhost:21000] > drop table census;
[localhost:21000] > create table census (name string, census_year int) partitioned by
(year int);
[localhost:21000] > insert into census partition (year=2010) values
('Smith',2010),('Jones',2010);
[localhost:21000] > insert into census partition (year=2011) values
('Smith',2020),('Jones',2020),('Doe',2020);
[localhost:21000] > insert into census partition (year=2012) values
('Smith',2020),('Doe',2020);
[localhost:21000] > select name from census where year = census_year and
census_year=2010;
+-------+
| name |
+-------+
| Smith |
| Jones |
+-------+
[localhost:21000] > explain select name from census where year = census_year and
census_year=2010;
+------------------------------------------------------------------+
| Explain String                                                    |
+------------------------------------------------------------------+
| PLAN FRAGMENT 0                                                   |
|   PARTITION: UNPARTITIONED                                        |
|                                                                   |
|   1:EXCHANGE                                                      |
|                                                                   |
| PLAN FRAGMENT 1                                                   |
|   PARTITION: RANDOM                                               |
|                                                                   |
|   STREAM DATA SINK                                                |
|     EXCHANGE ID: 1                                                |
|     UNPARTITIONED                                                 |
|                                                                   |
|   0:SCAN HDFS                                                     |
|      table=predicate_propagation.census #partitions=1/3 size=22B  |
|      predicates: census_year = 2010, year = census_year           |
+------------------------------------------------------------------+
For a report of the volume of data that was actually read and processed at each stage of the query, check the
output of the SUMMARY command immediately after running the query. For a more detailed analysis, look at the
output of the PROFILE command; it includes this same summary report near the start of the profile output.
If a view applies to a partitioned table, any partition pruning is determined by the clauses in the original query.
Impala does not prune additional partitions if the query on the view includes extra WHERE clauses referencing the
partition key columns.
[localhost:21000] > insert into census partition (year=2013) values
('Flores'),('Bogomolov'),('Cooper'),('Appiah');
At this point, the HDFS directory for year=2012 contains a text-format data file, while the HDFS directory for
year=2013 contains a Parquet data file. As always, when loading non-trivial data, you would use INSERT ...
SELECT or LOAD DATA to import data in large batches, rather than INSERT ... VALUES which produces small
files that are inefficient for real-world queries.
For other file types that Impala cannot create natively, you can switch into Hive and issue the ALTER TABLE
... SET FILEFORMAT statements and INSERT or LOAD DATA statements there. After switching back to Impala,
issue a REFRESH table_name statement so that Impala recognizes any partitions or new data added through
Hive.
Format        Structure     Compression Codecs                         Impala Can CREATE?
Parquet       Structured    Snappy, GZIP; currently Snappy by default  Yes.
Text          Unstructured  LZO
Avro          Structured
RCFile        Structured
SequenceFile  Structured
Format  Structure     Compression Codecs
Text    Unstructured  LZO
The data files created by any INSERT statements will use the Ctrl-A character (hex 01) as a separator between
each column value.
A common use case is to import existing text files into an Impala table. The syntax is more verbose; the significant
part is the FIELDS TERMINATED BY clause, which must be preceded by the ROW FORMAT DELIMITED clause.
The statement can end with a STORED AS TEXTFILE clause, but that clause is optional because text format
tables are the default. For example:
create table csv(id int, s string, n int, t timestamp, b boolean)
row format delimited
fields terminated by ',';
create table tsv(id int, s string, n int, t timestamp, b boolean)
row format delimited
fields terminated by '\t';
create table pipe_separated(id int, s string, n int, t timestamp, b boolean)
row format delimited
fields terminated by '|'
stored as textfile;
You can create tables with specific separator characters to import text files in familiar formats such as CSV, TSV,
or pipe-separated. You can also use these tables to produce output data files, by copying data into them through
the INSERT ... SELECT syntax and then extracting the data files from the Impala data directory.
In Impala 1.3.1 and higher, you can specify a delimiter character '\0' to use the ASCII 0 (nul) character for text
tables:
create table nul_separated(id int, s string, n int, t timestamp, b boolean)
row format delimited
fields terminated by '\0'
stored as textfile;
This can be a useful technique to see how Impala represents special values within a text-format data file. Use
the DESCRIBE FORMATTED statement to see the HDFS directory where the data files are stored, then use Linux
Note: Because Impala and the HDFS infrastructure are optimized for multi-megabyte files, avoid the
INSERT ... VALUES notation when you are inserting many rows. Each INSERT ... VALUES
statement produces a new tiny file, leading to fragmentation and reduced performance. When creating
any substantial volume of new data, use one of the bulk loading techniques such as LOAD DATA or
INSERT ... SELECT. Or, use an HBase table for single-row INSERT operations, because HBase tables
are not subject to the same fragmentation issues as tables stored on HDFS.
When you create a text file for use with an Impala text table, specify \N to represent a NULL value. For the
differences between NULL and empty strings, see NULL on page 64.
If a text file has fewer fields than the columns in the corresponding Impala table, all the corresponding columns
are set to NULL when the data in that file is read by an Impala query.
If a text file has more fields than the columns in the corresponding Impala table, the extra fields are ignored
when the data in that file is read by an Impala query.
You can also use manual HDFS operations such as hdfs dfs -put or hdfs dfs -cp to put data files in the
data directory for an Impala table. When you copy or move new data files into the HDFS directory for the Impala
table, issue a REFRESH table_name statement in impala-shell before issuing the next query against that
table, to make Impala recognize the newly added files.
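For example, a sketch of that workflow, with a hypothetical data file and warehouse path, and using the csv table
created earlier:
$ hdfs dfs -put new_batch.csv /user/hive/warehouse/mydb.db/csv/
$ impala-shell -i localhost
[localhost:21000] > refresh csv;
[localhost:21000] > select count(*) from csv;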
sudo yum update
sudo yum install hadoop-lzo-cdh4     # For clusters running CDH 4.
sudo yum install hadoop-lzo          # For clusters running CDH 5 or higher.
sudo yum install impala-lzo

sudo zypper update
sudo zypper install hadoop-lzo-cdh4  # For clusters running CDH 4.
sudo zypper install hadoop-lzo       # For clusters running CDH 5 or higher.
sudo zypper install impala-lzo

sudo apt-get update
sudo apt-get install hadoop-lzo-cdh4 # For clusters running CDH 4.
sudo apt-get install hadoop-lzo      # For clusters running CDH 5 or higher.
sudo apt-get install impala-lzo
Note:
The level of the impala-lzo-cdh4 package is closely tied to the version of Impala you use. Any
time you upgrade Impala, re-do the installation command for impala-lzo on each applicable
machine to make sure you have the appropriate version of that package.
3. For core-site.xml on the client and server (that is, in the configuration directories for both Impala and
Hadoop), append com.hadoop.compression.lzo.LzopCodec to the comma-separated list of codecs. For
example:
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,
org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
Once you have created LZO-compressed text tables, you can convert data stored in other tables (regardless of
file format) by using the INSERT ... SELECT statement in Hive.
Files in an LZO-compressed table must use the .lzo extension. Examine the files in the HDFS data directory
after doing the INSERT in Hive, to make sure the files have the right extension. If the required settings are not
in place, you end up with regular uncompressed files, and Impala cannot access the table because it finds data
files with the wrong (uncompressed) format.
After loading data into an LZO-compressed text table, index the files so that they can be split. You index the
files by running a Java class, com.hadoop.compression.lzo.DistributedLzoIndexer, through the Linux
command line. This Java class is included in the hadoop-lzo package.
Run the indexer using a command like the following:
$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar
com.hadoop.compression.lzo.DistributedLzoIndexer /hdfs_location_of_table/
Note: If the path of the JAR file in the preceding example is not recognized, do a find command to
locate hadoop-lzo-*-gplextras.jar and use that path.
Indexed files have the same name as the file they index, with the .index extension. If the data files are not
indexed, Impala queries still work, but the queries read the data from remote DataNodes, which is very inefficient.
Format   Structure   Compression Codecs                         Impala Can CREATE?
Parquet  Structured  Snappy, GZIP; currently Snappy by default  Yes.
Or, to clone the column names and data types of an existing table:
[impala-host:21000] > create table parquet_table_name LIKE other_table_name STORED AS
PARQUET;
In Impala 1.4.0 and higher, you can derive column definitions from a raw Parquet data file, even without an
existing Impala table. For example, you can create an external table pointing to an HDFS directory, and base the
column definitions on one of the files in that directory:
CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET
'/user/etl/destination/datafile1.dat'
STORED AS PARQUET
LOCATION '/user/etl/destination';
Or, you can refer to an existing data file and create a new empty table with suitable column definitions. Then
you can use INSERT to create new data files or LOAD DATA to transfer existing data files into the new table.
CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
STORED AS PARQUET;
In this example, the new table is partitioned by year, month, and day. These partition key columns are not part
of the data file, so you specify them in the CREATE TABLE statement:
CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
PARTITIONED BY (year INT, month TINYINT, day TINYINT)
STORED AS PARQUET;
See CREATE TABLE Statement on page 90 for more details about the CREATE TABLE LIKE PARQUET syntax.
Once you have created a table, to insert data into that table, use a command similar to the following, again with
your own table names:
[impala-host:21000] > insert overwrite table parquet_table_name select * from
other_table_name;
If the Parquet table has a different number of columns or different column names than the other table, specify
the names of columns from the other table rather than * in the SELECT statement.
The query processes only 2 columns out of a large number of total columns. If the table is partitioned by the
STATE column, it is even more efficient because the query only has to read and decode 1 column from each data
file, and it can read only the data files in the partition directory for the state 'CA', skipping the data files for all
the other states, which will be physically located in other directories.
Impala would have to read the entire contents of each 1 GB data file, and decompress the contents of each
column for each row group, negating the I/O optimizations of the column-oriented format. This query might
still be faster for a Parquet table than a table with some other file format, but it does not take advantage of the
unique strengths of Parquet data files.
Impala can optimize queries on Parquet tables, especially join queries, better when statistics are available for
all the tables. Issue the COMPUTE STATS statement for each table after substantial amounts of data are loaded
into or appended to it. See COMPUTE STATS Statement on page 84 for details.
Note: Currently, a known issue (IMPALA-488) could cause excessive memory usage during a COMPUTE
STATS operation on a Parquet table. As a workaround, issue the command SET
NUM_SCANNER_THREADS=2 in impala-shell before issuing the COMPUTE STATS statement. Then
issue UNSET NUM_SCANNER_THREADS before continuing with queries.
Because Parquet data files are typically sized at about 1 GB, each directory will have a different number of data
files and the row groups will be arranged differently.
At the same time, the less aggressive the compression, the faster the data can be decompressed. In this case
using a table with a billion rows, a query that evaluates all the values for a particular column runs faster with
no compression than with Snappy compression, and faster with Snappy compression than with Gzip compression.
Then in the shell, we copy the relevant data files into the data directory for this new table. Rather than using
hdfs dfs -cp as with typical files, we use hdfs distcp -pb to ensure that the special 1 GB block size of the
Parquet data files is preserved.
$ hdfs distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_snappy \
/user/hive/warehouse/parquet_compression.db/parquet_everything
...MapReduce output...
$ hdfs distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_gzip \
/user/hive/warehouse/parquet_compression.db/parquet_everything
...MapReduce output...
$ hdfs distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_none \
/user/hive/warehouse/parquet_compression.db/parquet_everything
...MapReduce output...
If you are running a level of Impala that is older than 1.1.1, do the metadata update through Hive:
ALTER TABLE table_name SET SERDE 'parquet.hive.serde.ParquetHiveSerDe';
ALTER TABLE table_name SET FILEFORMAT
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
Impala 1.1.1 and higher can reuse Parquet data files created by Hive, without any action required.
Impala supports the scalar data types that you can encode in a Parquet data file, but not composite or nested
types such as maps or arrays. If any column of a table uses such an unsupported data type, Impala cannot
access that table.
If you copy Parquet data files between nodes, or even between different directories on the same node, make
sure to preserve the block size by using the command hadoop distcp -pb. To verify that the block size was
preserved, issue the command hdfs fsck -blocks HDFS_path_of_impala_table_dir and check that the
average block size is at or near 1 GB. (The hadoop distcp operation typically leaves some directories behind,
with names matching _distcp_logs_*, that you can delete from the destination directory afterward.) See the
Hadoop DistCP Guide for details.
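For example, reusing the destination directory from the earlier distcp example, a quick block-size check might
look like the following:
$ hdfs fsck -blocks /user/hive/warehouse/parquet_compression.db/parquet_everything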
Here are techniques to help you produce large data files in Parquet INSERT operations, and to compact existing
too-small data files:
When inserting into a partitioned Parquet table, use statically partitioned INSERT statements where the
partition key values are specified as constant values. Ideally, use a separate INSERT statement for each
partition.
You might set the NUM_NODES option to 1 briefly, during INSERT or CREATE TABLE AS SELECT statements.
Normally, those statements produce one or more data files per data node. If the write operation involves
small amounts of data, a Parquet table, and/or a partitioned table, the default behavior could produce many
small files when intuitively you might expect only a single output file. SET NUM_NODES=1 turns off the
distributed aspect of the write operation, making it more likely to produce only one or a few data files.
Be prepared to reduce the number of partition key columns from what you are used to with traditional analytic
database systems.
Do not expect Impala-written Parquet files to fill up the entire Parquet block size (1 GB by default). Impala
estimates on the conservative side when figuring out how much data to write to each Parquet file. Typically,
1 GB of uncompressed data in memory is reduced down to much less than 1 GB on disk by the compression
and encoding techniques in the Parquet file format. Impala reserves 1 GB of memory to buffer the data before
writing, but the actual data file might be smaller, in the hundreds of megabytes. The final data file size varies
depending on the compressibility of the data. Therefore, it is not an indication of a problem if 1 GB of text
data is turned into 2 Parquet data files, each less than 1 GB.
If you accidentally end up with a table with many small data files, consider using one or more of the preceding
techniques and copying all the data into a new Parquet table, either through CREATE TABLE AS SELECT or
INSERT ... SELECT statements.
To avoid rewriting queries to change table names, you can adopt a convention of always running important
queries against a view. Changing the view definition immediately switches any subsequent queries to use
the new underlying tables:
create view production_table as select * from table_with_many_small_files;
-- CTAS or INSERT...SELECT all the data into a more efficient layout...
alter view production_table as select * from table_with_few_big_files;
select * from production_table where c1 = 100 and c2 < 50 and ...;
Format  Structure
Avro    Structured
Each field of the record becomes a column of the table. Note that any other information, such as the record
name, is ignored.
Note: For nullable Avro columns, make sure to put the "null" entry before the actual type name. In
Impala, all columns are nullable; Impala currently does not have a NOT NULL clause. Any non-nullable
property is only enforced on the Avro side.
Most column types map directly from Avro to Impala under the same names. These are the exceptions and
special cases to consider:
The DECIMAL type is defined in Avro as a BYTE type with the logicalType property set to "decimal" and
a specified precision and scale. Use DECIMAL in Avro tables only under CDH 5. The infrastructure and
components under CDH 4 do not have reliable DECIMAL support.
The Avro long type maps to BIGINT in Impala.
Once the Avro table is created and contains data, you can query it through the impala-shell command.
Now in the Hive shell, you change the type of a column and add a new column with a default value:
-- Promote column "a" from INT to FLOAT (no need to update Avro schema)
ALTER TABLE avro_table CHANGE A A FLOAT;
-- Add column "c" with default
ALTER TABLE avro_table ADD COLUMNS (c int);
ALTER TABLE avro_table SET TBLPROPERTIES (
'avro.schema.literal'='{
"type": "record",
"name": "my_record",
"fields": [
{"name": "a", "type": "int"},
{"name": "b", "type": "string"},
{"name": "c", "type": "int", "default": 10}
]}');
Format  Structure
RCFile  Structured
Because Impala can query some kinds of tables that it cannot currently write to, after creating tables of certain
file formats, you might use the Hive shell to load the data. See How Impala Works with Hadoop File Formats on
page 239 for details. After loading data into a table through Hive or other mechanism outside of Impala, issue a
REFRESH table_name statement the next time you connect to the Impala node, before querying the table, to
make Impala recognize the new data.
Important: See Known Issues in the Current Production Release (Impala 1.4.x) for potential
compatibility issues with RCFile tables created in Hive 0.12, due to a change in the default RCFile
SerDe for Hive.
For example, here is how you might create some RCFile tables in Impala (by specifying the columns explicitly,
or cloning the structure of another table), load data through Hive, and query them through Impala:
$ impala-shell -i localhost
[localhost:21000] > create table rcfile_table (x int) stored as rcfile;
[localhost:21000] > create table rcfile_clone like some_other_table stored as rcfile;
[localhost:21000] > quit;

$ impala-shell -i localhost
[localhost:21000] > select * from rcfile_table;
Returned 0 row(s) in 0.23s
[localhost:21000] > -- Make Impala recognize the data loaded through Hive;
[localhost:21000] > refresh rcfile_table;
[localhost:21000] > select * from rcfile_table;
Returned 3 row(s) in 0.23s
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE new_table SELECT * FROM old_table;
If you are converting partitioned tables, you must complete additional steps. In such a case, specify additional
settings similar to the following:
hive> CREATE TABLE new_table (your_cols) PARTITIONED BY (partition_cols) STORED AS
new_format;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE new_table PARTITION(comma_separated_partition_cols) SELECT
* FROM old_table;
Remember that Hive does not require that you specify a source format for it. Consider the case of converting a
table with two partition columns called year and month to a Snappy compressed RCFile. Combining the
components outlined previously to complete this table conversion, you would specify settings similar to the
following:
hive> CREATE TABLE tbl_rc (int_col INT, string_col STRING) STORED AS RCFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_rc SELECT * FROM tbl;
To complete a similar process for a table that includes partitions, you would specify settings similar to the
following:
hive> CREATE TABLE tbl_rc (int_col INT, string_col STRING) PARTITIONED BY (year INT)
STORED AS RCFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_rc PARTITION(year) SELECT * FROM tbl;
Note:
The compression type is specified in the following command:
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Format        Structure
SequenceFile  Structured
Because Impala can query some kinds of tables that it cannot currently write to, after creating tables of certain
file formats, you might use the Hive shell to load the data. See How Impala Works with Hadoop File Formats on
page 239 for details. After loading data into a table through Hive or other mechanism outside of Impala, issue a
REFRESH table_name statement the next time you connect to the Impala node, before querying the table, to
make Impala recognize the new data.
For example, here is how you might create some SequenceFile tables in Impala (by specifying the columns
explicitly, or cloning the structure of another table), load data through Hive, and query them through Impala:
$ impala-shell -i localhost
[localhost:21000] > create table seqfile_table (x int) stored as sequencefile;
[localhost:21000] > create table seqfile_clone like some_other_table stored as sequencefile;
[localhost:21000] > quit;

$ hive
hive> insert into table seqfile_table select x from some_other_table;
3 Rows loaded to seqfile_table
Time taken: 19.047 seconds
hive> quit;

$ impala-shell -i localhost
[localhost:21000] > select * from seqfile_table;
Returned 0 row(s) in 0.23s
[localhost:21000] > -- Make Impala recognize the data loaded through Hive;
[localhost:21000] > refresh seqfile_table;
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
insert overwrite table new_table select * from old_table;
If you are converting partitioned tables, you must complete additional steps. In such a case, specify additional
settings similar to the following:
hive> create table new_table (your_cols) partitioned by (partition_cols) stored as
new_format;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table new_table partition(comma_separated_partition_cols) select
* from old_table;
Remember that Hive does not require that you specify a source format for it. Consider the case of converting a
table with two partition columns called year and month to a Snappy compressed SequenceFile. Combining the
components outlined previously to complete this table conversion, you would specify settings similar to the
following:
hive> CREATE TABLE tbl_seq (int_col INT, string_col STRING) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq SELECT * FROM tbl;
To complete a similar process for a table that includes partitions, you would specify settings similar to the
following:
hive> CREATE TABLE tbl_seq (int_col INT, string_col STRING) PARTITIONED BY (year INT)
STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq PARTITION(year) SELECT * FROM tbl;
statement.
You map these specially created tables to corresponding tables that exist in HBase, with the clause
TBLPROPERTIES("hbase.table.name" = "table_name_in_hbase") on the Hive CREATE TABLE
statement.
See Examples of Querying HBase Tables from Impala on page 272 for a full example.
You define the column corresponding to the HBase row key as a string with the #string keyword, or map
it to a STRING column.
Because Impala and Hive share the same metastore database, once you create the table in Hive, you can
query or insert into it through Impala. (After creating a new table through Hive, issue the INVALIDATE
METADATA statement in impala-shell to make Impala aware of the new table.)
You issue queries against the Impala tables. For efficient queries, use WHERE clauses to find a single key
value or a range of key values wherever practical, by testing the Impala column corresponding to the HBase
row key. Avoid queries that do full-table scans, which are efficient for regular Impala tables but inefficient
in HBase.
To work with an HBase table from Impala, ensure that the impala user has read/write privileges for the HBase
table, using the GRANT command in the HBase shell. For details about HBase security, see
http://hbase.apache.org/book/ch08s04.html#hbase.accesscontrol.configuration.
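For example, a minimal grant from the HBase shell might look like the following (the table name is illustrative,
and the privileges you need depend on your security configuration):
$ hbase shell
hbase> grant 'impala', 'RW', 'hbase_table'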
Currently, Cloudera Manager does not have an Impala-only override for HBase settings, so any HBase configuration
change you make through Cloudera Manager would take effect for all HBase applications. Therefore, this change
is not recommended on systems managed by Cloudera Manager.
The best case for performance involves a single row lookup using an equality comparison on the column defined
as the row key:
explain select count(*) from hbase_table where cust_id = '[email protected]';
+------------------------------------------------------------------------------------+
| Explain String                                                                      |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.01GB VCores=1                             |
| WARNING: The following tables are missing relevant table and/or column statistics.  |
| hbase.hbase_table                                                                   |
|                                                                                     |
| 03:AGGREGATE [MERGE FINALIZE]                                                       |
| |  output: sum(count(*))                                                            |
| |                                                                                   |
| 02:EXCHANGE [PARTITION=UNPARTITIONED]                                               |
| |                                                                                   |
| 01:AGGREGATE                                                                        |
| |  output: count(*)                                                                 |
Another type of efficient query involves a range lookup on the row key column, using SQL operators such as
greater than (or equal), less than (or equal), or BETWEEN. This example also includes an equality test on a non-key
column; because that column is a STRING, Impala can let HBase perform that test, indicated by the hbase
filters: line in the EXPLAIN output. Doing the filtering within HBase is more efficient than transmitting all
the data to Impala and doing the filtering on the Impala side.
explain select count(*) from hbase_table where cust_id between 'a' and 'b'
and never_logged_on = 'true';
+------------------------------------------------------------------------------------+
| Explain String                                                                      |
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE                                                                        |
| |  output: count(*)                                                                 |
| |                                                                                   |
| 00:SCAN HBASE [hbase.hbase_table]                                                   |
|    start key: a                                                                     |
|    stop key: b\0                                                                    |
|    hbase filters: cols:never_logged_on EQUAL 'true'                                 |
+------------------------------------------------------------------------------------+
The query is less efficient if Impala has to evaluate any of the predicates, because Impala must scan the entire
HBase table. Impala can only push down predicates to HBase for columns declared as STRING. This example
tests a column declared as INT, and the predicates: line in the EXPLAIN output indicates that the test is
performed after the data is transmitted to Impala.
explain select count(*) from hbase_table where year_registered = 2010;
+------------------------------------------------------------------------------------+
| Explain String                                                                      |
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE                                                                        |
| |  output: count(*)                                                                 |
| |                                                                                   |
| 00:SCAN HBASE [hbase.hbase_table]                                                   |
|    predicates: year_registered = 2010                                               |
+------------------------------------------------------------------------------------+
Currently, tests on the row key using OR or IN clauses are not optimized into direct lookups either. Such limitations
might be lifted in the future, so always check the EXPLAIN output to be sure whether a particular SQL construct
results in an efficient query or not for HBase tables.
explain select count(*) from hbase_table where
cust_id = '[email protected]' or cust_id = '[email protected]';
+----------------------------------------------------------------------------------------+
| Explain String
+----------------------------------------------------------------------------------------+
...
| 01:AGGREGATE
| |  output: count(*)
| |
| 00:SCAN HBASE [hbase.hbase_table]
|    predicates: cust_id = '[email protected]' OR cust_id = '[email protected]'
+----------------------------------------------------------------------------------------+
explain select count(*) from hbase_table where
cust_id in ('[email protected]', '[email protected]');
+------------------------------------------------------------------------------------+
| Explain String
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE
| |  output: count(*)
| |
| 00:SCAN HBASE [hbase.hbase_table]
|    predicates: cust_id IN ('[email protected]', '[email protected]')
+------------------------------------------------------------------------------------+
Note: After you create a table in Hive, such as the HBase mapping table in this example, issue an
INVALIDATE METADATA table_name statement the next time you connect to Impala, to make Impala
aware of the new table. (Prior to Impala 1.2.4, you could not specify the table name if Impala was not
aware of the table yet; in Impala 1.2.4 and higher, specifying the table name avoids reloading the
metadata for other tables that are not changed.)
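For example, a single-table refresh for the mapping table used in the preceding queries might look like this in impala-shell (assuming Impala 1.2.4 or higher, where the table name can be specified):
INVALIDATE METADATA hbase_table;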
Without a String Row Key
This example defines the lookup key column as INT instead of STRING.
Note: Although this table definition works, Cloudera strongly recommends using a string value as
the row key for HBase tables, because the key lookups are much faster when the key column is defined
as a string.
Again, issue the following CREATE TABLE statement through Hive, then switch back to Impala and the
impala-shell interpreter to issue the queries.
$ hive
...
CREATE EXTERNAL TABLE hbasealltypessmall (
id int,
bool_col boolean,
tinyint_col tinyint,
smallint_col smallint,
int_col int,
bigint_col bigint,
float_col float,
double_col double,
date_string_col string,
string_col string,
timestamp_col timestamp)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,bools:bool_col,ints:tinyint_col,ints:smallint_col,ints:int_col,ints:bigint_col,floats\
:float_col,floats:double_col,strings:date_string_col,strings:string_col,strings:timestamp_col"
)
TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall");
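The example queries below also reference a table named hbasestringids, which is not defined in this excerpt. It is assumed to be an analogous Hive mapping table whose id column is declared as STRING so that it maps to the HBase row key, along these lines:
CREATE EXTERNAL TABLE hbasestringids (
  id string,
  bool_col boolean,
  tinyint_col tinyint,
  smallint_col smallint,
  int_col int,
  bigint_col bigint,
  float_col float,
  double_col double,
  date_string_col string,
  string_col string,
  timestamp_col timestamp)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" =
  ":key,bools:bool_col,ints:tinyint_col,ints:smallint_col,ints:int_col,ints:bigint_col,floats\
:float_col,floats:double_col,strings:date_string_col,strings:string_col,strings:timestamp_col"
)
TBLPROPERTIES("hbase.table.name" = "hbasestringids");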
Example Queries
Once you have established the mapping to an HBase table, you can issue queries.
For example:
# if the row key is mapped as a string col, range predicates are applied to the scan
select * from hbasestringids where id = '5';
# predicate on row key doesn't get transformed into scan parameter, because
# it's mapped as an int (but stored in ASCII and ordered lexicographically)
select * from hbasealltypessmall where id < 5;
Note: The preceding example shows only a small part of the log file. Impala log files are often several
megabytes in size.
There is information about each job Impala has run. Because each Impala job creates an additional set of data
about queries, the amount of job-specific data can be very large. Logs may contain detailed information on
jobs. These detailed log entries may include:
The composition of the query.
The degree of data locality.
Statistics on data throughput and response times.
Note: For performance reasons, Cloudera highly recommends not enabling the most verbose logging
level of 3.
For more information on how to configure GLOG, including how to set variable logging levels for different system
components, see How To Use Google Logging Library (glog).
Understanding What is Logged at Different Logging Levels
As logging levels increase, the categories of information logged are cumulative. For example, GLOG_v=2 records
everything GLOG_v=1 records, as well as additional information.
Increasing logging levels imposes performance overhead and increases log size. Cloudera recommends using
GLOG_v=1 for most cases: this level has minimal performance impact but still captures useful troubleshooting
information.
Additional information logged at each level is as follows:
GLOG_v=1 - The default level. Logs information about each connection and query that is initiated to an
impalad instance, including runtime profiles.
GLOG_v=2 - Everything from the previous level plus information for each RPC initiated. This level also records
query execution progress information, including details on each file that is read.
GLOG_v=3 - Everything from the previous level plus logging of every row that is read. This level is only
applicable for the most serious troubleshooting and tuning scenarios, because it can produce exceptionally
large and detailed log files, potentially leading to its own set of performance and capacity problems.
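As a sketch of how the logging level is commonly set on package-based installations, the GLOG_v environment variable can be exported in the impalad startup environment and the daemon restarted. The file path and service name below are assumptions for a package install; on systems managed by Cloudera Manager, set the level through the Impala configuration pages instead.
$ echo 'export GLOG_v=1' | sudo tee -a /etc/default/impala
$ sudo service impala-server restart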
Service                          Port     Access Requirement     Comment
Impala Daemon                    21000    External
Impala Daemon                    21050    External
Impala Daemon                             Internal
Impala Daemon                             Internal
Impala Daemon                             External
                                          External
Impala Catalog Daemon            25020    External
                                 24000    Internal
Impala Catalog Daemon            26000    Internal
Impala Daemon                    28000    Internal
Impala Llama ApplicationMaster   15002    Internal               Llama Thrift Admin Port
Impala Llama ApplicationMaster   15000    Internal               Llama Thrift Port
Impala Llama ApplicationMaster   15001    External               Llama HTTP Port
Symptom: Joins fail to complete.

Symptom: Queries are slow to return results.
Explanation: Some impalad instances may not have started. Using a browser, connect to the host running
the Impala state store. Connect using an address of the form http://hostname:port/metrics.
Note: Replace hostname and port with the hostname and port of your Impala state store host machine
and web server port. The default port is 25010.
The number of impalad instances listed should match the expected number of impalad instances.
Recommendation: Ensure Impala is installed on all DataNodes. Start any impalad instances that are not
running.

Symptom: Queries are slow to return results.
Explanation: Impala may not be configured to use data locality tracking.

Symptom: Attempts to complete Impala tasks such as executing INSERT-SELECT actions fail. The Impala
logs include notes that files could not be opened due to permission denied.

Symptom: Impala fails to start up, with the impalad logs referring to errors connecting to the statestore
service and attempts to re-register.
Explanation: A large number of databases, tables, partitions, and so on can require metadata synchronization
on startup that takes longer than the default timeout for the statestore service.
Recommendation: Increase the statestore timeout value above its default of 10 seconds. For instructions, see
Increasing the Statestore Timeout on page 44.
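As a quick way to perform the state store check described above (counting the registered impalad instances), the metrics page can also be fetched from the command line; this sketch assumes curl is available and uses the default state store web server port of 25010 with a placeholder hostname:
$ curl http://statestore-hostname:25010/metrics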
Main Page
By default, the main page of the debug web UI is at http://impala-server-hostname:25000/ (non-secure
cluster) or https://impala-server-hostname:25000/ (secure cluster).
This page lists the version of the impalad daemon, plus basic hardware and software information about the
corresponding host, such as information about the CPU, memory, disks, and operating system version.
Backends Page
By default, the backends page of the debug web UI is at http://impala-server-hostname:25000/backends
(non-secure cluster) or https://impala-server-hostname:25000/backends (secure cluster).
This page lists the host and port info for each of the impalad nodes in the cluster. Because each impalad
daemon knows about every other impalad daemon through the statestore, this information should be the same
regardless of which node you select. Links take you to the corresponding debug web pages for any of the other
nodes in the cluster.
Catalog Page
By default, the catalog page of the debug web UI is at http://impala-server-hostname:25000/catalog
(non-secure cluster) or https://impala-server-hostname:25000/catalog (secure cluster).
This page displays a list of databases and associated tables recognized by this instance of impalad. You can
use this page to locate which database a table is in, check the exact spelling of a database or table name, look
for identical table names in multiple databases, and so on.
Logs Page
By default, the logs page of the debug web UI is at http://impala-server-hostname:25000/logs (non-secure
cluster) or https://impala-server-hostname:25000/logs (secure cluster).
This page shows the last portion of the impalad.INFO log file, the most detailed of the info, warning, and error
logs for the impalad daemon. You can refer here to see the details of the most recent operations, whether the
operations succeeded or encountered errors. This central page can be more convenient than searching the
filesystem for the log files, which could be in different locations depending on whether or not the cluster is
managed by Cloudera Manager.
Memz Page
By default, the memz page of the debug web UI is at http://impala-server-hostname:25000/memz (non-secure
cluster) or https://impala-server-hostname:25000/memz (secure cluster).
This page displays summary and detailed information about memory usage by the impalad daemon. You can
see the memory limit in effect for the node, and how much of that memory Impala is currently using.
Metrics Page
By default, the metrics page of the debug web UI is at http://impala-server-hostname:25000/metrics
(non-secure cluster) or https://impala-server-hostname:25000/metrics (secure cluster).
This page displays the current set of metrics: counters and flags representing various aspects of impalad internal
operation. For the meanings of these metrics, see Impala Metrics in the Cloudera Manager documentation.
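The same metrics can also be retrieved non-interactively, for example from a monitoring script; this sketch assumes curl is available and uses the default impalad web UI port of 25000:
$ curl http://impala-server-hostname:25000/metrics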
Sessions Page
By default, the sessions page of the debug web UI is at http://impala-server-hostname:25000/sessions
(non-secure cluster) or https://impala-server-hostname:25000/sessions (secure cluster).
This page displays information about the sessions currently connected to this impalad instance. For example,
sessions could include connections from the impala-shell command, JDBC or ODBC applications, or the Impala
Query UI in the Hue web interface.
Threadz Page
By default, the threadz page of the debug web UI is at http://impala-server-hostname:25000/threadz
(non-secure cluster) or https://impala-server-hostname:25000/threadz (secure cluster).
This page displays information about the threads used by this instance of impalad, and shows which categories
they are grouped into. Making use of this information requires substantial knowledge about Impala internals.
Varz Page
By default, the varz page of the debug web UI is at http://impala-server-hostname:25000/varz (non-secure
cluster) or https://impala-server-hostname:25000/varz (secure cluster).
This page shows the configuration settings in effect when this instance of impalad communicates with other
Hadoop components such as HDFS and YARN. These settings are collected from a set of configuration files;
Impala might not actually make use of all settings.
The bottom of this page also lists all the command-line settings in effect for this instance of impalad. See
Modifying Impala Startup Options for information about modifying these values.