Hive Using HiveQL
https://docs.cloudera.com/
Legal Notice
© Cloudera Inc. 2024. All rights reserved.
The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property
rights. No license under copyright or any other intellectual property right is granted herein.
Unless otherwise noted, scripts and sample code are licensed under the Apache License, Version 2.0.
Copyright information for Cloudera software may be found within the documentation accompanying each component in a
particular release.
Cloudera software includes software from various open source or other third party projects, and may be released under the
Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms.
Other software included may be released under the terms of alternative open source licenses. Please review the license and
notice files accompanying the software for additional licensing information.
Please visit the Cloudera software product page for more information on Cloudera software. For more information on
Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your
specific needs.
Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor
liability arising from the use of products, except as expressly agreed to in writing by Cloudera.
Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered
trademarks in the United States and other countries. All other trademarks are the property of their respective owners.
Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA,
CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF
ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR
RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT
CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE
FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION
NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER’S BUSINESS REQUIREMENTS.
WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE
LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND
FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED
ON COURSE OF DEALING OR USAGE IN TRADE.
The following matrix includes the types of tables you can create using Hive, whether or not ACID properties are
supported, the required storage format, and key SQL operations.
Table Type                             ACID   File Format   INSERT   UPDATE/DELETE
Managed: CRUD transactional            Yes    ORC           Yes      Yes
Managed: insert-only transactional     Yes    Any           Yes      No
Managed: temporary                     No     Any           Yes      No
External                               No     Any           Yes      No
Although you cannot use the SQL UPDATE or DELETE statements to delete data in some types of tables, you can
use DROP PARTITION on any table type to delete the data.
Apache Hive 3 tables
Transactional tables
Transactional tables are ACID tables that reside in the Hive warehouse. To achieve ACID compliance, Hive has
to manage the table, including access to the table data. Only through Hive can you access and change the data in
managed tables. Because Hive has full control of managed tables, Hive can optimize these tables extensively.
Hive is designed to support a relatively low rate of transactions, as opposed to serving as an online transaction
processing (OLTP) system. You can use the SHOW TRANSACTIONS command to list open and aborted
transactions.
Transactional tables in Hive 3 are on a par with non-ACID tables. No bucketing or sorting is required in Hive 3
transactional tables. Bucketing does not affect performance. These tables are compatible with native cloud storage.
Hive supports one statement per transaction, which can include any number of rows, partitions, or tables.
External tables
External table data is not owned or controlled by Hive. You typically use an external table when you want to access
data directly at the file level, using a tool other than Hive.
Hive 3 does not support the following capabilities for external tables:
• Query cache
• Materialized views, except in a limited way
• Automatic runtime filtering
• File merging after insert
• ARCHIVE, UNARCHIVE, TRUNCATE, MERGE, and CONCATENATE. These statements only work for Hive
Managed tables.
When you run DROP TABLE on an external table, by default Hive drops only the metadata (schema). If you want the
DROP TABLE command to also remove the actual data in the external table, as DROP TABLE does on a managed
table, you need to set the external.table.purge property to true as described later.
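For example, a sketch of setting the property on a hypothetical external table named mytable:

ALTER TABLE mytable SET TBLPROPERTIES ('external.table.purge'='true');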
Procedure
1. In Cloudera Manager, click Clusters > Hive > Action Menu > Create Hive Warehouse Directory.
2. In Cloudera Manager, click Clusters > Hive (the Hive Metastore service) > Configuration, and change the
hive.metastore.warehouse.dir property value to the path for the new Hive warehouse directory.
3. Click Hive > Hive Action Menu > Create Hive Warehouse External Directory.
4. Change the hive.metastore.warehouse.external.dir property value to the path for the Hive warehouse external
directory.
5. Configure Ranger Hadoop SQL policy to access the URL of the directory on the object store, such as S3 or
Ozone, or file system, such as HDFS.
The way you access managed Hive tables from Spark and other clients changes.
In CDP, access to external tables requires you to set up security access permissions.
You must understand the behavior of the CREATE TABLE statement in legacy platforms like CDH or HDP and how
the behavior changes after you upgrade to CDP.
In this example, tables created under the test_db database using the CREATE TABLE statement
are external tables with the purge functionality enabled (external.table.purge = 'true').
You can also choose to configure a database to allow only external tables to be created and prevent
creation of ACID tables. While creating a database, you can set the database property
EXTERNAL_TABLES_ONLY=true to ensure that only external tables are created in the database. For
example:
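A minimal sketch of such a statement, using the test_db database from this example:

CREATE DATABASE test_db WITH DBPROPERTIES ('EXTERNAL_TABLES_ONLY'='true');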
You can configure the CREATE TABLE behavior at the site level by configuring the
hive.create.as.insert.only and hive.create.as.acid properties in Cloudera Manager under Hive configuration.
When configured at the site level, the behavior persists from session to session. For more
information, see Configuring CREATE TABLE behavior.
If you are a Spark user, switching to legacy behavior is unnecessary. Calling ‘create table’ from SparkSQL, for
example, creates an external table after upgrading to CDP as it did before the upgrade. You can connect to Hive using
the Hive Warehouse Connector (HWC) to read Hive ACID tables from Spark. To write ACID tables to Hive from
Spark, you use the HWC and HWC API. Spark creates an external table with the purge property when you do not use
the HWC API. For more information, see Hive Warehouse Connector for accessing Spark data.
Related Information
Configuring legacy CREATE TABLE behavior
Procedure
1. Start Hive.
2. Enter your user name and password.
The Hive 3 connection message, followed by the Hive prompt for entering SQL queries on the command line,
appears.
3. Create a CRUD transactional table named T having two integer columns, a and b:
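A minimal sketch, assuming the default managed, ACID table behavior:

CREATE TABLE T (a int, b int);

4. Confirm that you created a managed, ACID table: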
DESCRIBE FORMATTED T;
Procedure
1. Start Hive.
2. Enter your user name and password.
The Hive 3 connection message, followed by the Hive prompt for entering SQL queries on the command line,
appears.
3. Create an insert-only transactional table named T2 having two integer columns, a and b:
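A sketch of such a statement:

CREATE TABLE T2 (a int, b int)
STORED AS ORC
TBLPROPERTIES ('transactional'='true', 'transactional_properties'='insert_only');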
The 'transactional_properties'='insert_only' is required; otherwise, a CRUD table results. The STORED AS ORC
clause is optional (default = ORC).
4. Create an insert-only transactional table for text data.
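For example, a sketch using text storage:

CREATE TABLE T3 (a int, b int)
STORED AS TEXTFILE
TBLPROPERTIES ('transactional'='true');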
The 'transactional_properties'='insert_only' is not required because the storage format is other than ORC.
Related Information
HMS storage
You need to set up access to external tables in the file system or object store using Ranger.
Procedure
1. Create a text file named students.csv that contains the following lines.
1,jane,doe,senior,mathematics
2,john,smith,junior,engineering
2. Move the file to HDFS, placing students.csv in a directory named andrena.
3. Start the Hive shell.
For example, substitute the URI of your HiveServer: beeline -u jdbc:hive2://myhiveserver.com:10000 -n hive -p
4. Create an external table schema definition that specifies the text format and loads data from students.csv in /user/andrena.
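A sketch of the definition, assuming the comma-delimited layout of students.csv:

CREATE EXTERNAL TABLE IF NOT EXISTS names_text (
  student_ID INT, FirstName STRING, LastName STRING,
  year STRING, Major STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/andrena';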
5. Verify that the Hive warehouse stores the student names in the external table.
SELECT * FROM names_text;
6. Create the schema for a managed table.
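One possible schema, mirroring the external table columns:

CREATE TABLE IF NOT EXISTS names (
  student_ID INT, FirstName STRING, LastName STRING,
  year STRING, Major STRING);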
In this task, you create an external table and load data from a .csv file that is stored in Ozone. You can use the
LOCATION clause in the CREATE EXTERNAL TABLE statement to specify the location of the external table data.
The metadata is stored in the Hive warehouse.
Procedure
1. Create a text file named employee.csv that contains the following records.
1,Andrew,45000,Technical Manager
2,Sam,35000,Proof Reader
2. Move the employee.csv file to a directory called employee_hive in the Ozone filesystem.
3. Connect to the gateway node of the cluster and on the command line of the cluster, launch Beeline to start the
Hive shell.
beeline -u jdbc:hive2://myhiveserver.com:10000 -n hive -p
The Hive 3 connection message appears, followed by the Hive prompt for entering queries on the command line.
4. Create an external table schema definition that specifies the text format and loads data from employee.csv in ofs://ozone1/vol1/bucket1/employee_hive.
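A sketch of the definition, assuming the comma-delimited layout of employee.csv:

CREATE EXTERNAL TABLE IF NOT EXISTS employee (
  id INT, name STRING, salary INT, designation STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'ofs://ozone1/vol1/bucket1/employee_hive';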
5. Verify that the Hive warehouse stores the employee.csv records in the external table.
SELECT * FROM employee;
Related Information
Setting up Ranger policies to access Hive files in Ozone
Commands for managing Ozone volumes and buckets
Recommended Hive configurations when using Ozone
Procedure
1. In Cloudera Manager, click Clusters > Ozone > Configuration to navigate to the configuration page for Ozone.
2. Search for ranger_service, and enable the property.
3. Click Clusters > Ranger > Ranger Admin Web UI, enter your user name and password, then click Sign In.
The Service Manager for Resource Based Policies page is displayed in the Ranger console.
7. Click the Service Manager link in the breadcrumb trail and then click the Hadoop SQL preloaded resource-based
service to update the Hadoop SQL URL policy.
8. In the Hadoop SQL policies page, click the Policy ID or click Edit against the "all - url" policy to modify
the policy details.
By default, "hive", "hue", "impala", "admin" and a few other users are provided access to all the Ozone URLs.
You can select users and groups in addition to the default. To grant everyone access, add the "public" group to the
group list. Every user is then subject to your allow conditions.
What to do next
Create a Hive external table having source data in Ozone.
Also, it is recommended that you set certain Hive configurations before querying Hive tables in Ozone.
Related Information
Using Ranger with Ozone
Creating an Ozone-based Hive external table
Creating partitions dynamically
Recommended Hive configurations when using Ozone
Configurations
The following configurations can be specified through Beeline during runtime using the SET command. For example,
SET key=value;. The configuration persists only for the particular session or query. If you want to set it permanently,
then specify the properties in hive-site.xml using the Cloudera Manager Safety Valve:
Table 1:
Configuration Value
hive.optimize.sort.dynamic.partition true
hive.optimize.sort.dynamic.partition.threshold 0
hive.query.results.cache.enabled false
hive.acid.direct.insert.enabled true
hive.orc.splits.include.fileid false
Important: If you notice that some queries are taking a longer time to complete or failing entirely (usually
noticed in large clusters), you can choose to revert the value of hive.optimize.sort.dynamic.partition.threshold
to "-1". The performance issue is related to HIVE-26283.
Related Information
Creating an Ozone-based Hive external table
Setting up Ranger policies to access Hive files in Ozone
Related Information
Before and After Upgrading Table Type Comparison
Procedure
Using MariaDB, create an external table based on a user-defined schema.
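A hypothetical sketch; the schema, table, and data are illustrative only:

CREATE SCHEMA bob;
CREATE TABLE bob.country (id INT, name VARCHAR(20));
INSERT INTO bob.country VALUES (1, 'India'), (2, 'Russia'), (3, 'USA');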
Procedure
1. Using MS SQL, create an external table based on a user-defined schema.
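A hypothetical sketch for SQL Server; names are illustrative only:

CREATE SCHEMA bob;
GO
CREATE TABLE bob.country (id INT, name VARCHAR(20));
INSERT INTO bob.country VALUES (1, 'India'), (2, 'Russia'), (3, 'USA');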
3. Allow the user to connect to the database and run queries. For example:
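A hypothetical sketch, assuming a login named greg:

CREATE LOGIN greg WITH PASSWORD = 'GregPass123!$';
CREATE USER greg FOR LOGIN greg;
GRANT CONNECT TO greg;
GRANT SELECT, INSERT ON SCHEMA::bob TO greg;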
Procedure
1. Using Oracle XE edition, connect to the PDB.
2. Create the bob schema/user and give appropriate connections to be able to connect to the database.
3. Create the alice schema/user, give appropriate connections to be able to connect to the database, and create an
external table.
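A hypothetical sketch for steps 2 and 3; user names and passwords are illustrative only:

CREATE USER bob IDENTIFIED BY bobpwd;
GRANT CREATE SESSION TO bob;
CREATE USER alice IDENTIFIED BY alicepwd;
GRANT CREATE SESSION, CREATE TABLE TO alice;
ALTER USER alice QUOTA UNLIMITED ON users;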
5. Allow the users to perform inserts on any table/view in the database, not only those present on their own schema.
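A hypothetical sketch of such grants:

GRANT INSERT ANY TABLE TO bob;
GRANT INSERT ANY TABLE TO alice;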
Procedure
1. Using Postgres, create external tables based on a user-defined schema.
2. Create a user and associate them with a default schema, that is, a search_path.
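A hypothetical sketch for steps 1 and 2; names and the password are illustrative only:

CREATE SCHEMA bob;
CREATE TABLE bob.country (id INT, name VARCHAR(20));
CREATE USER greg WITH PASSWORD 'GregPass123';
ALTER USER greg SET search_path TO bob;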
Using constraints
You can use SQL constraints to enforce data integrity and improve performance. Using constraints, the optimizer can
simplify queries. Constraints can make data predictable and easy to locate. Using constraints and supported modifiers,
you can constrain queries to unique or not-null values, for example.
You can use the constraints listed below in your queries. Hive enforces DEFAULT, NOT NULL, and CHECK only,
not PRIMARY KEY, FOREIGN KEY, and UNIQUE. DEFAULT, even if enforced, does not support complex types
(array, map, struct). Constraint enforcement is limited to the metadata level. This limitation aids integration with
third-party tools and optimization of constraint declarations, such as materialized view rewriting.
CHECK
Limits the range of values you can place in a column.
DEFAULT
Ensures a value exists, which is useful in offloading data from a data warehouse.
PRIMARY KEY
Identifies each row in a table using a unique identifier.
FOREIGN KEY
Identifies a row in another table using a unique identifier.
UNIQUE KEY
Checks that values stored in a column are different.
NOT NULL
Ensures that a column cannot be set to NULL.
Supported modifiers
You can use the following optional modifiers:
ENABLE
Ensures that all incoming data conforms to the constraint.
DISABLE
Does not ensure that all incoming data conforms to the constraint.
VALIDATE
Checks that all existing data in the table conforms to the constraint.
NOVALIDATE
Does not check that all existing data in the table conforms to the constraint.
ENFORCED
Maps to ENABLE NOVALIDATE.
NOT ENFORCED
Maps to DISABLE NOVALIDATE.
RELY
Specifies abiding by a constraint; used by the optimizer to apply further optimizations.
NORELY
Specifies not abiding by a constraint.
Note: For external tables, the RELY constraint is the only supported constraint.
Default modifiers
The following default modifiers are in place:
• The default modifier for ENABLE is NOVALIDATE RELY.
• The default modifier for DISABLE is NOVALIDATE NORELY.
• If you do not specify a modifier when you declare a constraint, the default is ENABLE NOVALIDATE RELY.
The following constraints do not support ENABLE:
• PRIMARY KEY
• FOREIGN KEY
• UNIQUE KEY
To prevent an error, specify a modifier when using these constraints to override the default.
Constraints examples
The optimizer uses the constraint information to make smart decisions. The following examples show the use of
constraints.
The following example shows how to create a table that declares the NOT NULL in-line constraint to constrain a
column.
The constrained column b accepts a SMALLINT value as shown in the first INSERT statement.
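A sketch of the table and inserts described above; the second insert fails because b is constrained:

CREATE TABLE t (a TINYINT, b SMALLINT NOT NULL ENABLE);
INSERT INTO t VALUES (2, 45);    -- succeeds
INSERT INTO t VALUES (3, NULL);  -- fails: b cannot be NULL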
The following example shows how to declare the FOREIGN KEY constraint out of line. You can specify a
constraint name, in this case fk, in an out-of-line constraint.
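A sketch of two such tables, with fk declared out of line:

CREATE TABLE Persons (
  ID INT NOT NULL,
  Name STRING NOT NULL,
  PRIMARY KEY (ID) DISABLE NOVALIDATE);

CREATE TABLE BusinessUnit (
  ID INT NOT NULL,
  Head INT NOT NULL,
  PRIMARY KEY (ID) DISABLE NOVALIDATE,
  CONSTRAINT fk FOREIGN KEY (Head) REFERENCES Persons(ID) DISABLE NOVALIDATE);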
Apache Hive 3 ACID transactions
Procedure
1. In the Hive shell, get an extended description of the table.
For example: DESCRIBE EXTENDED mydatabase.mytable;
2. Scroll to the bottom of the command output to see the table type.
The following output says the table type is managed. transactional=true indicates that the table has ACID properties:
...
| Detailed Table Information | Table(tableName:t2, dbName:mydatabase, o
wner:hdfs, createTime:1538152187, lastAccessTime:0, retention:0, sd:Stor
ageDescriptor(cols:[FieldSchema(name:a, type:int, comment:null), FieldSc
hema(name:b, type:int, comment:null)], ...
Related Information
HMS storage
Assume that three insert operations occur, and the second one fails:
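A sketch, assuming a table named tm:

INSERT INTO tm VALUES (1);
INSERT INTO tm VALUES (2);  -- this insert fails
INSERT INTO tm VALUES (3);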
For every write operation, Hive creates a delta directory to which the transaction manager writes data files. Hive
writes all data to delta files, designated by write IDs, and mapped to a transaction ID that represents an atomic
operation. If a failure occurs, the transaction is marked aborted, but it is atomic:
tm
├── delta_0000001_0000001_0000
│   └── 000000_0
├── delta_0000002_0000002_0000   // Fails
│   └── 000000_0
└── delta_0000003_0000003_0000
    └── 000000_0
During the read process, the transaction manager maintains the state of every transaction. When the reader starts, it
asks for the snapshot information, represented by a high watermark. The watermark identifies the highest transaction
ID in the system followed by a list of exceptions that represent transactions that are still running or are aborted.
The reader looks at deltas and filters out, or skips, any IDs of transactions that are aborted or still running. The reader
uses this technique with any number of partitions or tables that participate in the transaction to achieve atomicity and
isolation of operations on transactional tables.
Running SHOW CREATE TABLE acidtbl provides information about the defaults: transactional (ACID) and the
ORC data storage format:
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `acidtbl`( |
| `a` int, |
| `b` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 's3://myserver.com:8020/warehouse/tablespace/managed/hive/acidtbl' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'transactional'='true', |
| 'transactional_properties'='default', |
| 'transient_lastDdlTime'='1555090610') |
+----------------------------------------------------+
Tables that support updates and deletions require a slightly different technique to achieve atomicity and isolation.
Hive runs in append-only mode, which means Hive does not perform in-place updates or deletions. Isolation of
readers and writers cannot occur in the presence of in-place updates or deletions. In this situation, a lock manager or
some other mechanism is required for isolation. These mechanisms create a problem for long-running queries.
Instead of in-place updates, Hive decorates every row with a row ID. The row ID is a struct that consists of the
following information:
• The write ID that maps to the transaction that created the row
• The bucket ID, a bit-packed integer with several bits of information, of the physical writer that created the row
• The row ID, which numbers rows as they were written to a data file
Instead of in-place deletions, Hive appends changes to the table when a deletion occurs. The deleted data becomes
unavailable and the compaction process takes care of the garbage collection later.
Create operation
The following example inserts several rows of data into a full CRUD transactional table, creates a delta file, and adds
row IDs to a data file.
INSERT INTO acidtbl (a,b) VALUES (100, "oranges"), (200, "apples"), (300, "bananas");
This operation generates a directory and file, delta_00001_00001/bucket_0000, that have the following data:
ROW_ID     a    b
{1,0,0}    100  "oranges"
{1,0,1}    200  "apples"
{1,0,2}    300  "bananas"
Delete operation
A delete statement that matches a single row also creates a delta file, called the delete-delta. The file stores a set of
row IDs for the rows that match your query. At read time, the reader looks at this information. When it finds a delete
event that matches a row, it skips the row and that row is not included in the operator pipeline. The following example
deletes data from a transactional table:
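A sketch of such a statement:

DELETE FROM acidtbl WHERE a = 200;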
This operation generates a directory and file, delete_delta_00002_00002/bucket_0000, that have the following data:
ROW_ID     a     b
{1,0,1}    null  null
Update operation
An update combines the deletion and insertion of new data. The following example updates a transactional table:
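A sketch of such a statement:

UPDATE acidtbl SET b = "pears" WHERE a = 300;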
One delta file contains the delete event, and the other contains the insert event.
The reader, which requires the AcidInputFormat, applies all the insert events and encapsulates all the logic to handle
delete events. A read operation first gets snapshot information from the transaction manager based on which it selects
files that are relevant to that read operation. Next, the process splits each data file into the number of pieces that each
process has to work on. Relevant delete events are localized to each processing task. Delete events are stored in a
sorted ORC file. The compressed, stored data is minimal, which is a significant advantage of Hive 3. You no longer
need to worry about saturating the network with insert events in delta files.
Apache Hive query basics

information_schema data reveals the state of the system, similar to sys database data, but in a user-friendly,
read-only way. You can use joins, aggregates, filters, and projections in
information_schema queries.
Procedure
1. Open Ranger Access Manager, and check that the preloaded default database tables columns and
information_schema database policies are enabled for group public.
2. List the databases to confirm that information_schema is available.
SHOW DATABASES;
...
+---------------------+
| database_name |
+---------------------+
| default |
| information_schema |
| sys |
+---------------------+
3. Switch to the information_schema database and list its tables.
USE information_schema;
...
SHOW TABLES;
...
+--------------------+
| tab_name |
+--------------------+
| column_privileges |
| columns |
| schemata |
| table_privileges |
| tables |
| views |
+--------------------+
4. Query the information_schema database to see, for example, information about tables into which you can insert
values.
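A sketch of such a query:

SELECT table_schema, table_name
FROM information_schema.tables
WHERE is_insertable_into = 'YES'
LIMIT 2;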
Procedure
1. Create an ACID table to contain student information.
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3,2));
2. Insert name, age, and gpa values for a few students into the table.
INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
3. Create a table called pageviews and assign null values to columns you do not want to assign a value.
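A sketch, assigning null to the column that has no value:

CREATE TABLE pageviews (userid VARCHAR(64), link STRING, origin STRING);
INSERT INTO TABLE pageviews VALUES
  ('jsmith', 'mail.com', 'sports.com'),
  ('jdoe', 'mail.com', null);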
Procedure
Create a statement that changes the values in the name column of all rows where the gpa column has the value of 1.0.
UPDATE students SET name = null WHERE gpa <= 1.0;
Procedure
1. Construct a query to update the customers' names and states in the customer target table to match the names and
states of customers having the same IDs in the new_customer_stage source table.
2. Enhance the query to insert data from new_customer_stage table into the customer table if none already exists.
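A sketch of such a MERGE statement, assuming sub as the source alias:

MERGE INTO customer USING (SELECT * FROM new_customer_stage) AS sub
ON sub.id = customer.id
WHEN MATCHED THEN UPDATE SET name = sub.name, state = sub.state
WHEN NOT MATCHED THEN INSERT VALUES (sub.id, sub.name, sub.state);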
Update or delete data using MERGE in a similar manner.
Note: You can map specific columns in the INSERT clause of the query instead of passing values
(including null) for columns in the target table that do not have any data to insert. The unspecified
columns in the INSERT clause are either mapped to null or use default constraints, if any.
For example, you can construct the INSERT clause as WHEN NOT MATCHED THEN INSERT VALUES
(customer.id=sub.id, customer.name=sub.name, customer.state=sub.state) instead of WHEN
NOT MATCHED THEN INSERT VALUES (sub.id, sub.name, 'null', sub.state).
Related Information
Merge documentation on the Apache wiki
Procedure
Delete any rows of data from the students table if the gpa column has a value of 1 or 0.
DELETE FROM students WHERE gpa <= 1.0;
Procedure
1. Create a temporary table having one string column.
CREATE TEMPORARY TABLE tmp1 (tname varchar(64));
2. Create a temporary table using the CREATE TABLE AS SELECT (CTAS) statement.
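A sketch of a CTAS temporary table:

CREATE TEMPORARY TABLE tmp2 AS SELECT * FROM tmp1;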
3. Create a temporary table using the CREATE TEMPORARY TABLE LIKE statement.
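A sketch using LIKE:

CREATE TEMPORARY TABLE tmp3 LIKE tmp1;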
Related Information
Create/Drop/Truncate Table on the Apache wiki
Procedure
1. Configure Hive to store temporary table data either in memory or on SSD by setting hive.exec.temporary.table.storage.
• Store data in memory: set hive.exec.temporary.table.storage to memory.
• Store data on SSD: set hive.exec.temporary.table.storage to ssd.
2. Create and use temporary tables.
Results
Hive drops temporary tables at the end of the session.
Using a subquery
Hive supports subqueries in FROM clauses and WHERE clauses that you can use for many Apache Hive operations,
such as filtering data from one table based on contents of another table.
Procedure
Select all the state and net_payments values from the transfer_payments table if the value of the year column in the
table matches a year in the us_census table.
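A sketch of such a query:

SELECT state, net_payments
FROM transfer_payments
WHERE transfer_payments.year IN (SELECT year FROM us_census);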
The predicate starts with the first WHERE keyword. The predicate operator is the IN keyword.
The predicate returns true for a row in the transfer_payments table if the year value in at least one row of the
us_census table matches a year value in the transfer_payments table.
Subquery restrictions
To construct queries efficiently, you must understand the restrictions of subqueries in WHERE clauses.
• Subqueries must appear on the right side of an expression.
• Nested subqueries are not supported.
• Subquery predicates must appear as top-level conjuncts.
• Subqueries support four logical operators in query predicates: IN, NOT IN, EXISTS, and NOT EXISTS.
• The IN and NOT IN logical operators may select only one column in a WHERE clause subquery.
• The EXISTS and NOT EXISTS operators must have at least one correlated predicate.
• The left side of a subquery must qualify all references to table columns.
• References to columns in the parent query are allowed only in the WHERE clause of the subquery.
• Subquery predicates that reference a column in a parent query must use the equals (=) predicate operator.
• Subquery predicates may refer only to columns in the parent query.
• Correlated subqueries with an implied GROUP BY statement may return only one row.
• All unqualified references to columns in a subquery must resolve to tables in the subquery.
• Correlated subqueries cannot contain windowing clauses.
Procedure
1. Construct a query that returns the average salary of all employees in the engineering department grouped by year.
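A sketch, assuming an employees table with department, salary, and year columns:

SELECT year, AVG(salary)
FROM employees
WHERE department = 'engineering'
GROUP BY year;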
Procedure
Select all state and net_payments values from the transfer_payments table for years during which the value of the
state column in the transfer_payments table matches the value of the state column in the us_census table.
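A sketch of such a correlated query:

SELECT state, net_payments
FROM transfer_payments
WHERE EXISTS (
  SELECT year
  FROM us_census
  WHERE transfer_payments.state = us_census.state);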
This query is correlated because one side of the equals predicate operator in the subquery references the state column
in the transfer_payments table in the parent query and the other side of the operator references the state column in the
us_census table.
This statement includes a conjunct in the WHERE clause.
A conjunct is equivalent to the AND condition, while a disjunct is the equivalent of the OR condition. The following
subquery contains a conjunct:
... WHERE transfer_payments.year = "2018" AND us_census.state = "california"
The following subquery contains a disjunct:
... WHERE transfer_payments.year = "2018" OR us_census.state = "california"
Procedure
1. Use a CTE to create a table based on another table that you select using the CREATE TABLE AS SELECT
(CTAS) clause.
CREATE TABLE s2 AS WITH q1 AS (SELECT key FROM src WHERE key = '4') SELECT
* FROM q1;
2. Use a CTE to create a view.
CREATE VIEW v1 AS WITH q1 AS (SELECT key FROM src WHERE key='5') SELECT *
FROM q1;
You run the following query to match all values in c1 of tbl not equal to any value in c2 from the same tbl.
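A sketch of such a query:

SELECT c1 FROM tbl
WHERE c1 NOT IN (SELECT c2 FROM tbl);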
Set the hive.support.quoted.identifiers configuration parameter to column in the hive-site.xml file to enable quoted
identifiers in column names. Valid values are none and column. For example, in Hive execute the following
command: SET hive.support.quoted.identifiers = column.
Procedure
1. Create a table named test that has two columns of strings specified by quoted identifiers:
CREATE TABLE test (`x+y` String, `a?b` String);
2. Create a table that defines a partition using a quoted identifier and a region number:
CREATE TABLE `partition_date-1` (key string, value string) PARTITIONED BY (`dt+x` date, region int);
3. Create a table that defines clustering using a quoted identifier:
CREATE TABLE bucket_test(`key?1` string, value string) CLUSTERED BY (`key?1`) into 5 buckets;
Creating a default directory for managed tables
Use the following syntax to create a database that specifies a location for managed tables:
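A sketch of the syntax:

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION external_table_path]
  [MANAGEDLOCATION managed_table_directory_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];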
Do not set LOCATION and MANAGEDLOCATION to the same file system path.
Use the following syntax to set or change a location for managed tables.
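A sketch of the syntax:

ALTER (DATABASE|SCHEMA) database_name SET MANAGEDLOCATION 'managed_table_directory_path';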
Procedure
1. Create a database mydatabase that specifies a top level directory named sales for managed tables.
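A minimal sketch, assuming a hypothetical warehouse path:

CREATE DATABASE mydatabase MANAGEDLOCATION '/warehouse/tablespace/managed/hive/sales';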
Generating surrogate keys
The table you want to join using surrogate keys cannot have column types that need casting. These data types must be
primitives, such as INT or STRING.
Procedure
1. Create a students table in the default ORC format that has ACID properties.
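A sketch of such a table, matching the INSERT that follows:

CREATE TABLE students (row_id INT, name VARCHAR(64), dorm INT);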
2. Insert data into the table.
INSERT INTO TABLE students VALUES (1, 'fred flintstone', 100), (2, 'barney rubble', 200);
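3. Create a version of the table that uses the SURROGATE_KEY UDF as the default value of a new ID column. A
sketch, assuming the columns above:

CREATE TABLE students_v2 (
  ID BIGINT DEFAULT SURROGATE_KEY(),
  row_id INT,
  name VARCHAR(64),
  dorm INT,
  PRIMARY KEY (ID) DISABLE NOVALIDATE);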
4. Insert data, which automatically generates surrogate keys for the primary keys.
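A sketch of the insert:

INSERT INTO students_v2 (row_id, name, dorm) SELECT * FROM students;

5. Query the new table:

SELECT * FROM students_v2;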
+----------------+---------------------+-------------------+-------------------+
| students_v2.id | students_v2.row_id  | students_v2.name  | students_v2.dorm  |
+----------------+---------------------+-------------------+-------------------+
| 1099511627776  | 1                   | fred flintstone   | 100               |
| 1099511627777  | 2                   | barney rubble     | 200               |
+----------------+---------------------+-------------------+-------------------+
6. Add the surrogate keys as a foreign key to another table, such as a student_grades table, to speed up subsequent
joins of the tables.
Partitions and performance

CREATE TABLE sale(id int, amount decimal) PARTITIONED BY (xdate string, state string);
To insert data into this table, you specify the partition key for fast loading:
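A sketch of a static-partition insert, assuming a hypothetical staging table:

INSERT INTO sale PARTITION (xdate='2016-03-08', state='CA')
SELECT id, amount FROM staging_table
WHERE xdate = '2016-03-08' AND state = 'CA';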
You do not need to specify dynamic partition columns. Hive generates a partition specification if you enable dynamic
partitions.
Follow these best practices when you partition tables and query partitioned tables:
• Never partition on a unique ID.
• Size partitions to greater than or equal to 1 GB on average.
• Design queries to process not more than 1000 partitions.
Procedure
1. Remove the dept=sales object from the file system.
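For example, a sketch assuming a hypothetical warehouse path:

hdfs dfs -rm -r /warehouse/tablespace/external/hive/emp_part/dept=sales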
2. From the Hive command line, look at the emp_part table partitions.
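For example:

SHOW PARTITIONS emp_part;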
+----------------+
| partition |
+----------------+
| dept=finance |
| dept=sales |
| dept=service |
+----------------+
Query scheduling
Apache Hive scheduled queries provide a simple, secure way to create, manage, and monitor scheduled jobs. You
can use scheduled queries as an alternative to OS-level schedulers, such as cron, Apache Oozie, or Apache Airflow.
Using SQL statements, you can schedule Hive queries to run on a recurring basis, monitor query progress, and
optionally disable a query schedule. You can run queries to ingest data periodically, refresh materialized views,
replicate data, and perform other repetitive tasks. For example, you can insert data from a stream into a transactional
table every 10 minutes, refresh a materialized view used for BI reporting every hour, and replicate data from one
cluster to another on a daily basis.
A Hive scheduled query consists of the following parts:
• A unique name for the schedule
• The SQL statement to be executed
• The execution schedule defined by a Quartz cron expression.
Quartz cron expressions are expressive and flexible. For instance, an expression can describe a simple schedule such
as every 10 minutes, but also an execution happening at 10 AM on the first Sunday of the month in January and
February of 2021 and 2022. You can describe common schedules in an easily comprehensible format, for example,
every 20 minutes or every day at 3:25:00.
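A sketch of Quartz expressions for those schedules (fields: second, minute, hour, day-of-month, month, day-of-week):

0 0/20 * * * ?    -- every 20 minutes
0 25 3 * * ?      -- every day at 3:25:00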
Operation
A scheduled query belongs to a namespace, which is a collection of HiveServer (HS2) instances that are responsible
for executing the query. Scheduled queries are stored in the Hive metastore. The metastore stores scheduled queries,
the status of ongoing and previously executed statements, and other information. HiveServer periodically polls
the metastore to retrieve scheduled queries that are due to be executed. If you run multiple HiveServer roles, the
metastore guarantees that only one of them executes a certain scheduled query at any given time.
You create, alter, and drop scheduled queries using dedicated SQL statements.
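A sketch of the statements, with hypothetical names:

CREATE SCHEDULED QUERY sq1 CRON '0 */10 * * * ? *' AS INSERT INTO t VALUES (1);
ALTER SCHEDULED QUERY sq1 EVERY 20 MINUTES;
DROP SCHEDULED QUERY sq1;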
Related Information
Apache Hive Language Manual--Scheduled Queries
Procedure
1. In Cloudera Manager, click Clusters > Hive on Tez > Configuration.
2. In Search, enter safety.
3. In Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml HIVE_ON_TEZ-1 (Service-
Wide), click + and add the following property: hive.scheduled.queries.create.as.enabled
4. Set the value to true.
5. Save and restart Hive on Tez.
Imagine that you add data for a number of employees to the table. Assume many users of your database issue queries
to access data about the employees hired during the last year, including the department they belong to.
You perform the steps below to create a materialized view of the table to address these queries. Imagine new
employees are hired and you add their records to the table. These changes render the materialized view contents
outdated. You need to update its contents. You create a scheduled query to perform this task. The scheduled
rebuilding will not occur unless there are changes to the input tables. You test the scheduled query by bypassing the
schedule and executing it immediately. Finally, you change the schedule to rebuild less often.
Procedure
1. To handle many queries to access recently hired employee and departmental data, create a materialized view.
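A sketch, assuming emps and depts tables:

CREATE MATERIALIZED VIEW mv_recently_hired AS
SELECT empid, name, deptname, hire_date
FROM emps JOIN depts ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2020-01-01 00:00:00';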
3. Assuming new hiring occurred and you added new records to the emps table, rebuild the materialized view.
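A sketch of the rebuild statement:

ALTER MATERIALIZED VIEW mv_recently_hired REBUILD;

4. Create a scheduled query to rebuild the materialized view every 10 minutes. A sketch:

CREATE SCHEDULED QUERY scheduled_rebuild
EVERY 10 MINUTES AS
ALTER MATERIALIZED VIEW mv_recently_hired REBUILD;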
A rebuild executes every 10 minutes, assuming changes to the emp table occur within that period. If a materialized
view can be rebuilt incrementally, the scheduled rebuild does not occur unless there are changes to the input
tables.
5. To test the schedule, run a scheduled query immediately.
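A sketch of the statement:

ALTER SCHEDULED QUERY scheduled_rebuild EXECUTE;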
Related Information
Apache Hive Language Manual--Scheduled Queries
Procedure
1. Query the information schema to get information about a schedule.
SELECT *
FROM information_schema.scheduled_queries;
2. Query the information schema to get information about scheduled query executions.
SELECT *
FROM information_schema.scheduled_executions;
You can configure the retention period for this information in the Hive metastore.
scheduled_execution_id
Unique numeric identifier for a scheduled query execution.
schedule_name
Name of the scheduled query associated with this execution.
executor_query_id
Query ID assigned to the execution by HiveServer (HS2).
state
One of the following phases of execution.
• STARTED. A scheduled query is due and a HiveServer instance has retrieved its information.
• EXECUTING. HiveServer is executing the query and reporting progress in configurable
intervals.
• FAILED. The query execution was stopped due to an error or exception.
• FINISHED. The query execution was successful.
• TIMED_OUT. HiveServer did not provide an update on the query status for more than a
configurable timeout.
start_time
Start time of execution.
end_time
End time of execution.
elapsed
Difference between start and end time.
error_message
If the scheduled query failed, it contains the error message associated with its failure.
last_update_time
Time of the last update of the query status by HiveServer.
Related Information
Apache Hive Language Manual--Scheduled Queries
Materialized views
A materialized view is a Hive-managed database object that holds a query result you can use to speed up the
execution of a query workload. If your queries are repetitive, you can reduce latency and resource consumption by
using materialized views. You create materialized views to optimize your queries automatically.
Using a materialized view, the optimizer can compare old and new tables, rewrite queries to accelerate processing,
and manage maintenance of the materialized view when data updates occur. The optimizer can use a materialized
view to fully or partially rewrite projections, filters, joins, and aggregations.
You can perform the following materialized view operations:
• Create a materialized view of queries or subqueries
• Drop a materialized view
• Show materialized views
• Describe a materialized view
Procedure
1. Create two ACID tables:
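A sketch of two such tables:

CREATE TABLE emps (empid INT, deptno INT, name VARCHAR(256), salary FLOAT, hire_date TIMESTAMP);
CREATE TABLE depts (deptno INT, deptname VARCHAR(256), locationid INT);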
4. Run a query that takes advantage of the precomputation performed by the materialized view:
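A sketch of such a query, assuming the emps and depts tables above:

SELECT empid, deptname
FROM emps JOIN depts ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2017-01-01 00:00:00';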
Output is:
+--------+-----------+
| empid | deptname |
+--------+-----------+
| 10003 | Sup |
| 10002 | HR |
| 10001 | Eng |
+--------+-----------+
The flights data in this task includes an ID, destination, and origin for each flight:
1 Chicago Hyderabad
2 London Moscow
...
Procedure
1. Create a table schema definition named flights_data for destination and origin data.
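A sketch of the definition:

CREATE TABLE flights_data (c_id INT, dest VARCHAR(256), origin VARCHAR(256));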
3. Take advantage of the materialized view to speed your queries when you have to count destinations and origins
again.
For example, use a subquery to select the number of destination-origin pairs like the materialized view.
SELECT count(*)/2
FROM(
SELECT dest, origin, count(*)
FROM flights_data
GROUP BY dest, origin
) AS t;
Transparently, the SQL engine uses the work already in place since creation of the materialized view instead of
reprocessing.
Related Information
Materialized view commands
Procedure
Drop a materialized view in my_database named mv1.
DROP MATERIALIZED VIEW my_database.mv1;
Related Information
Materialized view commands
Procedure
1. List materialized views in the current database.
SHOW MATERIALIZED VIEWS;
2. List materialized views in a particular database.
SHOW MATERIALIZED VIEWS IN another_database;
Related Information
Materialized view commands
Procedure
1. Get summary information about the materialized view named mv1.
DESCRIBE mv1;
+------------+---------------+----------+
| col_name | data_type | comment |
+------------+---------------+----------+
| empid | int | |
| deptname | varchar(256) | |
| hire_date | timestamp | |
+------------+---------------+----------+
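2. Get detailed information about the materialized view.

DESCRIBE EXTENDED mv1;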
+-----------------------------+---------------------------------...
| col_name | data_type ...
+-----------------------------+---------------------------------...
| empid | int ...
| deptname | varchar(256) ...
| hire_date | timestamp ...
| | NULL ...
| Detailed Table Information |Table(tableName:mv1, dbName:default, own
er:hive, createTime:1532466307, lastAccessTime:0, retention:0, sd:Storag
eDescriptor(cols:[FieldSchema(name:empid, type:int, comment:null), Field
Schema(name:deptname, type:varchar(256), comment:null), FieldSchema(name
:hire_date, type:timestamp, comment:null)], location:hdfs://myserver.com
:8020/warehouse/tablespace/managed/hive/mv1, inputFormat:org.apache.hado
op.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql
.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerD
eInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSe
rde, parameters:{}), bucketCols:[], sortCols:[], parameters:{}, skewedIn
fo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocat
ionMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters
:{totalSize=488, numRows=4, rawDataSize=520, COLUMN_STATS_ACCURATE={\"BA
SIC_STATS\":\"true\"}, numFiles=1, transient_lastDdlTime=1532466307, buc
keting_version=2}, viewOriginalText:SELECT empid, deptname, hire_date\nF
ROM emps2 JOIN depts\nON (emps2.deptno = depts.deptno)\nWHERE hire_date >=
'2017-01-17', viewExpandedText:SELECT `emps2`.`empid`, `depts`.`deptname`
, `emps2`.`hire_date`\nFROM `default`.`emps2` JOIN `default`.`depts`\nON
(`emps2`.`deptno` = `depts`.`deptno`)\nWHERE `emps2`.`hire_date` >= '20
17-01-17', tableType:MATERIALIZED_VIEW, rewriteEnabled:true, creationMet
adata:CreationMetadata(catName:hive, dbName:default, tblName:mv1, tables
Used:[default.depts, default.emps2], validTxnList:53$default.depts:2:922
3372036854775807::$default.emps2:4:9223372036854775807::, materializatio
nTime:1532466307861), catName:hive, ownerType:USER)
3. Get extensive formatted information about the materialized view.
DESCRIBE FORMATTED mv1;
+-------------------------------+--------------------------------...
| col_name | data_type ...
+-------------------------------+--------------------------------...
| # col_name | data_type ...
| empid | int ...
| deptname | varchar(256) ...
| hire_date | timestamp ...
| | NULL ...
| # Detailed Table Information | NULL ...
| Database: | default ...
| OwnerType: | USER ...
| Owner: | hive ...
| CreateTime: | Tue Jul 24 21:05:07 UTC 2019 ...
| LastAccessTime: | UNKNOWN ...
| Retention: | 0 ...
| Location: | hdfs://myserver...
| Table Type: | MATERIALIZED_VIEW ...
| Table Parameters: | NULL ...
| | COLUMN_STATS_ACCURATE ...
| | bucketing_version ...
| | numFiles ...
| | numRows ...
| | rawDataSize ...
| | totalSize ...
| | transient_lastDdlTime ...
| | NULL ...
| InputFormat: | org.apache.hadoop.hive.ql.io.or...
| OutputFormat: | org.apache.hadoop.hive.ql.io.or...
| Compressed: | No ...
Related Information
Materialized view commands
Procedure
1. Disable rewriting of a query based on a materialized view named mv1 in the default database.
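A sketch of the statement:

ALTER MATERIALIZED VIEW default.mv1 DISABLE REWRITE;

2. Enable rewriting of a query based on the materialized view.

ALTER MATERIALIZED VIEW default.mv1 ENABLE REWRITE;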
3. Globally enable rewriting of queries based on materialized views by setting a global property.
SET hive.materializedview.rewriting=true;
Related Information
Materialized view commands
Procedure
1. Create a scheduled query to invoke the rebuild statement every 10 minutes.
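A sketch of such a scheduled query, assuming a materialized view named mv1:

CREATE SCHEDULED QUERY rebuild_mv1 EVERY 10 MINUTES AS ALTER MATERIALIZED VIEW mv1 REBUILD;

2. Allow a stale materialized view to be used for rewriting within a time window by setting the rewriting time
window property.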
SET hive.materializedview.rewriting.time.window=10min;
Assume the depts table contains the following deptno, deptname, and locationid data:
100 HR 10
101 Eng 11
200 Sup 20
In this task, you create two materialized views: one partitions data on department; the other partitions data on hire
date. You select data, filtered by department, from the original table, not from either one of the materialized views.
The explain plan shows that Hive rewrites your query for efficiency to select data from the materialized view that
partitions data by department. In this task, you also see the effects of rebuilding a materialized view.
Procedure
1. Create a materialized view of the emps table that partitions data into departments.
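A sketch, assuming the emps table above:

CREATE MATERIALIZED VIEW partition_mv_1 PARTITIONED ON (deptno)
AS SELECT hire_date, deptno FROM emps WHERE deptno > 100 AND deptno < 200;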
2. Create a second materialized view that partitions the data on the hire date instead of the department number.
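A sketch:

CREATE MATERIALIZED VIEW partition_mv_2 PARTITIONED ON (hire_date)
AS SELECT deptno, hire_date FROM emps WHERE deptno > 100 AND deptno < 200;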
3. Generate an extended explain plan by selecting data for department 101 directly from the emps table without
using the materialized view.
EXPLAIN EXTENDED SELECT deptno, hire_date FROM emps where deptno = 101;
The explain plan shows that Hive rewrites your query for efficiency, using the better of the two materialized views
for the job: partition_mv_1.
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| OPTIMIZED SQL: SELECT CAST(101 AS INTEGER) AS `deptno`, `hire_date` |
| FROM `default`.`partition_mv_1` |
| WHERE 101 = `deptno` |
| STAGE DEPENDENCIES: |
| Stage-0 is a root stage
...
4. Correct Jane Doe's hire date to February 12, 2018, rebuild one of the materialized views, but not the other, and
compare contents of both materialized views.
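A sketch of the statements, assuming hypothetical employee values:

INSERT INTO emps VALUES (10001, 101, 'jane doe', 250000, '2018-02-12');
ALTER MATERIALIZED VIEW partition_mv_1 REBUILD;
SELECT * FROM partition_mv_1 WHERE deptno = 101;
SELECT * FROM partition_mv_2 WHERE deptno = 101;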
The output of selecting the rebuilt partition_mv_1 includes the original row and newly inserted row because
INSERT does not perform in-place updates (overwrites).
+---------------------------+------------------------+
| partition_mv_1.hire_date | partition_mv_1.deptno |
+---------------------------+------------------------+
| 2018-01-10 00:00:00.0 | 101 |
| 2018-02-12 00:00:00.0 | 101 |
+---------------------------+------------------------+
The output from the other partition is stale because you did not rebuild it:
+------------------------+---------------------------+
| partition_mv_2.deptno | partition_mv_2.hire_date |
+------------------------+---------------------------+
| 101 | 2018-01-10 00:00:00.0 |
+------------------------+---------------------------+
5. Create a second employees table and a materialized view of the tables joined on the department number.
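A sketch:

CREATE TABLE emps2 AS SELECT * FROM emps;
CREATE MATERIALIZED VIEW partition_mv_3 PARTITIONED ON (deptno)
AS SELECT emps.hire_date, emps.deptno
FROM emps JOIN emps2 ON (emps.deptno = emps2.deptno)
WHERE emps.deptno > 100 AND emps.deptno < 200;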
6. Generate an explain plan that joins tables emps and emps2 on department number using a query that omits the
partitioned materialized view.
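A sketch of such a plan request:

EXPLAIN EXTENDED SELECT emps.hire_date, emps.deptno
FROM emps JOIN emps2 ON (emps.deptno = emps2.deptno)
WHERE emps.deptno > 100 AND emps.deptno < 200;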
The output shows that Hive rewrites the query to use the partitioned materialized view partition_mv_3 even
though your query omitted the materialized view.
7. Verify that partition_mv_3 sets up the partition for deptno=101.
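For example:

SHOW PARTITIONS partition_mv_3;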
Output is:
+-------------+
| partition |
+-------------+
| deptno=101 |
+-------------+
Related Information
Creating and using a materialized view
Materialized view commands
CDW stored procedures
HPL/SQL architecture
HPL/SQL has been re-architected from a command line tool to an integrated part of HiveServer (HS2). From a JDBC
client, such as Beeline, you connect to HiveServer through CDW. The interpreter executes the abstract syntax tree
(AST) from the parser. Hive metastore securely stores the function and procedure code permanently. The procedure
is loaded and cached on demand to the interpreter's memory when needed. You can close the session or restart Hive
without losing the definitions.
You can enable and use HPL/SQL from any host or third-party tool that can make a JDBC connection to HiveServer.
Beeline is a popular client for use with HPL/SQL because other third-party tools do not show you some of the error
messages about syntax mistakes.
When the client connects to HiveServer in CDW using this mode, HPL/SQL is enabled; otherwise, HPL/SQL is
disabled.
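A sketch of such a connection string, with a hypothetical host; the mode=hplsql parameter selects HPL/SQL:

beeline -u "jdbc:hive2://myhiveserver.com:10000/default;mode=hplsql" -n hive -p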
HPL/SQL limitations
• Some of the Hive specific CREATE TABLE parameters are missing.
• No colon syntax to parametrize SQL strings.
• No quoted string literals.
• No GOTO and Label.
• EXECUTE does not have output parameters.
• Some complex data types, such as Arrays and Records, are not supported.
• No object-oriented extension.
• Data Analytics Studio (DAS) does not support HPL/SQL.
Creating a function
You need to know the syntax of HPL/SQL, which closely resembles Oracle’s PL/SQL. An example of creating a
function and calling it in a Hive SELECT statement demonstrates the HPL/SQL basics.
Procedure
1. Create a table and populate it with some numbers.
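A sketch, assuming ten values:

CREATE TABLE numbers (n INT);
INSERT INTO numbers VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);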
2. Create a fizzbuzz function that returns BUZZ for multiples of 5 and FIZZ for multiples of 3:

create function fizzbuzz(n int) returns string
begin
  if mod(n, 5) == 0 then
    return 'BUZZ';
  elseif mod(n, 3) == 0 then
    return 'FIZZ';
  else
    return n;
  end if;
end;
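3. Call the function in a SELECT statement:

SELECT fizzbuzz(n) FROM numbers;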
...
1
2
FIZZ
4
BUZZ
1
2
FIZZ
4
BUZZ
FIZZ
7
8
FIZZ
BUZZ
Procedure
Use an OPEN-FOR statement to select the number in each row of the numbers table, iteratively FETCH the numerical
content of each row INTO cursor variable num, close the cursor, and print the results.
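A sketch of such a block, assuming the fizzbuzz function and numbers table above and HPL/SQL's MySQL-style
WHILE syntax:

declare
  num int;
  result string := '';
  cur sys_refcursor;
begin
  open cur for select n from numbers;
  fetch cur into num;
  while (sql%found) do
    result = result || fizzbuzz(num) || ' ';
    fetch cur into num;
  end while;
  close cur;
  print result;
end;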
...
1 2 FIZZ 4 BUZZ 1 2 FIZZ 4 BUZZ FIZZ 7 8 FIZZ BUZZ
No rows affected
Greeting function
The following example creates a function that takes the input of your name and returns "hello <name>":
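A sketch of such a function:

create function hello(name string) returns string
begin
  return 'hello ' || name;
end;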
The test_even procedure below calls the even procedure above, passing the cursor of type SYS_REFCURSOR to
fetch each row containing an even number.
END;
BULK COLLECT
Using BULK COLLECT, you can retrieve multiple rows in a single fetch quickly.
Using JdbcStorageHandler to query RDBMS

Procedure
1. Load data into a supported SQL database, such as MySQL, on a node in your cluster, or familiarize yourself with
existing data in your database.
2. Create an external table using the JdbcStorageHandler and table properties that specify the minimum information:
database type, driver, database connection string, user name and password for querying hive, table name, and
number of active connections to Hive.
A sketch of the statement, with a hypothetical Hive-side table name and columns:

CREATE EXTERNAL TABLE mytable_jdbc (
col1 string,
col2 int,
col3 double
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://localhost/sample",
"hive.sql.dbcp.username" = "hive",
"hive.sql.dbcp.password" = "hive",
"hive.sql.table" = "MYTABLE",
"hive.sql.dbcp.maxActive" = "1"
);
Related Information
Apache Wiki: JdbcStorageHandler