Apache HIVE


What Is It?
Apache Hive is an open source data warehouse system built on top of Hadoop, used for querying and
analyzing large datasets stored in Hadoop files.
Hive provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL) for
querying data stored in a Hadoop cluster.
SQL is an effective, reasonably intuitive model for organizing and using data. Mapping these familiar data
operations to the low-level MapReduce Java API can be daunting, even for experienced Java developers.
Hive does this dirty work for you, so you can focus on the query itself. Hive translates most queries to
MapReduce jobs, thereby exploiting the scalability of Hadoop, while presenting a familiar SQL
abstraction.
Hive is most suited for data warehouse applications, where relatively static data is analyzed, fast
response times are not required, and when the data is not changing rapidly.
Why Do We Need It?
As the volume of data on the internet kept growing from the mid-1990s onward, eventually reaching
petabytes and exabytes, the IT industry started facing problems in processing such huge amounts of data,
which gave rise to the term Big Data. This data includes structured, unstructured, and semi-structured
data coming from various sources such as databases, servers, and sensors.
In 2005, Doug Cutting and Mike Cafarella created Hadoop, a distributed processing framework that
uses MapReduce to process huge amounts of data, to support distribution for the Nutch search engine
project. Hadoop was later donated to Apache and is now an Apache project sponsored by the Apache
Software Foundation.
However, a challenge remains; how do you move an existing data infrastructure to Hadoop, when that
infrastructure is based on traditional relational databases and the Structured Query Language (SQL)?
What about the large base of SQL users, both expert database designers and administrators, as well as
casual users who use SQL to extract information from their data warehouses?
This is where Hive comes in. Hive was developed by Facebook and later donated to Apache; it is
now an Apache project sponsored by the Apache Software Foundation.
Hive provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL) for
querying data stored in a Hadoop cluster.
SQL knowledge is widespread for a reason; it’s an effective, reasonably intuitive model for organizing and
using data. Mapping these familiar data operations to the low-level MapReduce Java API can be daunting,
even for experienced Java developers. Hive does this dirty work for you, so you can focus on the query
itself. Hive translates most queries to MapReduce jobs, thereby exploiting the scalability of Hadoop, while
presenting a familiar SQL abstraction.
Characteristics of Hive
In Hive, tables and databases are created first and then data is loaded into these tables.
As a data warehouse, Hive is designed for managing and querying only structured data that is stored in tables.
When dealing with structured data, plain MapReduce lacks optimization and usability features such as
UDFs, whereas the Hive framework provides them. Query optimization refers to executing a query in a way
that is efficient in terms of performance.
Hive's SQL-inspired language separates the user from the complexity of Map Reduce programming. It
reuses familiar concepts from the relational database world, such as tables, rows, columns and schema,
etc. for ease of learning.
Hadoop's programming works on flat files. So, Hive can use directory structures to "partition" data to
improve performance on certain queries.
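To make the partitioning idea concrete, here is a minimal Python sketch (not Hive code) of how a directory-per-partition layout lets a query skip irrelevant data. The `country=...` directory naming mirrors the key=value convention Hive uses for partition directories; the sample rows, the `data.txt` file name, and the helpers `write_partitioned` and `scan_partition` are illustrative assumptions.

```python
import os
import tempfile

def write_partitioned(root, rows):
    """Write each row into a directory named after its partition value."""
    for country, line in rows:
        part_dir = os.path.join(root, f"country={country}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "data.txt"), "a") as f:
            f.write(line + "\n")

def scan_partition(root, country):
    """Read only the one directory that matches the filter value."""
    part_dir = os.path.join(root, f"country={country}")
    if not os.path.isdir(part_dir):
        return []
    with open(os.path.join(part_dir, "data.txt")) as f:
        return [line.rstrip("\n") for line in f]

# Hypothetical table directory and rows, used only for illustration.
root = tempfile.mkdtemp(prefix="sales_")
write_partitioned(root, [("US", "1,book"), ("IN", "2,pen"), ("US", "3,lamp")])

us_rows = scan_partition(root, "US")
print(sorted(os.listdir(root)))
print(us_rows)
```

A filter on the partition column only touches one directory, so the files for other countries are never opened. That is the performance gain the paragraph above refers to.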
An important component of Hive is the Metastore, which is used for storing schema information. The
Metastore typically resides in a relational database.

Himanshu Sekhar Paul Apache HIVE |1


We can interact with Hive using methods like:
• Web GUI
• Java Database Connectivity (JDBC) interface
Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI for writing
Hive queries using Hive Query Language (HQL).
Generally, HQL syntax is similar to the SQL syntax that most data analysts are familiar with. The sample
query below displays all the records present in the named table.
Sample query: SELECT * FROM <TableName>;
The file formats Hive supports include TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar
File).
For single-user metadata storage, Hive uses the Derby database; for multi-user or shared
metadata, Hive is typically configured to use MySQL.

Difference between HIVE and RDBMS

1. Schema enforcement
HIVE: Hive enforces the schema at read time. It doesn’t verify the data when it is loaded, but rather
when it is retrieved. This is called schema on read.
RDBMS: A table’s schema is enforced at data load time. If the data being loaded doesn’t conform to the
schema, it is rejected. This design is called schema on write.

2. Load cost
HIVE: Schema on read makes for a very fast initial load, since the data does not have to be read, parsed,
and serialized to disk in the database’s internal format. The load operation is just a file copy or move.
RDBMS: Schema on write makes the initial load slower, since the data has to be read, parsed, and
serialized to disk in the database’s internal format.

3. Access pattern
HIVE: Hive is based on the notion of write once, read many times.
RDBMS: An RDBMS is designed for reading and writing many times.

4. Record-level operations
HIVE: Hive does not support record-level updates, insertions, and deletes, because it stores data in
HDFS and HDFS does not allow the contents of an existing file to be changed.
RDBMS: Record-level updates, insertions and deletes, transactions, and indexes are possible.

5. Data size
HIVE: Hive can process hundreds of petabytes of data.
RDBMS: The maximum practical data size is in the tens of terabytes.

6. Workload
HIVE: As Hadoop is a batch-oriented system, Hive doesn’t support OLTP (Online Transaction Processing).
It is closer to OLAP (Online Analytical Processing), but not ideal, since there is significant latency
between issuing a query and receiving a reply, due to the overhead of MapReduce jobs and the size of the
data sets Hadoop was designed to serve. Hive is suited for data warehouse applications, where relatively
static data is analyzed and fast response times are not required. [To overcome these limitations, HBase
is being integrated with Hive to support record-level operations and OLAP.]
RDBMS: An RDBMS is best suited for dynamic data analysis where fast responses are expected.

7. Scalability
HIVE: Hive uses HDFS as its storage system, and as HDFS is both horizontally and vertically scalable,
Hive is also easily scalable at low cost.
RDBMS: An RDBMS is less scalable, and scaling it up is very costly.

8. Built-in functions
HIVE: Hive supports fewer built-in functions and operators than an RDBMS.
RDBMS: Most RDBMS systems are rich in built-in functions and operators.

Limitations of Hive
Hive is not a full database, so it cannot completely replace a relational database.
The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do. The
biggest limitation is that Hive does not provide record-level update, insert, nor delete. You can generate
new tables from queries or output query results to files.
Also, because Hadoop is a batch-oriented system, Hive queries have higher latency, due to the start-up
overhead for MapReduce jobs. Queries that would finish in seconds for a traditional relational database
take longer for Hive, even for relatively small data sets.
Finally, Hive does not provide transactions. So, Hive doesn’t provide crucial features required for OLTP,
Online Transaction Processing. It’s closer to being an OLAP tool, Online Analytic Processing, but as we’ll
see, Hive isn’t ideal for satisfying the “online” part of OLAP, at least today, since there can be significant
latency between issuing a query and receiving a reply, both due to the overhead of Hadoop and due to the
size of the data sets Hadoop was designed to serve.
If you need OLTP features for large-scale data, you should consider using a NoSQL database. Examples
include HBase, a NoSQL database integrated with Hadoop.
Hive has a limited number of built-in functions.
Not all standard SQL is supported.
When to use Hive
If you have large (think terabytes/petabytes) datasets to query: Hive is designed specifically for
analytics on large datasets and works well for a range of complex queries. Hive is the most approachable
way to quickly (relatively) query and inspect datasets already stored in Hadoop.
If extensibility is important: Hive has a range of user function APIs that can be used to build custom
behavior into the query engine.

When to use RDBMS


If performance is key: If you need to pull data frequently and quickly, such as to support an application
that uses online analytical processing (OLAP), MySQL performs much better. Hive isn’t designed to be an
online transactional platform, and thus performs much more slowly than MySQL.
If your datasets are relatively small (gigabytes): Hive works very well in large datasets, but MySQL
performs much better with smaller datasets and can be optimized in a range of ways.
If you need to update and modify a large number of records frequently: MySQL does this kind of
activity all day long. Hive, on the other hand, doesn’t really do this well (or at all, depending). And if you
need an interactive experience, use MySQL.

What is Hive best suited for?


Hive is best suited for data warehouse applications, where a large data set is maintained and mined for
insights, reports, etc. Hive queries carry the start-up overhead of MapReduce jobs. However, for the big
data sets Hive is designed for, this start-up overhead is trivial compared to the actual processing time.

Hive Architecture



The architecture diagram shows the major components of Apache Hive:

Hive Clients – Apache Hive supports applications written in languages such as C++, Java, and Python, using
the JDBC, Thrift, and ODBC drivers. Thus, one can write a Hive client application in a language of
their choice.
Hive Services – Hive provides various services, such as the web interface and the CLI, to perform queries.
Processing framework and Resource Management – Hive internally uses the Hadoop MapReduce
framework to execute queries.
Distributed Storage – Because Hive is built on top of Hadoop, it uses the underlying
HDFS for distributed storage.

Hive Clients
Hive provides different drivers for communicating with different types of applications, and so supports
different types of client applications for performing queries. These clients fall into three categories:
Thrift Clients – Because the Apache Hive server is based on Thrift, it can serve requests from any
language that supports Thrift.
JDBC Clients – Apache Hive allows Java applications to connect to it using the JDBC driver, which is
defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC Clients – The ODBC driver allows applications that support the ODBC protocol to connect to Hive.
Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
Hive Services
Clients interact with Hive through Hive services: any client that wants to perform a query-related
operation on Hive has to communicate through them. The drivers on the client side communicate with the
Hive Server, and the Hive Server passes requests to the main Hive Driver. The driver processes requests
coming from the different applications and forwards them to the metastore and the file system for further
processing.
Apache Hive provides various services, as shown in the architecture diagram. Let us look at each in detail:
a) CLI (Command Line Interface) – This is the default shell that Hive provides, in which you can execute
your Hive queries and commands directly.
b) Web Interface – Hive also provides a web-based GUI for executing Hive queries and commands.
c) Hive Server – It is built on Apache Thrift and is therefore also called the Thrift server. It allows
different clients to submit requests to Hive and retrieve the final result.
d) Hive Driver – The driver is responsible for receiving the queries that a Hive client submits through the
Thrift, JDBC, ODBC, CLI, or Web UI interfaces. The Hive Driver contains the following components:
I. Compiler – The driver passes the query to the compiler, where parsing, type checking, and
semantic analysis take place with the help of the schema present in the metastore.
II. Optimizer – It generates the optimized logical plan in the form of a DAG (Directed Acyclic Graph) of
MapReduce and HDFS tasks.
III. Executor – Once compilation and optimization are complete, the execution engine executes these tasks
in the order of their dependencies using Hadoop. Hive supports three main execution engines: MapReduce,
Tez, and Spark. Only one execution engine can be set at a time, using the hive.execution.engine
parameter in the hive-site.xml file.
e) Metastore – The metastore is the central repository of Apache Hive metadata in the Hive architecture. It
stores metadata for Hive tables (such as their schema and location) and partitions in a relational database.
It provides client access to this information through the metastore service API. The default metastore
database for Hive is Derby, but it can be reconfigured to use MySQL. The Hive metastore consists of two
fundamental units:
• A service that provides metastore access to other Apache Hive services.
• Disk storage for the Hive metadata, which is separate from HDFS storage.
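The Executor under item d) runs the tasks of the plan in the order of their dependencies. The following Python sketch illustrates that idea with a topological sort. It is not Hive's scheduler; the task names and the `plan` dictionary are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
plan = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}

def execution_order(dag):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(plan)
print(order)
```

The two scans have no dependencies, so either may come first; the join always follows both, and the final write always comes last.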

Hive Storage and Computing



Hive services such as the metastore, file system, and job client in turn communicate with Hive storage and
perform the following actions:
Metadata for tables created in Hive is stored in the Hive metastore database.
Query results and data loaded into tables are stored on HDFS in the Hadoop cluster.

Hive Work Flow

Step 1: Execute Query - A Hive interface such as the command line or Web UI sends the query to the driver
(any database driver such as JDBC, ODBC, etc.) to execute.
Step 2: Get Plan - The driver asks the query compiler to parse the query, check its syntax, and work out
the query plan, i.e. what the query requires.
Step 3: Get Metadata - The compiler sends a metadata request to the metastore (any database).
Step 4: Send Metadata - The metastore sends the metadata as a response to the compiler.
Step 5: Send Plan - The compiler checks the requirements and sends the plan back to the driver. At this
point, the parsing and compiling of the query is complete.
Step 6: Execute Plan - The driver sends the execution plan to the execution engine.
Step 7: Execute Job - Internally, the execution job is a MapReduce job. The execution engine
sends the job to the JobTracker, which runs on the name node, and it assigns the job to TaskTrackers,
which run on the data nodes. Here, the query executes as a MapReduce job.
Step 7.1: Metadata Ops - During execution, the execution engine can also perform metadata operations
with the metastore.
Step 8: Fetch Result - The execution engine receives the results from the data nodes.
Step 9: Send Results - The execution engine sends those results to the driver.
Step 10: Send Results - The driver sends the results to the Hive interface.
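The steps above can be sketched as a toy pipeline in Python. This is only an illustration of the control flow between driver, compiler, metastore, and execution engine; the `METASTORE` and `DATA` dictionaries, the function names, and the sample table are all hypothetical stand-ins, and no real MapReduce job is launched.

```python
# Hypothetical stand-ins for the Hive components named in the steps above.
METASTORE = {"employees": ["id", "name", "salary"]}           # table -> columns
DATA = {"employees": [(1, "Asha", 50000), (2, "Ravi", 42000)]}

def compile_query(table, columns):
    """Steps 2-5: look up metadata, validate the query, return a plan."""
    schema = METASTORE.get(table)                 # steps 3-4: get metadata
    if schema is None:
        raise ValueError(f"unknown table: {table}")
    missing = [c for c in columns if c not in schema]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    return {"table": table, "indexes": [schema.index(c) for c in columns]}

def execute_plan(plan):
    """Steps 6-9: run the plan against the stored rows and collect results."""
    rows = DATA[plan["table"]]
    return [tuple(row[i] for i in plan["indexes"]) for row in rows]

def driver(table, columns):
    """Steps 1 and 10: accept a query and hand the results back."""
    plan = compile_query(table, columns)
    return execute_plan(plan)

result = driver("employees", ["name", "salary"])
print(result)
```

Note that the compiler rejects a query that names an unknown table or column before any execution starts, which matches the point in Step 5 where parsing and compiling finish before the plan is run.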

More on Step 7
The execution engine in turn communicates with Hadoop daemons such as the NameNode, DataNodes, and
JobTracker to execute the query on top of the Hadoop file system.
The execution engine first contacts the NameNode to get the location of the desired table's data (i.e.
file system metadata); the data itself resides only on the DataNodes.
The execution engine then fetches the actual data from the DataNodes.
At the same time, the execution engine communicates bidirectionally with the Hive metastore to perform
DDL operations.
The metastore stores only information about database names, table names, column names, column
properties, and table properties.
Different modes of Hive
Hive can operate in two modes, depending on the number of data nodes in Hadoop:
1. Local mode
2. MapReduce mode
When to use local mode:
• If Hadoop is installed in pseudo-distributed mode with a single data node, we use Hive in this mode.
• If the data is small enough to fit on a single local machine, we can use this mode.
• Processing will be very fast on smaller data sets present on the local machine.
When to use MapReduce mode:
• If Hadoop has multiple data nodes and the data is distributed across different nodes, we use Hive in
this mode.
• Queries run in parallel over large data sets.
• Processing of large data sets with better performance can be achieved through this mode.
Hive has a property that selects the mode it works in. By default, Hive works in MapReduce mode; for
local mode, use the following setting:
SET mapred.job.tracker=local;

From version 0.7, Hive also supports running MapReduce jobs in local mode automatically.

Hive Server1, Hive CLI, Hive Server2, Beeline


HiveServer1
HiveServer1 (the first version of the Hive server) provides a client-server architecture. It allows users to
connect to the Hive service remotely using the Hive CLI and Thrift clients. It supports remote client
connections, but only one client can connect at a time.

[Diagram: Hive CLI -> HiveServer1 (Hive Driver phase: compilation and optimization) -> Metastore database]

Limitations of HiveServer1
It does not provide session management support.
Because of the Thrift API, it offers no concurrency control.
Hive CLI
The Hive CLI is the default Hive shell, shown by the ‘hive>’ prompt. It can be started from
$HIVE_HOME/bin/hive.
It supports HQL (Hive Query Language) queries and is simple to use.
It supports MapReduce, as well as custom mappers and reducers with UDFs.
Limitations of Hive CLI
It supports a single user at a time, and no authentication support is provided.
Hive Server2
HiveServer2 was introduced to overcome the problems of HiveServer1. It is also a client-server model.
Unlike HiveServer1, it allows many different clients to connect at the same time (multi-client
concurrency).
HiveServer2 provides much better authentication, using Kerberos.
It also supports JDBC and ODBC connections.



HiveServer2 has its own CLI called Beeline, which is a JDBC client based on SQLLine.
BeeLine
Beeline is a command line interface for HiveServer2, based on the SQLLine CLI. Its shell prompt appears as
“beeline>”.
It gives better support for JDBC/ODBC than was available with HiveServer1.
It works in both embedded mode and remote mode.
In embedded mode, Beeline runs an embedded Hive (similar to the Hive CLI), whereas remote mode is for
connecting to a separate HiveServer2 process over Thrift.

Special Points
As part of metadata management, Hive stores information about tables, their columns, and schema and
partition information in a structured format in a relational database.
The default metadata store (i.e. metastore) in Hive is the Derby database. You can change the configuration
to save all your Hive metadata in any JDBC-supported database.
The most popular databases for storing metadata are MySQL and PostgreSQL.
This metadata is stored in a relational database because keeping it at the file level would hurt Hive's
performance: loading it from files takes more time than reading it from relational storage, which is why
it is kept in a separate database.

Schema-on-Read vs Schema-on-Write

Schema on Write
In a traditional database, before any data is written to a table, the structure of that data is
strictly defined during table creation, and the metadata of the table is stored and tracked. That metadata is
called the schema: data types, lengths, and positions are all delineated. When data is inserted into the
table, its structure is strictly checked against the schema of the table. If the structure of the data does
not conform to the structure of the table, the data is rejected. This process of checking the structure (or
schema) of data against the schema of the table during the write operation is called Schema on Write.
In schema on write, the speed of query processing and the structure of the data matter more than the time
required for loading data.
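As a minimal illustration of schema on write, the Python sketch below validates every row against a schema before storing it and rejects rows that do not conform. The three-column schema, the sample rows, and the `load_row` helper are assumptions made for the example; a real RDBMS performs far richer checks.

```python
# Illustrative schema: column name and the type each value must convert to.
schema = [("id", int), ("name", str), ("salary", int)]

def load_row(table, line, schema):
    """Validate one row against the schema before it is stored."""
    fields = line.split(",")
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(fields)}")
    # int() raises ValueError for a value that does not match the column type,
    # so a bad row is never appended to the table.
    table.append(tuple(cast(value) for (_, cast), value in zip(schema, fields)))

table, rejected = [], []
for line in ["1,Asha,50000", "2,Ravi,not_a_number", "3,Meena"]:
    try:
        load_row(table, line, schema)
    except ValueError:
        rejected.append(line)

print(table)
print(rejected)
```

The validation cost is paid once per row at load time, and everything that ends up in `table` is guaranteed to match the schema.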

Advantages of Schema on Write


Because the schema is checked during the write operation, whatever data is present is in an
organized, structured format.
As the data is well structured, you can write simple SQL queries and get back answers very quickly.
Disadvantages of Schema on Write
As the schema is imposed on data during load, it takes a long time to load bulk amounts of data. There
is always a time cost to imposing a schema on data. In schema on write strategies, that time cost is paid in
the data loading stage.
Another problem with schema on write data store is that the data has been altered and structured
specifically to serve a specific purpose. Chances are high that, if another purpose is found for that data,
the data store will not suit it well. All the speed that you got from customizing the data structure to match
a specific problem set will cost you if you try to use it for a different problem set. And there’s no
guarantee that the altered version of the data will even be useful at all for the new, unanticipated need.
There’s no ability to query the data in its original form, and certainly no ability to query any other data
set that isn’t in the structured format.



Schema on Read
In Hadoop, where huge amounts of data are involved, the data is stored in HDFS first. Data could be of many
types, sizes, shapes and structures. While some metadata, data about that data, needs to be stored, so that
you know what’s in there, you don’t yet know how it will be structured. It is entirely possible that data
stored for one purpose might even be used for a completely different purpose than originally intended.
The data is stored without first deciding what piece of information will be important, what should be
used as a unique identifier, or what part of the data needs to be summed and aggregated to be useful.
Therefore, the data is stored in its original granular form, with nothing thrown away because it is
unimportant, nothing consolidated into a composite, and nothing defined as key information.
When someone is ready to use that data, then, at that time, they define what pieces are essential to their
purpose. They define where to find those pieces of information that matter for that purpose, and which
pieces of the data set to ignore.
So the schema is checked when the data is being read. This is called Schema on Read.
In schema on read, the size of the data and the time to load it matter more than the types of data.
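The following Python sketch illustrates schema on read under simple assumptions: raw lines are kept exactly as they arrived, and a schema is applied only when a reader asks for the data. Fields that do not conform become `None`, comparable to the NULL values Hive returns for fields it cannot interpret. The schema, the sample lines, and the `read_with_schema` helper are illustrative, not Hive internals.

```python
# Raw lines are stored untouched; no validation happens at load time.
raw_lines = ["1,Asha,50000", "2,Ravi,not_a_number", "3,Meena"]

# The schema is supplied only when the data is read (names are illustrative).
schema = [("id", int), ("name", str), ("salary", int)]

def read_with_schema(line, schema):
    """Apply the schema to one raw line; non-conforming fields become None."""
    fields = line.split(",")
    record = {}
    for pos, (name, cast) in enumerate(schema):
        try:
            record[name] = cast(fields[pos])
        except (IndexError, ValueError):
            record[name] = None   # comparable to a NULL in the query result
    return record

records = [read_with_schema(line, schema) for line in raw_lines]
for rec in records:
    print(rec)
```

Nothing is rejected at load time, so loading is cheap, but every reader has to cope with missing or malformed values. A different reader could apply a different schema to the same raw lines.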

Advantages of Schema on Read


Because your data is stored in its original form, nothing is discarded, or altered for a specific purpose.
This means that your query capabilities are very flexible. You can ask any question that the original data
set might hold answers for, not just the type of questions a data store was originally created to answer.
You have the flexibility to ask things you hadn’t even thought of when the data was stored.
Also, different types of data generated by different sources can be stored in the same place. This allows
you to query multiple data stores and types at once. If the answer you need isn’t in the data you originally
thought it would be in, perhaps it could be found if you combined it with other data sources. The power
of this ability should not be underestimated. This is what makes the Hadoop data lake concept, which puts
all your available data sets in their original form in a single location, such a potent one.

Disadvantages of Schema on Read


The main disadvantages of schema on read are inaccuracies and slow query speed.
Since the data is not subjected to rigorous ETL and data cleansing processes, nor does it pass through any
validation, that data may be riddled with missing or invalid data, duplicates and a bunch of other
problems that may lead to inaccurate or incomplete query results.
In addition, since the structure must be defined when the data is queried, the SQL queries tend to be very
complex. They take time to write, and even more time to execute.

Different modes of Metastore


There are three modes for Hive Metastore deployment:
Embedded Metastore
Local Metastore
Remote Metastore

Embedded Metastore
By default, the metastore service runs in the same JVM as the Hive service and uses an embedded Derby
database stored on the local file system. The limitation of this mode is that only one
embedded Derby instance can access the database files on disk at any one time, so only one Hive session
can be open at a time.

[Diagram: embedded metastore. Labels: JDBC, Driver, Metastore, Derby DB]



If we try to start a second session, it fails with an error when it attempts to open a connection to the
metastore. To allow several services to connect to the metastore, Derby can instead be configured as a
network server. Embedded mode is good for unit testing, but it is not suitable for practical deployments.

Local Metastore
Hive is a data-warehousing framework, so a single session is rarely enough. To overcome this limitation
of the embedded metastore, the local metastore was introduced. This mode allows many Hive
sessions, i.e. many users can use the metastore at the same time. We achieve this by using any
JDBC-compliant database, such as MySQL, which runs in a separate JVM or on a different machine from the
Hive service and metastore service, which still run together in the same JVM.

[Diagram: local metastore. Labels: JDBC, Driver, Metastore (shown twice), MySQL]

This configuration is called a local metastore because the metastore service still runs in the same process
as Hive, but it connects to a database running in a separate process, either on the same machine or on a
remote machine. Before starting the Apache Hive client, add the JDBC driver library to the Hive lib folder.
MySQL is a popular choice for the standalone metastore. In this case, the javax.jdo.option.ConnectionURL
property is set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and
javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. The JDBC driver JAR file for
MySQL (Connector/J) must be on Hive’s classpath, which is achieved by placing it in Hive’s lib directory.

Remote Metastore
Another metastore configuration is the remote metastore. In this mode, the metastore
runs in its own separate JVM, not in the Hive service JVM. Other processes communicate with the
metastore server using the Thrift network API. We can also run one or more additional metastore
servers in this mode to provide higher availability. This also brings better manageability and security,
because the database tier can be completely firewalled off, and clients no longer need to share database
credentials with each Hive user to access the metastore database.

[Diagram: remote metastore. Labels: JDBC, Driver, Metastore JVM service (shown twice), MySQL]

To use this remote metastore, you should configure Hive service by setting hive.metastore.uris to the
metastore server URI(s). Metastore server URIs are of the form thrift://host:port, where the port
corresponds to the one set by METASTORE_PORT when starting the metastore server.
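To show the thrift://host:port form concretely, here is a small Python sketch that splits a comma-separated hive.metastore.uris value into host and port pairs. The host names are hypothetical, the port 9083 is used only as an example value, and `parse_metastore_uris` is an illustrative helper rather than part of Hive.

```python
from urllib.parse import urlparse

def parse_metastore_uris(value):
    """Split a comma-separated hive.metastore.uris value into (host, port) pairs."""
    endpoints = []
    for uri in value.split(","):
        parsed = urlparse(uri.strip())
        if parsed.scheme != "thrift" or parsed.hostname is None or parsed.port is None:
            raise ValueError(f"expected thrift://host:port, got {uri!r}")
        endpoints.append((parsed.hostname, parsed.port))
    return endpoints

# Hypothetical host names; 9083 is used here only as an example port.
uris = "thrift://metastore1.example.com:9083, thrift://metastore2.example.com:9083"
endpoints = parse_metastore_uris(uris)
print(endpoints)
```

Listing more than one URI corresponds to running several metastore servers for availability, as described in the Remote Metastore section.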

What Is Inside Hive?


The core of a Hive binary distribution contains three parts.
The main part is the Java code itself. Multiple JAR (Java archive) files such as hive-exec*.jar and
hive-metastore*.jar are found under the $HIVE_HOME/lib directory. Each JAR file implements a particular
subset of Hive’s functionality.



The $HIVE_HOME/bin directory contains executable scripts that launch various Hive services, including
the hive command-line interface (CLI).
The $HIVE_HOME/conf directory contains the files that configure Hive. Hive has a number of
configuration properties that we will discuss as needed. These properties control features such as the
metastore (where metadata is stored), various optimizations, and “safety controls”.

Steps To Start Hive


As Hive works on top of Hadoop, both HDFS and MapReduce must be started before starting Hive.
• To start HDFS:
  - Go to $HADOOP_HOME/bin
  - ./start-dfs.sh
• To start MapReduce:
  - Go to $HADOOP_HOME/bin
  - ./start-mapred.sh
• Or start both HDFS and MapReduce with a single command:
  - Go to $HADOOP_HOME/bin
  - ./start-all.sh
• To start the Hive metastore:
  - Go to $HIVE_HOME/bin
  - ./hive --service metastore &
  - Here the ‘&’ operator is the Linux shell option to run a process in the background.
• To start Hive:
  - Go to $HIVE_HOME/bin
  - ./hive
  - This opens the Hive shell, i.e. “hive>”.

The first line printed by the CLI is the local filesystem location where the CLI writes log data about the
commands and queries you execute. If a command or query is successful, the first line of output will be OK,
followed by the output, and finished by the line showing the amount of time taken to run the command or
query.

The Hive Command


The $HIVE_HOME/bin/hive shell command, which we’ll simply refer to as hive from now on, is the
gateway to Hive services, including the command-line interface or CLI.
Command Options
If you run the following command, you’ll see a brief list of the options for the hive command.
$ bin/hive --help
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: cli help hiveserver hwi jar lineage metastore rcfilecat
Parameters parsed:
--auxpath : Auxiliary jars
--config : Hive configuration directory
--service : Starts specific service/component. cli is default
Parameters used:
HADOOP_HOME : Hadoop install directory
HIVE_OPT : Hive options
For help on a particular service:
./hive --service serviceName --help
Debug help: ./hive --debug --help

Service list: There are several services available, including the CLI.

Option | Name | Description
cli | Command-line interface | Used to define tables, run queries, etc. It is the default service if no other service is specified.
hiveserver | Hive Server | A daemon that listens for Thrift connections from other processes.
hwi | Hive Web Interface | A simple web interface for running queries and other commands without logging into a cluster machine and using the CLI.
jar | | An extension of the hadoop jar command for running an application that also requires the Hive environment.
metastore | | Start an external Hive metastore service to support multiple clients.
rcfilecat | | A tool for printing the contents of an RCFile.

The --auxpath option lets you specify a colon-separated list of “auxiliary” Java archive (JAR) files that
contain custom extensions, etc., that you might require.
The --config directory is mostly useful if you have to override the default configuration properties in
$HIVE_HOME/conf in a new directory.
Command Line Interface (CLI): The Hive Shell

The command-line interface or CLI is the most common way to interact with Hive. Using the CLI, you can
create tables, inspect schema and query tables, etc.

CLI Options
The following command shows a brief list of the options for the CLI.
$ hive --help --service cli
usage: hive
-d,--define <key=value> Variable substitution to apply to hive
commands. e.g. -d A=B or --define A=B
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
-h <hostname> connecting to Hive Server on remote host
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable substitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-p <port> connecting to Hive Server on port number
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed sql to the console)
Hive Variables and Properties
Hive has four variable namespaces: hivevar, hiveconf, system, and env.
hivevar: Read/Write (v0.8.0 and later). User-defined custom variables.
hiveconf: Read/Write. Hive-specific configuration properties.
system: Read/Write. Configuration properties defined by Java.
env: Read only. Environment variables defined by the shell environment (e.g., bash).
• --hivevar
Syntax to define a variable:
--define <key>=<value>
or
--hivevar <key>=<value>
The --define key=value option is effectively equivalent to the --hivevar key=value option. Both let you
define on the command line custom variables that you can reference in Hive scripts to customize
execution. This feature is only supported in Hive v0.8.0 and later versions. When you use this feature, Hive
puts the key-value pair in the hivevar “namespace”.
Hive’s variables are internally stored as Java Strings. You can reference variables in queries; Hive replaces
the reference with the variable’s value before sending the query to the query processor.
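The substitution step described above can be mimicked in a few lines of Python. This is only a sketch of the idea, not Hive's own code: the helper name substitute and the sample query are made up for illustration, and the sketch assumes that a reference without a namespace prefix is looked up in hivevar.

```python
import re

# Hypothetical helper: mimics the textual substitution step, not Hive's own code.
_REF = re.compile(r"\$\{(?:(hivevar|hiveconf|system|env):)?([^}]+)\}")

def substitute(query, namespaces):
    """Replace ${namespace:key} references with string values before the query is parsed."""
    def lookup(match):
        namespace = match.group(1) or "hivevar"  # assumption: a bare ${key} means hivevar
        key = match.group(2)
        values = namespaces.get(namespace, {})
        # Leave unknown references untouched so the problem stays visible.
        return values.get(key, match.group(0))
    return _REF.sub(lookup, query)

namespaces = {"hivevar": {"name": "himanshu"}, "env": {"YEAR": "2012"}}
query = "SELECT * FROM mytable WHERE owner = '${hivevar:name}' AND year = ${env:YEAR}"
print(substitute(query, namespaces))
```

Because the values are plain strings pasted into the query text, quoting is the query author's job: the string value needs the surrounding single quotes in the query, while the numeric one does not.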
Inside the CLI, variables are displayed and changed using the SET command.
$ ./hive
hive> set env:HOME;
env:HOME=/home/himanshu
Used without arguments, set prints all the variables in the namespaces hivevar, hiveconf, system, and env.
With the -v option (set -v), it also prints all the properties defined by Hadoop, such as properties controlling
HDFS and MapReduce.
The set command is also used to set new values for variables.
$ ./hive --define name=himanshu
hive> set name;
name=himanshu
hive> set hivevar:name;
hivevar:name=himanshu
hive> set hivevar:name=sekhar;
hive> set name;
name=sekhar
hive> set hivevar:name;
hivevar:name=sekhar

• --hiveconf
It is used for all properties that configure Hive behavior. We’ll use it with a property hive.cli.print.current.db
that was added in Hive v0.8.0. It turns on printing of the current working database name in the CLI prompt.
The default database is named default. This property is false by default:
$ hive --hiveconf hive.cli.print.current.db=true
hive (default)> set hive.cli.print.current.db;
hive.cli.print.current.db=true
hive (default)> set hiveconf:hive.cli.print.current.db;
hiveconf:hive.cli.print.current.db=true
hive (default)> set hiveconf:hive.cli.print.current.db=false;
hive>
We can even add new hiveconf entries, which is the only supported option for Hive versions earlier than
v0.8.0:
$ hive --hiveconf y=5
hive> set y;
y=5
• system and env



It’s also useful to know about the system namespace, which provides read-write access to Java system
properties, and the env namespace, which provides read-only access to environment variables:
hive> set system:user.name;
system:user.name=myusername
hive> set system:user.name=yourusername;
hive> set system:user.name;
system:user.name=yourusername
hive> set env:HOME;
env:HOME=/home/yourusername
hive> set env:HOME=/tmp;
env:* variables can not be set.

Unlike hivevar variables, you have to use the system: or env: prefix with system properties and environment
variables. The env namespace is useful as an alternative way to pass variable definitions to Hive. Use single
quotes around the query so that the shell passes ${env:YEAR} to Hive unchanged instead of trying to expand it:
$ YEAR=2012 hive -e 'SELECT * FROM mytable WHERE year = ${env:YEAR}';

Hive “One Shot” Commands


The user may wish to run one or more queries (semicolon separated) and then have the hive CLI exit
immediately after completion. The CLI accepts a -e command argument that enables this feature:
$ hive -e "SELECT * FROM mytable LIMIT 3";
OK
name1 10
name2 20
name3 30
Time taken: 4.955 seconds
$
Adding the -S option for silent mode removes the OK and Time taken ... lines, as well as other inessential output.
A quick and dirty technique is to use this feature to output the query results to a file:
$ hive -S -e "select * FROM mytable LIMIT 3" > /tmp/myquery
$ cat /tmp/myquery
name1 10
name2 20
name3 30
Note that hive wrote the output to the standard output and the shell command redirected that output to the
local filesystem, not to HDFS.
Suppose you can’t remember the name of the property that specifies the “warehouse” location for
managed tables:
$ hive -S -e "set" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=false

Executing Hive Queries from Files


Hive can execute one or more queries that were saved to a file using the -f file argument. By convention,
saved Hive query files use the .q or .hql extension. The file can be on the local file system or in HDFS.
When the file is in the local file system:
$ hive -f /path/to/file/withqueries.hql
If you are already inside the Hive shell you can use the SOURCE command to execute a script file.
hive> source /path/to/file/withqueries.hql;

When the file is in HDFS:
$ hive -f hdfs://localhost:9000/test.hql
The hdfs:// scheme must be included in the path so that Hive knows the file is in HDFS.

How to add auxiliary Jars in Hive ?


Many times we need to add auxiliary (third-party) jars to the Hive classpath in order to use them. There are
several ways to achieve this:
1. Hive Server Config (hive-site.xml):
Modify your hive-site.xml config and add following property to it.
<property>
<name>hive.aux.jars.path</name>
<value>comma separated list of jar paths</value>
</property>
Example:
<property>
<name>hive.aux.jars.path</name>
<value>/usr/share/dimlookup.jar,/usr/share/serde.jar</value>
</property>
You will need to restart the Hive server so that these properties take effect.
2. Hive CLI --auxpath option:
You can pass a comma-separated list of auxiliary jar paths when launching the hive shell.
Example.
hive --auxpath /usr/share/dimlookup.jar,/usr/share/serde.jar

3. Hive CLI add jar command:
You can add a jar from inside the CLI with the add jar command followed by the path to the jar file.
add jar jar_path;
Example:
add jar /usr/share/serde.jar;
add jar /usr/share/dimlookup.jar;
4. Add in HIVE_AUX_JARS_PATH environment variable:
export HIVE_AUX_JARS_PATH=/usr/share/serde.jar

5. .hiverc:
You can put all your add jar statements in a .hiverc file in your home directory or Hive config directory, so
that they take effect when the Hive CLI is launched.

The .hiverc File


It is a file that is executed when you launch the hive shell, which makes it an ideal place for any Hive
configuration or customization you want applied at the start of the hive shell. This could be:
• Setting column headers to be visible in query results
• Making the current database name part of the hive prompt
• Adding any jars or files
• Registering UDFs
.hiverc file location
• The file is loaded from the hive conf directory.
• If the file does not exist, you can create it.
• It needs to be deployed to every node from where you might launch the Hive shell.
Sample .hiverc
add jar /home/airawat/hadoop-lib/hive-contrib-0.10.0-cdh4.2.0.jar;
set hive.exec.mode.local.auto=true;
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=30000000;

-i option
The -i file option lets you specify a file of commands for the CLI to run as it starts, before showing you the
prompt. Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands
it contains, if any.
If the CLI is invoked without the -i option, then Hive will attempt to load $HIVE_HOME/bin/.hiverc and
$HOME/.hiverc as initialization files.
Example:
$ hive -i /home/user/hive-init.sql

Autocomplete
If you start typing and hit the Tab key, the CLI will autocomplete possible keywords and function names. For
example, if you type SELE and then the Tab key, the CLI will complete the word SELECT. If you type the Tab
key at the prompt, you’ll get this reply:
hive> Display all 407 possibilities? (y or n)
If you enter y, you’ll get a long list of all the keywords and built-in functions.

Command History
You can use the up and down arrow keys to scroll through previous commands. Actually, each previous line
of input is shown separately; the CLI does not combine multiline commands and queries into a single history
entry. Hive saves the last 100,00 lines into a file $HOME/.hivehistory.

Shell Execution



You don’t need to leave the hive CLI to run simple bash shell commands. Simply type “!” followed by the
command and terminate the line with a semicolon (;):
hive> ! /bin/echo "what up dog";
"what up dog"
hive> ! pwd;
/home/me/hive/bin
Shell “pipes” don’t work and neither do file “globs.” For example, ! ls *.hql; will look for a file named *.hql;,
rather than all files that end with the .hql extension.
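The behavior with pipes and globs is what you get whenever a program is started directly rather than through a shell, because expanding *.hql is the shell's job. The Python sketch below shows the same effect; it is an analogy and assumes that Hive launches the command without an intermediate shell, which is consistent with the behavior described above.

```python
import subprocess

# Passing an argument list (no shell=True) starts the program directly,
# so no shell is present to expand "*.hql" or to interpret "|".
result = subprocess.run(["/bin/echo", "*.hql"], capture_output=True, text=True)
literal = result.stdout.strip()
print(literal)  # the pattern arrives at the program as a literal string
```

If you need pipes or globs, run the command from a real shell outside the CLI, or wrap it in a script and call the script with "!".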

Hadoop Dfs Command


You can run the hadoop dfs ... commands from within the hive CLI; just drop the hadoop word from the
command and add the semicolon at the end:
hive> dfs -ls / ;
Found 3 items
drwxr-xr-x - root supergroup 0 2011-08-17 16:27 /etl
drwxr-xr-x - edward supergroup 0 2012-01-18 15:51 /flag
drwxrwxr-x - hadoop supergroup 0 2010-02-03 17:50 /users
This method of accessing hadoop commands is actually more efficient than using the hadoop dfs ... equivalent
at the bash shell, because the latter starts up a new JVM instance each time, whereas Hive just runs the
same code in its current process.

Comments in Hive Scripts


As of Hive v0.8.0, you can embed lines of comments that start with the string --, for example:
-- Copyright (c) 2012 Megacorp, LLC.

Data Types and File Formats


Hive Data Types
Hive supports both primitive as well as complex data type.

Data types
• Primitive
  Numeric
    Integral: TINYINT, SMALLINT, INT, BIGINT
    Floating: FLOAT, DOUBLE
    DECIMAL
  String: STRING, VARCHAR, CHAR
  Date/Time: TIMESTAMP, DATE, INTERVAL
  Miscellaneous: BINARY, BOOLEAN
• Collection: ARRAY, MAP, STRUCT

Numeric Types
TINYINT (1-byte signed integer, from -128 to 127)
SMALLINT (2-byte signed integer, from -32,768 to 32,767)
INT/INTEGER (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
FLOAT (4-byte single precision floating point number)
DOUBLE (8-byte double precision floating point number)
DOUBLE PRECISION (alias for DOUBLE, only available starting with Hive 2.2.0)
DECIMAL
Introduced in Hive 0.11.0 with a precision of 38 digits
Hive 0.13.0 introduced user-definable precision and scale
NUMERIC (same as DECIMAL, starting with Hive 3.0.0)
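The ranges of the integral types follow directly from their sizes in bytes, since each is a signed two's-complement integer. A small Python sketch (the helper name signed_range is made up for illustration) reproduces the limits listed above:

```python
def signed_range(num_bytes):
    """Smallest and largest value of a signed integer stored in num_bytes bytes."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

INTEGRAL_TYPES = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}

for name, size in INTEGRAL_TYPES.items():
    low, high = signed_range(size)
    print(f"{name}: {low:,} to {high:,}")
```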

Date/Time Types

TIMESTAMP (Note: Only available starting with Hive 0.8.0)


DATE (Note: Only available starting with Hive 0.12.0)
INTERVAL (Note: Only available starting with Hive 1.2.0)
String Types
STRING
VARCHAR (Note: Only available starting with Hive 0.12.0)
CHAR (Note: Only available starting with Hive 0.13.0)
Misc Types
BOOLEAN
BINARY (Note: Only available starting with Hive 0.8.0)
NOTE
It’s useful to remember that each of these primitive data types is implemented in Java, so the particular
behavior details will be exactly what you would expect from the corresponding Java types. For example,
STRING is implemented by the Java String, FLOAT is implemented by Java float, etc.
As for other SQL dialects, the case of these names is ignored.
Hive relies on the presence of delimiters to separate fields.
Values of the new TIMESTAMP type can be integers, which are interpreted as seconds since the Unix
epoch time (Midnight, January 1, 1970), floats, which are interpreted as seconds since the epoch time
with nanosecond resolution (up to 9 decimal places), and strings, which are interpreted according to the
JDBC date string format convention, YYYY-MM-DD hh:mm:ss.fffffffff.
If a table schema specifies three columns and the data files contain five values for each record, the last
two will be ignored by Hive.
The BINARY type is similar to the VARBINARY type found in many relational databases. It’s not like a
BLOB type, since BINARY columns are stored within the record, not separately like BLOBs.
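The three ways of writing a TIMESTAMP value mentioned in the note above can be illustrated with a short Python sketch. It follows the description given here (numbers are seconds since the epoch) and is not Hive's implementation; the function names are made up, UTC is assumed, and Python keeps only microseconds rather than nanoseconds. Some Hive versions interpret integer values as milliseconds, so it is worth checking the behavior on your version.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def from_epoch_seconds(seconds):
    """Integer or float seconds since the Unix epoch (UTC assumed for this sketch)."""
    return EPOCH + timedelta(seconds=seconds)

def from_jdbc_string(text):
    """Parse 'YYYY-MM-DD hh:mm:ss[.fffffffff]'; digits beyond microseconds are dropped."""
    if "." in text:
        head, fraction = text.split(".", 1)
        text = f"{head}.{fraction[:6]}"
        fmt = "%Y-%m-%d %H:%M:%S.%f"
    else:
        fmt = "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)

print(from_epoch_seconds(86400))
print(from_epoch_seconds(1.5))
print(from_jdbc_string("2012-02-03 12:34:56.123456789"))
```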

Collection Data Types


Hive supports columns that are structs, maps, and arrays.
STRUCT: Analogous to a C struct or an “object.” Fields can be accessed using the “dot” notation.
For example, if a column name is of type STRUCT {first STRING; last STRING}, then the first name field
can be referenced using name.first.
Literal syntax example: struct('John', 'Doe')

MAP: A collection of key-value tuples, where the fields are accessed using array notation (e.g., ['key']).
For example, if a column name is of type MAP with key→value pairs 'first'→'John' and 'last'→'Doe',
then the last name can be referenced using name['last'].
Literal syntax example: map('first', 'John', 'last', 'Doe')

ARRAY: Ordered sequences of the same type that are indexable using zero-based integers.
For example, if a column name is of type ARRAY of strings with the value ['John', 'Doe'], then the
second element can be referenced using name[1].
Literal syntax example: array('John', 'Doe')
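If you know Python, the three collection types and their access notations have close counterparts there. The sketch below is only an analogy for reading the notation; the variable names are made up and it says nothing about how Hive stores these values.

```python
from collections import namedtuple

# Hypothetical stand-ins for the collection columns of one row.
Name = namedtuple("Name", ["first", "last"])

struct_col = Name("John", "Doe")             # STRUCT {first STRING; last STRING}
map_col = {"first": "John", "last": "Doe"}   # MAP of STRING to STRING
array_col = ["John", "Doe"]                  # ARRAY of STRING

print(struct_col.first)   # like name.first in Hive
print(map_col["last"])    # like name['last'] in Hive
print(array_col[1])       # like name[1] in Hive (zero-based)
```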

Most relational databases don’t support such collection types, because using them tends to break normal
form.
A practical problem with breaking normal form is the greater risk of data duplication, leading to
unnecessary disk space consumption and potential data inconsistencies, as duplicate copies can grow out
of sync as changes are made.
However, in Big Data systems, a benefit of sacrificing normal form is higher processing throughput.
Scanning data off hard disks with minimal “head seeks” is essential when processing terabytes to
petabytes of data. Embedding collections in records makes retrieval faster with minimal seeks.
Navigating each foreign key relationship requires seeking across the disk, with significant performance
overhead.
File Format
A file format is the way in which information is stored or encoded in a computer file. In Hive it refers to how
records are stored inside the file. As we are dealing with structured data, each record has its own
structure, and how records are encoded in a file defines the file format. File formats differ mainly in
data encoding, compression rate, usage of space, and disk I/O.
Hive does not verify whether the data that you are loading matches the schema for the table or not.
However, it verifies if the file format matches the table definition or not.
By default Hive supports the following file formats:
• TEXTFILE
• SEQUENCEFILE
• RCFILE
• ORCFILE
• Parquet (Hive 0.13.0)
TEXTFILE
TEXTFILE is a common input/output format in Hadoop. If we define a table as TEXTFILE, Hive can load
delimited text into it, such as CSV (Comma Separated Values) or tab- or space-delimited data, as well as
JSON (JavaScript Object Notation) data. With the TEXTFILE format, each line is considered a record
by default.
create table olympic
(athelete STRING, age INT, country STRING, year STRING, closing STRING, sport STRING, gold INT,
silver INT, bronze INT,total INT)
row format delimited
fields terminated by '\t'
stored as TEXTFILE;
At the end of the statement, we specify the file format. If we do not specify anything, Hive treats the table
as TEXTFILE.
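To make the "one line is one record, one delimiter separates fields" rule concrete, here is a small Python sketch of how a tab-delimited line maps onto a table schema. It is an illustration rather than Hive's reader: the function name, the shortened three-column schema, and the sample line are all made up, and the handling of a missing trailing field as NULL (None) is an assumption of the sketch. It also shows the earlier note that values beyond the declared columns are ignored.

```python
def parse_record(line, columns, delimiter="\t"):
    """Split one text line into fields, keeping only as many values as the schema has columns."""
    values = line.rstrip("\n").split(delimiter)
    record = {}
    for index, (name, convert) in enumerate(columns):
        if index < len(values):
            record[name] = convert(values[index])
        else:
            record[name] = None  # assumption: a missing trailing field reads as NULL
    return record

# A shortened, hypothetical version of the olympic schema.
columns = [("athelete", str), ("age", int), ("country", str)]
line = "name1\t23\tcountryA\t2008\tswimming"  # five values for a three-column schema
print(parse_record(line, columns))
```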

SEQUENCEFILE
Hadoop performs better with a small number of large files than with a large number of small files. A file
smaller than the typical block size in Hadoop is considered a small file. Many small files increase the amount
of metadata, which becomes an overhead for the NameNode. Sequence files were introduced in Hadoop to solve
this problem: they act as a container for storing small files.
Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to
MapReduce jobs, it decides on the appropriate key-value pairs to be used for a given record. Sequence
files are in a binary format that can be split, and their main use is to combine two or more
smaller files into one sequence file.
In Hive we can create a sequence file table by specifying STORED AS SEQUENCEFILE at the end of a CREATE
TABLE statement. There are three types of sequence files:
• Uncompressed key/value records.
• Record compressed key/value records – only ‘values’ are compressed here
• Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and
compressed. The size of the ‘block’ is configurable.
Hive has its own SEQUENCEFILE reader and SEQUENCEFILE writer libraries for reading and writing
through sequence files.
CREATE TABLE olympic_sequencefile (athelete STRING, age INT, country STRING, year STRING)
row format delimited
fields terminated by '\t'
stored as sequencefile;

HiveQL: Data Definition


Hive Database
Introduction to Hive Database
From the Hive 0.14.0 release onwards, a Hive DATABASE is also called a SCHEMA, so SCHEMA and
DATABASE mean the same thing in Hive. A database (or schema) in Hive is a catalog of tables. With
database names, users can have the same table name in different databases, so in large
organizations teams or users can create their own separate databases to
avoid table name collisions.
The default database in Hive is named default. We do not need to create this database. Any table created
without specifying a database is created under it.
Creating Database
Create Database is a statement used to create a database in Hive. A database in Hive is a namespace or a
collection of tables.
The syntax for this statement is as follows:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

Explanation
(DATABASE|SCHEMA): SCHEMA and DATABASE are interchangeable; they mean the same thing, so either
one can be used.

IF NOT EXISTS: Optional. While normally you might like to be warned if a database of the same name
already exists, the IF NOT EXISTS clause is useful for scripts that should create a database on the fly,
if necessary, before proceeding. If it is specified and a database with the same name already exists,
Hive simply skips the step.

COMMENT: Optional. Adds a description of the database. The text given here is displayed by the
DESCRIBE command.

LOCATION: Optional. Specifies the HDFS directory to use for the database. When it is omitted, Hive
creates a directory named after the database with the extension “.db” under the warehouse directory.

WITH DBPROPERTIES: Optional. Specifies database properties as key-value pairs.

Example

CREATE DATABASE IF NOT EXISTS financials
COMMENT 'It will Hold financial data of Company'
LOCATION '/etl/hive/data'
WITH DBPROPERTIES ('Edited-By' = 'Himanshu');

The above create database statement uses the directory '/etl/hive/data' in HDFS as the directory of the
financials database. If we don’t specify a location, Hive creates a directory named “financials.db” under
the path given by the property hive.metastore.warehouse.dir.
Hive creates a directory for each database, and tables in that database are stored in subdirectories of
the database directory. The exception is the default database, which doesn’t have its own directory under
/user/hive/warehouse, so a table in the default database has its directory created directly in
/user/hive/warehouse (unless explicitly overridden).
When you create a table without creating or selecting a database, the table is stored under the
default database.
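The directory rules can be summarized in a few lines of Python. This is a sketch of the naming convention only, not of anything Hive executes: the function names are made up, the warehouse path is the one used in this section, and the sketch assumes that an explicit LOCATION is used as the database directory itself.

```python
import posixpath

WAREHOUSE_DIR = "/user/hive/warehouse"  # hive.metastore.warehouse.dir in this section's examples

def database_dir(name, location=None, warehouse=WAREHOUSE_DIR):
    """Directory that holds the tables of a database."""
    if location is not None:
        return location                 # assumption: LOCATION is used as given
    if name == "default":
        return warehouse                # default has no directory of its own
    return posixpath.join(warehouse, name + ".db")

def table_dir(database, table, location=None, warehouse=WAREHOUSE_DIR):
    """Directory of a table created without its own LOCATION clause."""
    return posixpath.join(database_dir(database, location, warehouse), table)

print(database_dir("financials"))
print(database_dir("financials", location="/etl/hive/data"))
print(table_dir("default", "mytable"))
```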
Describe database
After creating a database, you can see the properties associated with it using the DESCRIBE
command.
Syntax
hive> (DESCRIBE|DESC) (DATABASE|SCHEMA) [EXTENDED] database_name;
Either DESCRIBE or DESC must be used.
Either DATABASE or SCHEMA must be used.
EXTENDED is optional. It is used to retrieve more information about the database.

Example
hive> DESCRIBE DATABASE financials;
financials hdfs:/etl/hive/data

You can see more output using the EXTENDED keyword.
hive> DESCRIBE DATABASE EXTENDED financials;
financials hdfs:/etl/hive/data
(‘Edited-By’ = ’Himanshu’)

Use Databases
The USE command sets the database used for subsequent Hive operations. Because the Hive CLI starts in the
default database, we need to switch databases to work in a custom one.
USE sets a database as your working database, analogous to changing working directories
in a filesystem:
hive> USE financials;
OK
Time taken: 1.051 seconds

Show Databases
The SHOW DATABASES command lists the databases in Hive.
Syntax
SHOW (DATABASES|SCHEMAS) [LIKE identifier_with_wildcards];
By default, SHOW DATABASES lists all of the databases defined in the metastore.
LIKE – It is optional. It allows us to filter the database names using a pattern.
The only wildcards allowed in the pattern are '*' for any character(s) and '|' for a choice.
Examples are 'employees', 'emp*', and 'emp*|*ees' (emp* or *ees), all of which will match the database named
'employees'.
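The two wildcards are easy to reason about if you translate the pattern into an ordinary regular expression: '*' becomes "any characters" and '|' separates alternatives. The Python sketch below does that translation for illustration; the function names are made up and this is not how Hive itself implements the filter.

```python
import re

def hive_pattern_to_regex(pattern):
    """Translate a SHOW DATABASES pattern ('*' wildcard, '|' choice) into a regex."""
    alternatives = []
    for choice in pattern.split("|"):
        pieces = [re.escape(piece) for piece in choice.split("*")]
        alternatives.append(".*".join(pieces))
    return re.compile("(?:" + "|".join(alternatives) + ")")

def matching_databases(pattern, names):
    """Names that match the whole pattern, in their original order."""
    regex = hive_pattern_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]

names = ["default", "test_db", "test_db2", "employees"]
print(matching_databases("*db*", names))
print(matching_databases("emp*|*ees", names))
```

Note that a pattern without wildcards, such as 'emp', must match the whole name, so it does not match 'employees'.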

Examples
Below is sample output of the show databases command on a system that contains two databases, test_db and
test_db2, in addition to default.
hive> show databases;
OK
default
test_db
test_db2
Time taken: 0.072 seconds, Fetched: 3 row(s)
hive> SHOW DATABASES LIKE '*db*';
OK
test_db
test_db2
Time taken: 0.014 seconds, Fetched: 2 row(s)
hive>



There is no command to show you which database is your current working database. Hive provides the
property hive.cli.print.current.db to print the current database as part of the prompt
(Hive v0.8.0 and later). It is false by default; set it to true using the SET command:
hive> set hive.cli.print.current.db=true;
hive (financials)> USE default;
OK
Time taken: 1.051 seconds
hive (default)>
Alter Database
We can alter databases with the ALTER command in Hive, but it allows only minimal changes.
We can:
• Assign new (key, value) pairs to DBPROPERTIES
• Set the OWNER of the database to a user or role (Hive 0.13.0 and later)
• Change the location of the database (Hive 2.2.1 and later)
Its limitations are:
• We can’t unset a property using the ALTER command
• No other metadata about the database can be changed, including its name
Syntax
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES
(property_name=property_value, ...);
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;

Example
Let’s add a new property 'Modified by' to the test_db database; the result can be seen with
DESCRIBE DATABASE EXTENDED.
hive> ALTER SCHEMA test_db SET DBPROPERTIES ('Modified by' = 'Sekhar');
OK
Time taken: 0.414 seconds

Drop Database
Finally, you can drop a database:
Syntax
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Either DATABASE or SCHEMA must be used.
IF EXISTS – It is optional and suppresses the warning if database_name doesn’t exist.
RESTRICT – This is optional, and it is the same as the default Hive behavior, i.e. the database cannot be
dropped until all the tables inside it are dropped.
CASCADE – This argument allows a non-empty database to be dropped with a single command. DROP with
CASCADE is equivalent to dropping all the tables separately and then dropping the database.
Example
hive> DROP DATABASE IF EXISTS financials;
By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables
first or append the CASCADE keyword to the command, which will cause Hive to drop the tables in
the database first:
hive> DROP DATABASE IF EXISTS financials CASCADE;
Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where existing
tables must be dropped before dropping the database.
When a database is dropped, its directory is also deleted.
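The interplay of IF EXISTS, RESTRICT, and CASCADE can be summarized as a small decision procedure. The Python toy model below is only meant to make the rules explicit; the function and the in-memory dictionary are made up for illustration and do not reflect how the metastore works.

```python
def drop_database(databases, name, if_exists=False, cascade=False):
    """Toy model: `databases` maps a database name to the set of its table names."""
    if name not in databases:
        if if_exists:
            return False  # nothing to drop, and no error is raised
        raise KeyError(f"database {name} does not exist")
    if databases[name] and not cascade:
        # RESTRICT (the default): refuse while tables remain
        raise RuntimeError(f"database {name} is not empty")
    del databases[name]   # with CASCADE the tables go together with the database
    return True

databases = {"financials": {"employees"}, "test_db": set()}
drop_database(databases, "test_db")                    # empty, so the default is enough
drop_database(databases, "financials", cascade=True)   # non-empty, needs CASCADE
print(sorted(databases))
```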
Hive Tables
Introduction to Hive Tables



In Hive, a table is a collection of homogeneous data records that all share the same schema.
Hive Table = Data stored in HDFS + Metadata (schema of the table) stored in an RDBMS
Hive metadata is stored in the Hive metastore, which is backed by an RDBMS (the default is Derby, but it can be
configured to use MySQL, PostgreSQL, Oracle, MS SQL Server, etc.). So Hive metadata is not
stored on HDFS. Hive table data can also be stored on the local filesystem when running in local mode.
By default, Hive creates a table’s directory under the directory of the enclosing database. If a
table is not created under a specific database, Hive places it in the default database.
For every database, Hive creates a directory in the default warehouse location or in the specified location.
The exception is the default database. It doesn’t have a directory under /user/hive/warehouse, so a
table in the default database will have its directory created directly in /user/hive/warehouse (unless
explicitly overridden).
Types of Hive Table
Based on the ownership of data, Hive has two types of table.
1. Managed Tables – Default table type in Hive
• Table data is managed by Hive, which moves the data into its warehouse directory configured by
hive.metastore.warehouse.dir (by default /user/hive/warehouse).
• If the table is dropped, both the data (the entire table directory structure) and the metadata (schema)
are deleted, i.e. these tables are owned by Hive.
• Less convenient to share with other tools like Pig, HBase, etc., as the data is maintained by Hive and
can be deleted without informing these tools.
2. External Tables
• These tables are not managed or owned by Hive. Table data is not copied into the Hive warehouse
directory but is maintained at an external location.
• If the table is dropped, only the schema is deleted from the metastore, not the data files at the
external location.
• Convenient for sharing table data with other tools like Pig, HBase, etc.
• The LOCATION clause is normally given when creating an external table. If it is omitted, Hive places the
table’s directory under its warehouse directory, but the table is still external, so dropping it leaves the
data in place.
Difference between Internal Table and External Table
Internal table: An internal table is also called a managed table, meaning it’s “managed” by Hive. When you
drop an internal table, both the schema (or definition) in the metastore and the physical data (i.e. the
table file structure) in the Hadoop Distributed File System (HDFS) are dropped.
External table: An external table is not “managed” by Hive. When you drop an external table, the
schema/table definition is deleted from the metastore, but the data/rows associated with it in HDFS are
left alone, i.e. the table’s rows are not deleted.

Internal table: The LOCATION clause is not mandatory. If LOCATION is not given, Hive creates the table
directory structure inside the warehouse directory named by the hive.metastore.warehouse.dir parameter.
External table: The LOCATION clause is normally given so that the table points at data stored outside the
warehouse directory. If it is omitted, the table directory is created under the warehouse directory, but
the table is still external.

Internal table: Without the Trash facility, when an internal table is dropped the data is deleted forever
along with the table. There is no chance of getting the data or the table back.
External table: Without the Trash facility, when an external table is dropped only the table schema is
deleted, whereas the underlying data remains untouched. So we can recreate the table at any time and find
the data right back where it was before.
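The difference in drop behavior can be shown with a toy model in Python that uses a temporary local directory in place of HDFS. Everything here (the ToyMetastore class, the table names, the directory layout) is hypothetical and only illustrates the rule: dropping always removes the metadata, and removes the data only for a managed table.

```python
import os
import shutil
import tempfile

class ToyMetastore:
    """Hypothetical model: table name -> (data directory, is_external)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, data_dir, external=False):
        os.makedirs(data_dir, exist_ok=True)
        self.tables[name] = (data_dir, external)

    def drop_table(self, name):
        data_dir, external = self.tables.pop(name)  # the metadata is always removed
        if not external:
            shutil.rmtree(data_dir)                 # managed: the data is removed too

root = tempfile.mkdtemp()
managed_dir = os.path.join(root, "warehouse", "managed_t")
external_dir = os.path.join(root, "etl", "external_t")

store = ToyMetastore()
store.create_table("managed_t", managed_dir)
store.create_table("external_t", external_dir, external=True)
store.drop_table("managed_t")
store.drop_table("external_t")

managed_left = os.path.exists(managed_dir)
external_left = os.path.exists(external_dir)
print(managed_left, external_left)
shutil.rmtree(root)  # clean up the temporary directory
```

After both drops the managed table's directory is gone while the external table's directory is still there, so an external table can be recreated later over the same data.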

When to use External and Managed table

Managed table
• The data is temporary.
• You want Hive to manage the table data completely, without any external source using the table.
• You don’t want the data to remain after the table is dropped.
External table
• The data is also used outside of Hive. For example, the data files are read and processed by an existing
program that doesn’t lock the files.
• Hive should not own the data and control settings, directories, etc.; another program or process will
do those things.
• You are not creating the table based on an existing table (AS SELECT).
• You want to be able to recreate the table with the same schema and point it at the location of the data.

Creating Table
Complex Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO
num_buckets BUCKETS]
[ SKEWED BY (col_name, ...) ON ([(col_value, ...), ...|col_value, ...])
[STORED AS DIRECTORIES] ]
[ [ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];

Explanation
TEMPORARY – Specified to create a temporary table (Hive 0.14.0 and later).
EXTERNAL – Specified to make the table external.
IF NOT EXISTS – It is optional. It suppresses the error when a table with the same name already exists. Hive
then skips the creation, even if the schema of the existing table differs from that of the new table.
db_name – This is also optional. It specifies the target database for the table when it is not the current
working database.
COMMENT – This is also optional. As with the CREATE DATABASE statement, comments (strings within
single quotes) can be added to the table as well as to columns to provide descriptive information
to users.
PARTITIONED BY – This clause partitions the table based on particular columns. Detailed
discussion of partitioning is deferred to the separate post “Partitioning and Clustering tables in
Hive”.
SKEWED BY – This clause is used to create skewed tables.



CLUSTERED BY – Divides the table or each partition into a fixed number of buckets based on the listed
columns, optionally sorted within each bucket. It is covered together with partitioning in the
discussion of partitioning and clustering tables in Hive.
ROW FORMAT – Specifies the format of each row in the input data and of each field within a row. If the
fields are delimited by certain characters, use the DELIMITED sub-clause; otherwise provide a SERDE that
can serialize and deserialize the input records.
STORED AS – Specifies the storage file format. The available file formats for Hive table creation are
SEQUENCEFILE, TEXTFILE, RCFILE, PARQUET, ORC and AVRO. The clause also accepts INPUTFORMAT and
OUTPUTFORMAT, followed by input_format_classname and output_format_classname.
STORED BY class_name [WITH SERDEPROPERTIES (...)] – An alternative to the two clauses above
(ROW FORMAT and STORED AS) that names a custom storage handler class and its SerDe properties.
LOCATION – The HDFS directory that holds the table data.
TBLPROPERTIES – Attaches metadata key/value pairs to the table. The last_modified_by and
last_modified_time properties are automatically added to the table properties and managed by Hive.
Some predefined table properties are
• TBLPROPERTIES ("comment"="table_comment")
• TBLPROPERTIES ("hbase.table.name"="table_name") //for hbase integration
• TBLPROPERTIES ("immutable"="true") or ("immutable"="false")
• TBLPROPERTIES ("orc.compress"="ZLIB") or ("orc.compress"="SNAPPY") or
("orc.compress"="NONE")
• TBLPROPERTIES ("transactional"="true") or ("transactional"="false"), the default is "false"
• TBLPROPERTIES ("NO_AUTO_COMPACTION"="true") or ("NO_AUTO_COMPACTION"="false"), the
default is "false"
AS select_statement – Creates the table with the schema of select_statement (another query) and
populates it with the rows that query returns. It is also known as the CTAS (Create Table As Select)
clause.
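As a small illustration of the AS select_statement clause, the sketch below builds a new table from a query. It assumes that a table mydb.employees with name and salary columns already exists; the target table name and the salary threshold are made up for the example.

```sql
-- Hypothetical CTAS example: well_paid_employees and the threshold are illustrative only.
-- The new table takes its columns (name, salary) from the SELECT list.
CREATE TABLE IF NOT EXISTS mydb.well_paid_employees
STORED AS ORC
AS
SELECT name, salary
FROM mydb.employees
WHERE salary > 100000;
```

No column list is given because the schema comes from the query. Note that a CTAS target cannot be an external table.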

Modified Simple Syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO
num_buckets BUCKETS]
[ [ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]

Example
Suppose we have an employee dataset with five columns: name, salary, subordinates, deductions and
address.
name is of STRING type and holds the name of the employee
salary is of FLOAT type and holds the salary of the employee
subordinates is of ARRAY of STRING type and holds the names of the subordinates of the corresponding
employee
deductions is of MAP type and holds the deduction name as key and the deduction percentage as
value
address is of STRUCT type and holds the address of the employee: street, city, state and zip.

So the data should look like this

Now we will create Hive table for this data

Creating Managed Table with Different Data types

CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\t'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
TBLPROPERTIES ('creator'='Himanshu', 'created_at'='2017-27-09 10:00:00');
If you add the option IF NOT EXISTS, Hive will silently ignore the statement if the table already exists.
This is useful in scripts that should create a table the first time they run. However, if the schema specified
differs from the schema in the table that already exists, Hive won’t warn you.
Hive automatically adds two table properties: last_modified_by holds the username of the last user to
modify the table, and last_modified_time holds the epoch time in seconds of that modification.
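Once the employees table holds data, its ARRAY, MAP and STRUCT columns can be read with index, key and dot notation. The query below is a minimal sketch; the map key 'Federal Taxes' is a hypothetical deduction name and may not exist in your data.

```sql
-- Sketch of reading complex-typed columns from mydb.employees.
-- 'Federal Taxes' is an assumed map key, used only for illustration.
SELECT name,
       subordinates[0]             AS first_subordinate,  -- ARRAY: zero-based index
       deductions['Federal Taxes'] AS federal_tax_pct,    -- MAP: lookup by key
       address.city                AS city                -- STRUCT: dot notation
FROM mydb.employees
WHERE size(subordinates) > 0;                             -- only employees with subordinates
```

An index or key that is not present returns NULL rather than raising an error.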
Because no LOCATION was given, the CREATE TABLE statement above creates the directory for the
employees table under the default warehouse directory.
The default location for a managed table can be changed in two ways.
1. By changing the hive.metastore.warehouse.dir parameter in the hive-site.xml file as follows, and
giving Hive read/write permission on the specified directory.
<property>
<name>hive.metastore.warehouse.dir</name>
<value>YOUR_HDFS_PATH</value>
</property>
2. By using the LOCATION clause in the CREATE TABLE statement
CREATE TABLE employee (name String, dept String)
LOCATION '/user/hive/warehouse/database/emp';
Make sure Hive has read/write permission on the "emp" directory, because a managed table stores its
data in the directory specified.
A Hive table can also be created by copying the schema of an existing table with the LIKE keyword.
Only the table schema is copied, not the data.
CREATE TABLE IF NOT EXISTS mydb.employees2 LIKE mydb.employees;

Here the schema of the employees table is copied to the employees2 table.


When a table is created without a LOCATION clause, we have to load data into the table manually
using the LOAD command
LOAD DATA INPATH '/hdfs/location/of/data' OVERWRITE INTO TABLE employee;
We can also load data from the local file system by adding the LOCAL keyword to the LOAD statement
LOAD DATA LOCAL INPATH '/local/filesystem/location/of/data' OVERWRITE INTO TABLE employee;

If the OVERWRITE keyword is used, Hive replaces any data already in the table.
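The difference between loading with and without OVERWRITE can be sketched as follows. The paths are the same placeholders used above, not real locations.

```sql
-- Without OVERWRITE: the new files are added next to whatever the table already holds.
LOAD DATA INPATH '/hdfs/location/of/data' INTO TABLE employee;

-- With OVERWRITE: the existing contents of the table are replaced by the new files.
LOAD DATA INPATH '/hdfs/location/of/data' OVERWRITE INTO TABLE employee;

-- With LOCAL: the source is read from the local file system and copied into the table.
LOAD DATA LOCAL INPATH '/local/filesystem/location/of/data' INTO TABLE employee;
```

Keep in mind that a load from an HDFS path moves the source files into the table directory, while a LOCAL load copies them. In both cases Hive does not parse or convert the files during the load.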

Creating External Table with Different Data types


Let's create an external table for the same data set, assuming the data is present in the
/etl/data/empl location in HDFS
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.empl (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address')
COMMENT 'Description of the table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\t'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/etl/data/empl'
TBLPROPERTIES ('creator'='Himanshu', 'created_at'='2017-27-09 10:00:00');

Here the LOCATION clause tells Hive where the existing data resides.
An EXTERNAL table can also be created by copying the schema of an existing table with the LIKE
keyword
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';

Point to Remember when using LIKE to create table


If you omit the EXTERNAL keyword and the original table is external, the new table will also be external.
If you omit EXTERNAL and the original table is managed, the new table will also be managed.
If you include the EXTERNAL keyword and the original table is managed, the new table will be external.
Even in this scenario, the LOCATION clause is still optional.
The LIKE clause also accepts the optional LOCATION clause, but note that no other properties, including
the schema, can be defined; they are determined from the original table.
Some Extra Bits
Temporary Tables (Hive 0.14.0 and later)
As the name suggests, these tables are temporary and available only until the end of the current session.
They are useful as intermediate tables, for example when copying records from one table to another,
where the intermediate table can be discarded after the copy.
The table's data is stored in the user's scratch directory, configured by hive.exec.scratchdir in
hive-site.xml, and deleted at the end of the session.
Temporary tables don't support partitioning or indexing.
Multiple Hive users can create temporary tables with the same name because each table
resides in a separate session.
Warning:
Be careful when naming temporary tables, as Hive does not warn or raise an error if we use a name that
already exists in the database.
If a temporary table is created with the same name as a permanent table that already exists in the
database, the original table cannot be accessed in that session until we drop the temporary table.

Creating temporary table


The session below creates a temporary table, inserts records into it and queries them. At the end, we
view the detailed description of the table.
hive> CREATE TEMPORARY TABLE temp (col1 STRING, col2 INT);
OK
Time taken: 0.176 seconds
hive> INSERT INTO TABLE temp VALUES ('bala', 100), ('siva',200), ('praveen',300);
Query ID = user_20141205145353_3457eac6-9445-4e79-91e1-ddd9bcb345b0
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1417770169357_0002, Tracking URL =
http://localhost:8088/proxy/application_1417770169357_0002/
Kill Command = /home/user/Downloads/hadoop-2.5.0/bin/hadoop job -kill
job_1417770169357_0002
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2014-12-05 14:53:19,091 Stage-1 map = 0%, reduce = 0%
2014-12-05 14:53:26,449 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1417770169357_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/tmp/hive/user/66378b99-32a0-4d6a-8134-
e1455b5ab68e/hive_2014-12-05_14-53-08_427_4849202042416406066-1/-ext-10000
Loading data to table default.temp
Table default.temp stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 18.482 seconds
hive> SELECT * FROM temp;
OK
bala 100
siva 200
praveen 300
Time taken: 0.173 seconds, Fetched: 3 row(s)
hive> DESCRIBE FORMATTED temp;
OK
# col_name data_type comment

col1 string
col2 int

# Detailed Table Information


Database: default
Owner: user
CreateTime: Fri Dec 05 14:52:06 IST 2014
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0

Location: hdfs://localhost:9000/tmp/hive/user/66378b99-32a0-4d6a-8134-
e1455b5ab68e/_tmp_space.db/ffb1b803-ad3f-406c-9137-94156f34cd9b
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 1
numRows 3
rawDataSize 27
totalSize 30

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.022 seconds, Fetched: 31 row(s)
hive>

Skewed Tables

Skewed tables were introduced in Hive 0.10.0 to improve the performance of tables in which one or more
columns have skewed values, that is, a few values that occur very often.
Hive splits the records with the skewed (very frequent) values into separate files and puts the records
with all remaining values into other files. At query time Hive takes this layout into account, so that
it can skip (or include) a whole file based on the filter criteria.
A skewed table is not a separate table type; it can be either managed or external.

CREATE EXTERNAL TABLE User_Skewed(
first_name VARCHAR(64),
last_name VARCHAR(64),
company_name VARCHAR(64),
address STRUCT<zip:INT, street:STRING>,
country VARCHAR(64),
city VARCHAR(32),
state VARCHAR(32),
post INT,
phone_nos ARRAY<STRING>,
mail MAP<STRING, STRING>,
web_address VARCHAR(64)
)
COMMENT 'Skewed table for testing purpose'
SKEWED BY (country) ON ('AU')
STORED AS TEXTFILE;

Comparison with Partitioned Tables and Skewed Tables


Skewing is similar to partitioning, but it is recommended when only a few values occur very often in
the input.
For example, suppose we partition a table by country and there are 200 countries in the input file, but
80% of the records come from only US, UK, IN and JPN. It is then better to skew by country on those four
values. Skewing creates only 5 separate files/directories (4 for US, UK, IN and JPN and 1 for all the
rest), whereas partitioning creates 200 directories, making the structure very complex.
One of the main disadvantages of partitioning is that HDFS scalability becomes an issue as the number
of partitions grows. For example, if there are 1000 mappers and 1000 partitions, and each mapper gets at
least 1 row for each key, we end up creating 1 million intermediate files, and the NameNode's memory
comes under pressure from storing metadata about all these files.
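The four-country case described above could be declared roughly as follows. This is a sketch with a hypothetical table name and a reduced column list. The optional STORED AS DIRECTORIES clause asks Hive to keep each skewed value in its own subdirectory (list bucketing).

```sql
-- Hypothetical skewed table: the table name and columns are illustrative.
CREATE TABLE user_skewed_by_country (
  first_name VARCHAR(64),
  last_name  VARCHAR(64),
  country    VARCHAR(64)
)
SKEWED BY (country) ON ('US', 'UK', 'IN', 'JPN')  -- the four very frequent values
STORED AS DIRECTORIES;                            -- one subdirectory per skewed value
```

The rows are split by skewed value only when Hive itself writes the data, for example through an INSERT ... SELECT from a staging table. Files placed with LOAD DATA are moved as they are.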

Show Tables
The SHOW TABLES command lists the tables. With no additional arguments, it shows the tables in the
current working database. Let’s assume we have already created a few other tables, table1 and
table2, and we did so in the mydb database:
hive> USE mydb;
hive> SHOW TABLES;
OK
employees
table1
table2
Time taken: 18.482 seconds
If we aren't in the same database, we can still list the tables in that database:
hive> USE default;
hive> SHOW TABLES IN mydb;
OK
employees
table1
table2
Time taken: 18.482 seconds

If we have a lot of tables, we can limit the ones listed using a regular expression, a concept discussed
in detail in the "LIKE and RLIKE" section:
hive> USE mydb;
hive> SHOW TABLES 'empl.*';
employees

The regular expression in the single quotes matches all tables whose names start with empl and end with
any other characters (the .* part).

Alter Table
Most table properties can be altered with ALTER TABLE statements. ALTER TABLE modifies only the
metadata about the table; the data itself is untouched.
Rename Table
ALTER TABLE table_name RENAME TO new_table_name;

The command above renames the table to the new name.

Alter Table Properties
ALTER TABLE log_messages SET TBLPROPERTIES (
'notes' = 'The process id is no longer captured; this column is always NULL');
You can use this statement to add your own metadata to a table. Currently the
last_modified_by and last_modified_time properties are automatically added and managed by
Hive. Users can add their own properties to this list.
You can run DESCRIBE EXTENDED table_name to see this information.
Changing a MANAGED table to an EXTERNAL table
We can change a managed table to an external one
ALTER TABLE <table> SET TBLPROPERTIES('EXTERNAL'='TRUE');
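A short sketch of the conversion and a way to check it, using the log_messages table from the earlier examples as a stand-in for any managed table:

```sql
-- Mark the table as external; the data files are not moved or changed.
ALTER TABLE log_messages SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');

-- Inspect the result: the "Table Type" row of the output shows the current type.
DESCRIBE FORMATTED log_messages;

-- Switch back to a managed table if needed.
ALTER TABLE log_messages SET TBLPROPERTIES ('EXTERNAL' = 'FALSE');
```

After the first statement, dropping the table removes only the metadata, as for any other external table.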

Alter Table Comment
To change the comment of a table you have to change the comment property of the TBLPROPERTIES:
ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);
Changing Columns
You can rename a column, change its position, type, or comment:
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp' AFTER severity;

You have to specify the old name, a new name, and the type, even if the name or type is not changing.
The keyword COLUMN is optional as is the COMMENT clause.
If you aren’t moving the column, the AFTER other_column clause is not necessary. In the example shown,
we move the column after the severity column. If you want to move the column to the first position, use
FIRST instead of AFTER other_column. As always, this command changes metadata only. If you are
moving columns, the data must already match the new schema or you must change it to match by some
other means.
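Two further variations of CHANGE COLUMN, sketched against the same log_messages table. The column types are assumed to be STRING, and the comment text is made up.

```sql
-- Move a column to the first position; name and type are repeated unchanged.
ALTER TABLE log_messages
CHANGE COLUMN severity severity STRING FIRST;

-- Change only the comment; the column stays where it is.
ALTER TABLE log_messages
CHANGE COLUMN message message STRING
COMMENT 'Free-form text of the log message';
```

In both statements the old name, the new name and the type must still be given, even though only the position or the comment changes.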
Adding Columns
You can add new columns to the end of the existing columns, before any partition columns.
ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id BIGINT COMMENT 'The current session id');

The COMMENT clauses are optional, as usual. If any of the new columns are in the wrong position, use an
ALTER TABLE table CHANGE COLUMN statement for each one to move it to the correct position.

Deleting or Replacing Columns


The following example removes all the existing columns and replaces them with the new columns specified:
ALTER TABLE log_messages REPLACE COLUMNS (
hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
severity STRING COMMENT 'The message severity',
message STRING COMMENT 'The rest of the message');

This statement effectively renames the original hms column and removes the server and process_id
columns from the original schema definition. As for all ALTER statements, only the table metadata is
changed.
The REPLACE statement can only be used with tables that use one of the native SerDe modules:
DynamicSerDe or MetadataTypedColumnsetSerDe. Recall that the SerDe determines how records are
parsed into columns (deserialization) and how a record’s columns are written to storage (serialization).

Dropping Tables
The familiar DROP TABLE command from SQL is supported:
DROP TABLE IF EXISTS employees;

The IF EXISTS keywords are optional. If not used and the table doesn’t exist, Hive returns an error.
For managed tables, the table metadata and data are deleted.
For external tables, the metadata is deleted but the data is not.

Actually, if you enable the Hadoop Trash feature, which is not on by default, the data is moved to the .Trash
directory in the distributed filesystem for the user, which in HDFS is /user/$USER/.Trash. To enable this
feature, set the property fs.trash.interval to a reasonable positive number.
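If the data of a managed table should bypass the trash even when the trash feature is enabled, Hive 0.14.0 and later accept a PURGE keyword on DROP TABLE. A minimal sketch:

```sql
-- PURGE skips the .Trash directory, so the data cannot be recovered afterwards.
DROP TABLE IF EXISTS employees PURGE;
```

Use it with care, since it removes the safety net described above.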

Describing table
As with databases, we can view a table and its properties using the DESCRIBE command
hive> DESCRIBE employee;
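DESCRIBE comes in three levels of detail. The sketch below uses the employee table created earlier:

```sql
-- Column names, types and comments only.
DESCRIBE employee;

-- Adds the table metadata (location, properties, SerDe) as one compact block.
DESCRIBE EXTENDED employee;

-- Same information as EXTENDED, laid out in a readable, sectioned form.
DESCRIBE FORMATTED employee;
```

The DESCRIBE FORMATTED output has the same layout as the one shown for the temporary table above.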

Storage and File Format
Understanding How Hive Reads/Writes The Data
By default Hive saves data in a text file format. Hive draws a distinction between how records are
encoded into files and how columns are encoded into records.
Reading records from a file is handled by an input format object, specified with the INPUTFORMAT
parameter, and writing records to a file is handled by an output format object, specified with the
OUTPUTFORMAT parameter during table creation. These input/output format objects are Java classes
(compiled modules).
The INPUTFORMAT splits the input stream into <key, value> pairs (unstructured bytes, an intermediate
stage), and the OUTPUTFORMAT formats <key, value> pairs (unstructured bytes, an intermediate stage)
into output streams (i.e., the output of queries).
The record parsing is handled by a serializer/deserializer, or SerDe for short. In the deserialize
operation, the SerDe converts the <key, value> generated by the INPUTFORMAT into a row object, holding
the columns of the record, that Hive can understand. Similarly, in the serialize operation, the SerDe
converts a Hive row object into a <key, value> that the OUTPUTFORMAT can understand.
So the entire process comes out as:
• During a file read operation, the Hive engine internally uses the defined InputFormat to read a
record and convert it into <key, value>. That <key, value> is then passed to the
SerDe.deserialize() method, which converts it into a row object that Hive understands.
• During a file write operation, the Hive engine internally uses the SerDe.serialize() method to
convert the row object into an intermediate <key, value>, which is then passed to the
OutputFormat, which writes it into the file.
When we specify STORED AS TEXTFILE, Hive internally sets the Java classes for
InputFormat, OutputFormat and SerDe:
• For InputFormat it assigns org.apache.hadoop.mapred.TextInputFormat
• For OutputFormat it assigns org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
• For SerDe it assigns org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
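To make the mapping concrete, the statement below spells out by hand the three classes that STORED AS TEXTFILE selects. It is a sketch with a hypothetical table name, and should behave like a plain TEXTFILE table with default delimiters.

```sql
-- Hypothetical table: text_explicit and its columns are illustrative.
-- Same classes as STORED AS TEXTFILE, written out explicitly.
CREATE TABLE text_explicit (col1 STRING, col2 INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```

Swapping any of the three class names is how a custom format or SerDe is plugged in. The shorthand keywords such as TEXTFILE are only a convenience.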

Customizing Table Storage Formats


Apart from the TEXTFILE format, Hive supports the following file formats
SequenceFile
RCFile
ORC Files
Avro Files
Parquet
Custom INPUTFORMAT and OUTPUTFORMAT

SequenceFile
• Hadoop performs better with a small number of large files than with a large number of small files. A
file smaller than the typical Hadoop block size is considered a small file. Many small files increase
the amount of metadata, which becomes an overhead for the NameNode. Sequence files were introduced in
Hadoop to solve this problem: they act as a container for small files.
• Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to
MapReduce jobs, it decides on the appropriate key-value pairs to be used for a given record. Sequence
files are in a splittable binary format, and their main use is to combine two or more smaller files
into one sequence file.
• One benefit of sequence files is that they support block-level compression, so you can compress the
contents of the file while still being able to split the file into segments for multiple map tasks.
• In Hive we can create a sequence file table by specifying STORED AS SEQUENCEFILE at the end of a
CREATE TABLE statement.
• There are three types of sequence files:
  • Uncompressed key/value records.
  • Record compressed key/value records – only the values are compressed here.
  • Block compressed key/value records – both keys and values are collected in blocks separately
and compressed. The size of the block is configurable.
• Hive has its own SEQUENCEFILE reader and SEQUENCEFILE writer libraries for reading and writing
sequence files.

Creating SEQUENCEFILE
CREATE TABLE olympic_sequencefile(
athelete STRING,
age INT,
country STRING,
year STRING,
sport STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;

• When we specify STORED AS SEQUENCEFILE, Hive internally sets the Java classes for
InputFormat, OutputFormat and SerDe:
• For InputFormat it assigns org.apache.hadoop.mapred.SequenceFileInputFormat
• For OutputFormat it assigns
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
• For ROW FORMAT SERDE it assigns org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Note:
• Loading data into this table is different from loading into a table created with the TEXTFILE
format. SEQUENCEFILE is a binary format, and the LOAD command only moves files into the table
directory without converting them, so a plain text file cannot be loaded directly as it can for a
TEXTFILE table. Instead, insert the data from another table so that Hive writes it in the sequence
file format.
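The load described in the note can be sketched as below. It assumes a text-format staging table named olympic with the same five columns already holds the data. The two SET lines are optional and use the Hadoop 2 property name for the sequence file compression type, which may differ on older clusters.

```sql
-- Optional: ask Hive to compress query output, block by block.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;

-- Hive reads the text rows and rewrites them in the sequence file format.
INSERT OVERWRITE TABLE olympic_sequencefile
SELECT athelete, age, country, year, sport
FROM olympic;
```

Because the INSERT goes through the SerDe and OutputFormat of the target table, the conversion from text to sequence file happens as part of the query.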
RCFile
• RCFILE stands for Record Columnar File. It is another binary file format, and it offers a high
compression rate by compressing the columns of a group of rows together.
• RCFILE is used when we want to perform operations on multiple rows at a time.
• RCFILEs are flat files consisting of binary key/value pairs, and they share many similarities with
SEQUENCEFILE. RCFILE stores the columns of a table as records in a columnar manner. It first
partitions rows horizontally into row splits and then partitions each row split vertically, column
by column. RCFILE stores the metadata of a row split as the key part of a record, and all the
data of the row split as the value part. This means that RCFILE favours column oriented storage
over row oriented storage.
• Column-oriented organization is a good storage option for certain types of data and applications. For
example, if a given table has hundreds of columns but most queries use only a few of the columns, it is
wasteful to scan entire rows and then discard most of the data. If the data is stored by column
instead of by row, only the data for the desired columns has to be read, improving performance.
• Column oriented storage is very useful for analytics. It is easier to perform analytics
when we have a column oriented storage type.
• Facebook adopted RCFILE as the default file format for storing data in its data warehouse, where
different types of analytics are performed using Hive.

Creating RCFILE format
CREATE TABLE olympic_rcfile (
athelete STRING,
age INT,
country STRING,
year STRING,
sport STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS RCFILE;
• When we specify STORED AS RCFILE, Hive internally sets the Java classes for
InputFormat, OutputFormat and SerDe:
• For ROW FORMAT SERDE it assigns
'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
• For INPUTFORMAT it assigns 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
• For OUTPUTFORMAT it assigns 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
• We cannot load data into an RCFILE table directly. First we need to load the data into another table
and then overwrite it into the newly created RCFILE table
INSERT OVERWRITE TABLE olympic_rcfile
SELECT * FROM olympic;
• RCFiles cannot be opened with the tools that open typical sequence files. However, Hive provides an
rcfilecat tool to display the contents of RCFiles:
$ bin/hadoop dfs -text /user/hive/warehouse/columntable/000000_0
text: java.io.IOException: WritableName can't load class:
org.apache.hadoop.hive.ql.io.RCFile$KeyBuffer
$ bin/hive --service rcfilecat /user/hive/warehouse/columntable/000000_0

ORC FILE [included in Hive 0.11.0]

• ORC stands for Optimized Row Columnar, which means it stores data in a more optimized way than the
other file formats. ORC can reduce the size of the original data by up to 75% (e.g. a 100GB file
becomes 25GB). As a result the speed of data processing also increases. ORC shows better performance
than the Text, Sequence and RC file formats.
• An ORC file contains row data in groups called stripes, along with a file footer. The ORC format
improves performance when Hive is processing the data.
CREATE TABLE olympic_orcfile (
athelete STRING,
age INT,
country STRING,
year STRING,
sport STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS ORCFILE;
• When we specify STORED AS ORCFILE, Hive internally sets the Java classes for
InputFormat, OutputFormat and SerDe:
• For ROW FORMAT SERDE it assigns 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
• For INPUTFORMAT it assigns 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
• For OUTPUTFORMAT it assigns 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
• We cannot load data into an ORCFILE table directly. First we need to load the data into another table
and then overwrite it into the newly created ORCFILE table.
INSERT OVERWRITE TABLE olympic_orcfile
SELECT * FROM olympic;
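The compression codec of an ORC table can be chosen with the orc.compress table property listed earlier. The sketch below uses a hypothetical table name and again assumes a text-format staging table named olympic with the same columns.

```sql
-- Hypothetical ORC table that uses SNAPPY instead of the default codec.
CREATE TABLE olympic_orc_snappy (
  athelete STRING,
  age      INT,
  country  STRING,
  year     STRING,
  sport    STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Populate it from the staging table; Hive writes the ORC stripes during the insert.
INSERT OVERWRITE TABLE olympic_orc_snappy
SELECT * FROM olympic;
```

No ROW FORMAT clause is needed here, because ORC files carry their own column layout and do not rely on field delimiters.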
Avro file
• The Avro file format is one of the popular file formats in Hadoop based applications. Avro is an
Apache™ open source project that provides data serialization and data exchange services for Hadoop®.
These services can be used together or independently. Using Avro, big data can be exchanged between
programs written in any language.
What is Avro?
• Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the
father of Hadoop. Since Hadoop Writable classes lack language portability, Avro is quite helpful, as it
deals with data formats that can be processed by multiple languages. Avro is a preferred tool to
serialize data in Hadoop.
• Avro is schema-based. A language-independent schema is associated with its read and write
operations. Avro serializes data together with its schema into a compact binary format, which can be
deserialized by any application.
• Avro creates a self-describing file, the Avro Data File, in which it stores data along with its
schema in the metadata section so that files may be processed later by any program. If the program
reading the data expects a different schema, this can be easily resolved, since both schemas are present.
• Avro schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and
string) and complex types (record, enum, array, map, union, and fixed).
• An Avro data file is a compact binary format that is both compressible and splittable. Hence it can
be used efficiently as the input to Hadoop MapReduce jobs.
• Avro schemas are defined in JSON, which facilitates implementation in languages that already have
JSON libraries.
• We can create an Avro data file by using the --as-avrodatafile option during a
Sqoop import.

Creating a Hive Table

An Avro table can be created in two ways
• If we have an Avro schema file (i.e. a file with the extension .avsc), we can create the Avro table by
pointing 'avro.schema.url' to the schema file as below
CREATE EXTERNAL TABLE categories
STORED AS AVRO
LOCATION 'hdfs:///user/cloudera/sqoop_import/categories'
TBLPROPERTIES ('avro.schema.url'='hdfs:/avrodata/olympic.avsc');

• If we do not have an Avro schema file, we can still define the Avro schema while creating the table.
• This process starts with a JSON based schema, so that data is serialized in a format that has the
schema built in. Avro has its own parser that returns the provided schema as an object. The created
object allows us to create records with that schema.
• We can put the schema inside the table properties while creating a Hive table, as TBLPROPERTIES
('avro.schema.literal'='{json schema here}');
Now, let's create an Avro table for the olympic data.

create table olympic_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
"name": "my_record",
"type": "record",
"fields": [
{"name":"athelete", "type":"string"},
{"name":"age", "type":"int"},
{"name":"country", "type":"string"},
{"name":"year", "type":"string"},
{"name":"closing", "type":"string"},
{"name":"sport", "type":"string"}
]}');

• Inside tblproperties you can see the schema of the data. Every field listed in the schema
becomes a column in the olympic_avro table. Here, "name" defines the column name and "type" defines the
data type of that column.
• If you are using Hive 0.14.0 or later, you don't even need to mention ROW FORMAT SERDE, INPUTFORMAT
and OUTPUTFORMAT; STORED AS AVRO is enough.
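On Hive 0.14.0 or later the same table can be declared with ordinary Hive columns, and Hive derives the Avro schema from them. The sketch below uses a hypothetical table name so that it does not clash with olympic_avro.

```sql
-- Hive 0.14.0+ shorthand: no SerDe classes and no schema literal are written out.
-- olympic_avro_cols is an illustrative name.
CREATE TABLE olympic_avro_cols (
  athelete STRING,
  age      INT,
  country  STRING,
  year     STRING,
  closing  STRING,
  sport    STRING)
STORED AS AVRO;
```

This form is convenient when the schema is simple. A schema literal or schema URL remains useful when the Avro schema is maintained outside Hive.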
Data Insertion into Avro Table:
There are 2 methods by which data can be inserted into an Avro table:
1. If we have a file with the extension '.avro' and the schema of the file is the same as the one
specified for the table, then we can load the file directly using the command
LOAD DATA LOCAL INPATH 'PATH OF THE FILE' INTO TABLE olympic_avro;

2. We can copy the contents of a previously created table into the newly created Avro table. Let's take
a look at this second technique. We will begin by creating a table whose fields are delimited by tabs
and which is stored as a text file
Text File Table Creation:
CREATE TABLE Olympic_txt(
athelete STRING,
age INT,
country STRING,
year STRING,
closing STRING,
sport STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Now the text file data can simply be loaded into this table using the LOAD command.
The data in the Olympic_txt table can then be loaded into the Olympic_avro table above with a simple
INSERT command.
INSERT OVERWRITE TABLE Olympic_avro SELECT * FROM Olympic_txt;

Handling Null Value In Avro schema


An Avro schema does not accept null values by default. To make the Avro table work with NULL values, the
schema of the table needs to be changed as follows. Note that Avro expects the default of a field to
match the first type listed in its union.
create table olympic_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
"name": "my_record",
"type": "record",
"fields": [
{"name":"athelete", "type":["null","string"],"default":null},
{"name":"age", "type":["int","null"],"default":0},
{"name":"country", "type":["null","string"],"default":null},
{"name":"year", "type":["null","string"],"default":null},
{"name":"closing", "type":["null","string"],"default":null},
{"name":"sport", "type":["null","string"],"default":null}
]}');

In the type in above code, we have given two values, one is the data type and the other is null, which means
that if the value is the specified data type, it accepts the value, else, the value is NULL. It will consider the
default value which we have given with attribute default.Here we have given the default value as null for
string and 0(Zero) for int.
Converting Avro to JSON
Unlike a TEXTFILE, an Avro file is stored in a binary format, so we cannot read its content directly. To
inspect it, we need to convert it into JSON. For this we need to download a jar file called
avro-tools-1.7.5.jar, which provides a tojson option that converts an Avro file into JSON.
java -jar avro-tools-1.7.5.jar tojson 'avro file name' > newfilename.json

Parquet
 Parquet is an open source file format for Hadoop. Parquet stores nested data structures in a flat
columnar format. Compared to the traditional approach where data is stored in a row-oriented way,
Parquet is more efficient in terms of storage and performance.
 Parquet stores binary data in a column-oriented way, where the values of each column are organized
so that they are all adjacent, enabling better compression. It is especially good for queries which read
particular columns from a “wide” (with many columns) table, since only the needed columns are read and
I/O is minimized.
 When we process big data, storing it is costly (Hadoop stores data redundantly, i.e. 3 copies of each
file by default, to achieve fault tolerance), and processing it adds CPU, network I/O and other costs.
As the data grows, the cost of processing and storage grows with it. Parquet is a common choice for
big data because it serves both needs: it is efficient in both storage and processing.
Advantages of using Parquet
 There are several advantages to columnar formats.
 Organizing by column allows for better compression, as data is more homogeneous. The space
savings are very noticeable at the scale of a Hadoop cluster.
 I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data.
Better compression also reduces the bandwidth required to read the input.
 As we store data of the same type in each column, we can use encoding better suited to the modern
processors’ pipeline by making instruction branching more predictable.
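The compression advantage of a columnar layout can be illustrated with a small Python sketch. This is not Parquet code: the rows are hypothetical sample data, and run-length encoding is used only as a simple stand-in for the encodings a columnar format can apply to homogeneous data.

```python
from itertools import groupby

# Hypothetical sample rows: (athlete, country, year)
rows = [
    ("A", "US", "2008"),
    ("B", "US", "2008"),
    ("C", "US", "2012"),
    ("D", "IN", "2012"),
    ("E", "IN", "2012"),
]


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row-oriented layout: values of different columns are interleaved.
row_layout = [value for row in rows for value in row]

# Column-oriented layout: all values of one column are adjacent.
columns = list(zip(*rows))
country_column = columns[1]

print(len(run_length_encode(row_layout)))  # number of runs in row order
print(run_length_encode(country_column))   # runs within the country column
```

In row order no two neighbouring values are equal, so nothing collapses, whereas the country column shrinks to two runs. A query that needs only the country column also reads just that one tuple instead of every row.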
Creating table in hive to store parquet format:
 To use Parquet with Hive 0.10 – 0.12 you must download the Parquet Hive package from the Parquet
project. You want the parquet-hive-bundle jar in Maven Central.
 Native Parquet support was added in Hive 0.13.
CREATE TABLE Olympic_parquet(
athelete STRING,
age INT,
country STRING,
year STRING,
closing STRING,
sport STRING)
STORED AS PARQUET;

 We cannot load a text file directly into a Parquet table. We should first create a separate table to hold the
text file and then use the INSERT OVERWRITE command to write the data in Parquet format.
Let’s use the Olympic_txt table and load its data into Olympic_parquet.

INSERT OVERWRITE TABLE Olympic_parquet SELECT * FROM Olympic_txt;

HiveQL: Data Manipulation
Loading Data Into Hive Table
 Since Hive has no row-level insert, update, and delete operations, the only way to put data into a table is
to use one of the “bulk” load operations. Or you can just write files in the correct directories.
 Data can be loaded into a table from the local file system, from an HDFS path, or from another table.
Loading data from a File
 To load data from a local file we use the LOAD command as follows.
LOAD DATA LOCAL INPATH '/home/local/filesystem/path' INTO TABLE employee;

 The LOCAL keyword is mandatory here, as it tells Hive that the data is in the local file system. The data is
copied into the final location.
 To load data from an HDFS path we use the LOAD command as follows.
LOAD DATA INPATH '/home/hdfs/filesystem/path' INTO TABLE employee;
 If LOCAL is omitted, the path is assumed to be in the distributed filesystem. In this case, the data is moved
from the path to the final location. The rationale for this inconsistency is the assumption that you usually
don’t want duplicate copies of your data files in the distributed filesystem.
 Also, because files are moved in this case, Hive requires the source and target files and directories to be in
the same file system. For example, you can’t use LOAD DATA to load (move) data from one HDFS cluster to
another.
 Hive does not verify that the data you are loading matches the schema for the table. However, it will
verify that the file format matches the table definition.
 If you specify the OVERWRITE keyword, any data already present in the target directory will be deleted
first. Without the keyword, the new files are simply added to the target directory. However, if files
already exist in the target directory that match filenames being loaded, the old files are overwritten.
LOAD DATA LOCAL INPATH '/home/local/filesystem/path' OVERWRITE INTO TABLE
employees;

Loading data from Query


 The INSERT statement lets you load data into a table from a query.
INSERT OVERWRITE TABLE employees
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';

 With OVERWRITE, any previous contents of the partition (or whole table if not partitioned) are replaced.
 If you drop the keyword OVERWRITE or replace it with INTO, Hive appends the data rather than replaces
it. This feature is only available in Hive v0.8.0 or later.
Creating Tables and Loading Them in One Query
 You can also create a table and insert query results into it in one statement:
CREATE TABLE ca_employees
AS SELECT name, salary, address FROM employees se
WHERE se.state = 'CA';

 This table contains just the name, salary, and address columns from the employees table records for
employees in California. The schema for the new table is taken from the SELECT clause. This feature can’t
be used with external tables.

Exporting Data
Sometimes you need to export data out of a table and store it in a different directory. How do we get data
out of tables? If the data files are already formatted the way you want, then it’s simple enough to copy the
directories or files:
hadoop fs -cp <source_path> <target_path>

Otherwise, you can use INSERT … DIRECTORY …, as in this example:


INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address FROM employees se
WHERE se.state = 'CA';

 OVERWRITE and LOCAL have the same interpretations as before and paths are interpreted following the
usual rules. One or more files will be written to /tmp/ca_employees, depending on the number of
reducers invoked.
 We can look at the results from within the hive CLI:
hive> ! ls /tmp/ca_employees;
000000_0

 You can also specify multiple inserts to directories:


FROM staged_employees se
INSERT OVERWRITE DIRECTORY '/tmp/or_employees'
SELECT * WHERE se.cnty = 'US' and se.st = 'OR'
INSERT OVERWRITE DIRECTORY '/tmp/ca_employees'
SELECT * WHERE se.cnty = 'US' and se.st = 'CA'
INSERT OVERWRITE DIRECTORY '/tmp/il_employees'
SELECT * WHERE se.cnty = 'US' and se.st = 'IL';

Hive Data Model
Data modeling defines how data is organized inside a structure. In Hive, data is organized in three ways, i.e.
tables, partitions and buckets.
Table: Tables are analogous to tables in relational databases. As described before, tables can be filtered,
projected, joined and unioned. Additionally, all the data of a table is stored in a directory in HDFS. Hive also
supports the notion of external tables, wherein a table can be created on pre-existing files or directories in
HDFS by providing the appropriate location in the table creation DDL. The rows in a table are organized into
typed columns, similar to relational databases.
Now we will look at two new concepts, partitions and buckets.
Before going into any new concept, let’s consider a use case scenario for this module. Consider an order
table which has the following columns:
userid INT,
name STRING,
item STRING,
addres STRING,
city STRING,
state STRING,
zip STRING,
country STRING
The table holds the orders received by an e-commerce company over a period of time. It contains the
userid of the user who placed the order, the item ordered and the order address. The table may contain lakhs
of records, as it stores orders from all over the world, so there can be many users from many different countries.

Partitioning in Hive
Table partitioning means dividing table data into parts based on the values of particular columns,
like date or country, and segregating the input records into different files/directories based on date or country.
Partitioning can be done on more than one column, which imposes a multi-dimensional structure on the
directory storage. For example, in addition to partitioning log records by a date column, we can also
subdivide a single day’s records into separate country-wise files by including a country column in the
partitioning. We will see more about this in the examples.
Partitions are defined at the time of table creation using the PARTITIONED BY clause, with a list of
column definitions for partitioning.
Syntax
CREATE [EXTERNAL] TABLE table_name (col_name_1 data_type_1, ....)
PARTITIONED BY (col_name_n data_type_n [COMMENT col_comment], ...);

Advantages
Partitioning is used for distributing execution load horizontally.
As the data is stored in slices/parts, queries respond faster because they process only a small part of the
data instead of searching the entire data set.
For example, in a large user table that is partitioned by country, selecting the users of
country ‘IN’ scans just the one directory ‘country=IN’ instead of all the directories.
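The pruning idea can be sketched in a few lines of Python. This is only an illustration of the directory layout, not Hive code: the function names partition_dir and prune, the table location and the partition list are all hypothetical.

```python
def partition_dir(table_dir, spec):
    """Build the directory for a partition, e.g. .../country=IN/state=MH."""
    parts = [f"{key}={value}" for key, value in spec]
    return "/".join([table_dir] + parts)


def prune(partitions, **filters):
    """Keep only the partitions whose key values match every filter."""
    return [
        spec for spec in partitions
        if all(dict(spec).get(key) == value for key, value in filters.items())
    ]


# Hypothetical partitions of a table partitioned by (country, state).
partitions = [
    (("country", "IN"), ("state", "MH")),
    (("country", "IN"), ("state", "OD")),
    (("country", "US"), ("state", "CA")),
]

table_dir = "/user/hive/warehouse/users"  # hypothetical table location
for spec in prune(partitions, country="IN"):
    print(partition_dir(table_dir, spec))
```

A filter on country='IN' leaves only the two directories under country=IN to be scanned; the country=US directory is never read.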
Limitations
Having too many partitions in a table creates a large number of files and directories in HDFS, which is an
overhead for the NameNode, since it must keep all metadata for the file system in memory.
Partitions may optimize some queries based on WHERE clauses, but may be less responsive for other
important queries on grouping clauses.
In MapReduce processing, a huge number of partitions leads to a huge number of tasks (each of which runs in a
separate JVM) in each MapReduce job, which creates a lot of overhead in JVM start up and tear
down. For small files, a separate task is used for each file. In the worst scenarios, the overhead of JVM
start up and tear down can exceed the actual processing time.
Creation of Partition Table
Managed Partitioned Table
Below is the HiveQL to create the managed partitioned table mn_partitioned_order as per the above requirements.

CREATE TABLE mn_partitioned_order(
name STRING,
item STRING,
addres STRING,
city STRING,
zip STRING )
PARTITIONED BY (country STRING, state STRING)
STORED AS TEXTFILE;

Here we partition the data based on country and state.

Note that we did not include the country and state columns in the table column definition but included them
only in the partition definition. If we include them in both, Hive reports an error. We can verify the partition
columns of the table with the help of the command below.
DESCRIBE FORMATTED mn_partitioned_order;

The partition columns country and state can be used in the WHERE clause of query statements and can be
treated as regular column names, even though there is no actual column for them inside the input file data.
External Partitioned Tables
We can create external partitioned tables as well, just by using the EXTERNAL keyword in the CREATE
statement. For external partitioned tables we do not need to specify a LOCATION clause in the CREATE
statement, as we specify the location of each partition separately when adding data to the table.
CREATE EXTERNAL TABLE ex_partitioned_order(
name STRING,
item STRING,
addres STRING,
city STRING,
zip STRING )
PARTITIONED BY (country STRING, state STRING)
STORED AS TEXTFILE;

Inserting Data Into Partitioned Tables


Data insertion into partitioned tables can be done in two modes.
1. Static Partitioning
2. Dynamic Partitioning
1. Static Partitioning in Hive
In this mode, the input data should contain only the columns listed in the table definition (for example, name,
item, addres, city, zip) and not the columns defined in the PARTITIONED BY clause (country and state).
If our input column layout matches the expected layout and we already have separate input files for
each partition key value pair, i.e. one separate file for each combination of country and state values
(for example country=Ind and state=MH), then these files can be loaded directly into partitioned tables with
the syntax below.

Loading Data into Managed Partitioned Table from Local FS using Static Partition
Consider that our local file system has an inputdir which contains the data state-wise for different countries
in different directories, i.e. ‘inputdir/Ind/MH’ holds data for the state Maharashtra (MH) and the country
India (Ind), and ‘inputdir/Ind/OD’ holds data for the state Odisha (OD) and the country India (Ind).
Consider a sample file staticinput.txt which contains orders made by users from the state Maharashtra (MH) and
the country India (Ind) only. This data does not contain the country and state columns.

Now we will load this data into the managed partitioned table above.
LOAD DATA LOCAL INPATH '/home/himanshu/inputdir/Ind/MH/staticinput.txt'
INTO TABLE mn_partitioned_order
PARTITION (country = 'Ind', state = 'MH');

This creates a separate directory for the partition under the default warehouse directory in HDFS.
/user/hive/warehouse/mn_partitioned_order/country=Ind/state=MH/

The other partitions have to be added in the same way, and each creates a corresponding directory in HDFS.
Alternatively, we can load the entire directory into the Hive table with a single command and then add a
partition for each file with the ALTER command.
LOAD DATA LOCAL INPATH '/home/himanshu/inputdir' INTO TABLE mn_partitioned_order;
ALTER TABLE mn_partitioned_order ADD IF NOT EXISTS
PARTITION (country = 'Ind', state = 'OD')
LOCATION '/user/hive/warehouse/mn_partitioned_order/country=Ind/state=OD/'
PARTITION (country = 'Ind', state = 'KA')
LOCATION '/user/hive/warehouse/mn_partitioned_order/country=Ind/state=KA/'
PARTITION (country = 'Ind', state = 'TN')
LOCATION '/user/hive/warehouse/mn_partitioned_order/country=Ind/state=TN/';

This creates the separate directories listed below under the default warehouse directory in HDFS. Multiple
partitions can be added in the same statement when using Hive v0.8.0 and later.
/user/hive/warehouse/mn_partitioned_order/country=Ind/state=OD/
/user/hive/warehouse/mn_partitioned_order/country=Ind/state=KA/
/user/hive/warehouse/mn_partitioned_order/country=Ind/state=TN/

Loading Data into Managed Partitioned Table From Another Table
Consider that we have another table named temp_order, defined as follows.
CREATE TABLE temp_order(
userid INT,
name STRING,
item STRING,
addres STRING,
city STRING,
state STRING,
zip STRING,
country STRING );

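We can load or add partitions with query results from another table as shown below. Only the non-partition
columns are selected, because the partition values are given in the PARTITION clause.
hive> INSERT OVERWRITE TABLE mn_partitioned_order
PARTITION (country = 'US', state = 'AL')
SELECT te.name, te.item, te.addres, te.city, te.zip FROM temp_order te
WHERE te.country = 'US' AND te.state = 'AL';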

Overwriting Existing Partition


We can overwrite an existing partition by using the OVERWRITE keyword, i.e. LOAD DATA … OVERWRITE INTO TABLE …
PARTITION (…) or INSERT OVERWRITE TABLE … PARTITION (…).

Loading Data into External Partitioned Table From HDFS using Static Partition
There is an alternative for bulk loading of partitions into a Hive table. As the data is already present in HDFS
and only needs to be made accessible to Hive, we just specify the location of the HDFS files for each partition.
If our files are on the local FS, they can be moved to a directory in HDFS using -put or -copyFromLocal, and we
can add a partition for each file in that directory with commands similar to the one below.
hive> ALTER TABLE ex_partitioned_order
ADD PARTITION (country = 'Ind', state = 'OD')
LOCATION '/hive/external/tables/user/country=Ind/state=OD/';

As the table is an external table, we are not loading data into the table; rather, we are linking the location
of the files to a partition of the given table. We need to repeat the ALTER command above for every partition
directory, so that a metadata entry mapping the partition to the table is created in the metastore.
2. Dynamic Partitioning in Hive
Instead of loading each partition with a separate SQL statement as shown above, which means writing a
lot of SQL statements for a large number of partitions, Hive supports dynamic partitioning, with which we can
add any number of partitions with a single SQL statement. Hive automatically splits the data into
separate partition files based on the values of the partition keys present in the input data.
This gives the advantages of easier coding and no need to identify partitions manually. Dynamic
partitioning suits our example requirement on the order records described above.
Dynamic partitioning is suitable when a large amount of data is stored in an unpartitioned table.
Dynamic partitioning usually loads the data from a non-partitioned table.
Dynamic partitioning takes more time to load data than static partitioning.
Dynamic partitioning is also suitable when you want to partition by a column but do not know in advance how
many distinct values (and therefore partitions) it has.
Before using dynamic partitioning we have to consider the following parameters.
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
<description>Whether or not to allow dynamic partitions in DML/DDL. </description>
</property>
By default hive.exec.dynamic.partition is set to false.
This parameter allows to run dynamic partition .
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
<description>
In strict mode, the user must specify at least one static partition in case
the user accidentally overwrites all partitions. In nonstrict mode all
partitions are allowed to be dynamic.
</description>
</property>
By default, hive.exec.dynamic.partition.mode is set to strict. In strict mode, an INSERT must specify at
least one static partition key, which guards against accidentally overwriting all partitions. You can set the
mode to “nonstrict,” as above, to allow all partition keys to be dynamic.
<property>
<name>hive.exec.max.dynamic.partitions</name>
<value>1000</value>
<description>Maximum number of dynamic partitions allowed to be created in
total.
</description>
</property>

This parameter sets the maximum number of partitions that can be created; by default it is 1000. The
practical maximum depends on the cluster hardware configuration.
hive.exec.max.dynamic.partitions (default value being 1000) is the total number of dynamic
partitions that can be created by one DML statement. If no single mapper/reducer exceeds its own limit but
the total number of dynamic partitions does, an exception is raised at the end of the job, before the
intermediate data are moved to the final destination.
<property>
<name>hive.exec.max.dynamic.partitions.pernode</name>
<value>1000</value>
<description> Maximum number of dynamic partitions allowed to be created in
each mapper/reducer node.
</description>
</property>

hive.exec.max.dynamic.partitions.pernode (default value being 100) is the maximum number of dynamic
partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than
this threshold, a fatal error is raised from the mapper/reducer (through a counter) and the whole
job is killed.
We can set these through hive shell with below commands,
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;

Strict Mode in Hive


In mapreduce strict mode (hive.mapred.mode=strict) , some risky queries are not allowed to run. They
include:
1. Cartesian Product.
2. No partition being picked up for a query.
3. Comparing bigints and strings.
4. Comparing bigints and doubles.
5. Orderby without limit.
According to points 2 and 5, in strict mode we cannot run SELECT statements on partitioned tables without at
least one partition key filter (like WHERE country='US'), nor use an ORDER BY clause without a LIMIT
condition. By default, however, this property is set to nonstrict.
For dynamic partition loading we do not provide values for the partition keys in the PARTITION clause, as
shown below for the previously seen query.
We do not have to mark the partition columns separately in the SELECT clause: Hive takes the last two
columns of the SELECT list as the partition columns, in the order given in the PARTITION clause. The data
types of those last two columns should therefore match the data types of the partition columns of the target table.
hive> INSERT INTO TABLE mn_partitioned_order
PARTITION (country, state)
SELECT name, item, addres, city, zip, country, state
FROM temp_order;

NOTE
When inserting data into a partition, it’s necessary to include the partition columns as the last columns in
the query. The column names in the source query don’t need to match the partition column names, but
they really do need to be last.
We can also mix dynamic and static partitions by specifying PARTITION(country = 'US', state). The
static partition keys must come before the dynamic partition keys.
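The way a dynamic-partition insert routes rows can be pictured with a short Python sketch. It is only a model of the behaviour described above, not Hive's implementation: the function name route_rows and the sample rows are hypothetical.

```python
from collections import defaultdict


def route_rows(rows, partition_keys):
    """Group rows by their trailing partition values.

    The last len(partition_keys) values of each row select the target
    directory and are not stored in the data file itself.
    """
    n = len(partition_keys)
    routed = defaultdict(list)
    for row in rows:
        data, part_values = row[:-n], row[-n:]
        path = "/".join(f"{k}={v}" for k, v in zip(partition_keys, part_values))
        routed[path].append(data)
    return dict(routed)


# Hypothetical rows in SELECT order: name, item, addres, city, zip, country, state
rows = [
    ("Asha", "book", "12 MG Road", "Mumbai", "400001", "Ind", "MH"),
    ("Ravi", "pen", "4 Lake View", "Cuttack", "753001", "Ind", "OD"),
    ("Meera", "lamp", "9 Hill St", "Pune", "411001", "Ind", "MH"),
]

result = route_rows(rows, ("country", "state"))
for path in sorted(result):
    print(path, len(result[path]))
```

Because the partition values are taken purely by position, putting state before country in the SELECT list would silently create directories such as country=MH/state=Ind, which is why the column order matters more than the column names.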
Show Partitions
We can see the partitions of a partitioned table with SHOW command as shown below.
hive> SHOW PARTITIONS partitioned_user;

If we have a lot of partitions and want to see partitions for particular partition keys, we can further
restrict the command with an optional PARTITION clause that specifies one or more of the partitions with
specific values.
hive> SHOW PARTITIONS partitioned_user PARTITION(country='US');

Describe partitions
As we already know how to see the descriptions of tables, Now we can see the descriptions of each partition
with commands similar to below.
hive> DESCRIBE FORMATTED partitioned_user PARTITION(country='US', state='CA');

Alter Partitions
We can alter/change partitions (add/change/drop) with the help of below commands.
Adding Partitions
We can add partitions to an existing table with ADD PARTITION clause as shown below.
ALTER TABLE partitioned_user ADD IF NOT EXISTS
PARTITION (country = 'US', state = 'XY') LOCATION '/hdfs/external/file/path1'
PARTITION (country = 'CA', state = 'YZ') LOCATION '/hdfs/external/file/path2'
PARTITION (country = 'UK', state = 'ZX') LOCATION '/hdfs/external/file/path3';

Changing Partitions
We can change a partition location with commands like below. This command does not move the data from
the old location and does not delete the old data but the reference to old data file will be lost.
ALTER TABLE partitioned_user PARTITION (country='US', state='CA')
SET LOCATION '/hdfs/partition/newpath';

Drop Partitions
We can drop partitions of a table with DROP IF EXISTS PARTITION clause as shown below.
ALTER TABLE partitioned_user DROP IF EXISTS PARTITION(country='US', state='CA');

Archive Partition
The ARCHIVE PARTITION clause captures the partition files into a Hadoop archive (HAR) file. This only
reduces the number of files in the filesystem, reducing the load on the NameNode, but doesn’t provide any
space savings.
ALTER TABLE log_messages ARCHIVE PARTITION(country='US',state='XZ');

We can unarchive these with the UNARCHIVE PARTITION clause.


Protecting a partition from being dropped and queried
The following statements prevent the partition from being dropped and from being queried, respectively.
ALTER TABLE partitioned_user PARTITION(country='US',state='XY') ENABLE NO_DROP;
ALTER TABLE partitioned_user PARTITION(country='US',state='XY') ENABLE OFFLINE;

Bucketing in Hive
Partitioning in Hive offers a way of segregating table data into multiple
files/directories. But partitioning gives effective results only when
 there is a limited number of partitions, and
 the partitions are of comparatively equal size.
But this may not be possible in all scenarios. For example, when we partition our tables based on
geographic locations like country, some bigger countries will have large partitions (e.g. 4-5 countries
alone contributing 70-80% of the total data), whereas the data of small countries will create small
partitions (all the remaining countries in the world may contribute just 20-30% of the total data). In
these cases partitioning is not ideal.
To overcome the problem of over-partitioning, Hive provides the bucketing concept, another
technique for decomposing table data sets into more manageable parts.
Features
Bucketing is based on (hash function of the bucketed column) mod (total number
of buckets). The hash function depends on the type of the bucketing column.
Records with the same value in the bucketed column are always stored in the same bucket.
We use the CLUSTERED BY clause to divide the table into buckets.
Physically, each bucket is just a file in the table directory, and bucket numbering is 1-based.
Bucketing can be done along with partitioning on Hive tables, and even without partitioning.
Bucketed tables create almost equally distributed data file parts.
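The "hash mod number of buckets" rule can be sketched in Python. This is an illustration under stated assumptions, not Hive's actual code: it assumes a Java-style string hash for a STRING column, whereas the real hash depends on the column type and Hive version, and the function names are hypothetical. The index returned here is zero-based.

```python
def java_string_hash(text):
    """Java-style String.hashCode, wrapped to a signed 32-bit integer."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(value, num_buckets):
    """Zero-based bucket index: (hash made non-negative) mod num_buckets."""
    return (java_string_hash(value) & 0x7FFFFFFF) % num_buckets


# The same column value always lands in the same bucket.
for state in ["CA", "NL", "CA"]:
    print(state, bucket_for(state, 32))
```

Since the bucket depends only on the column value, every row with state 'CA' ends up in one file, and a join or sample on that column needs to read only the matching bucket.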
Advantages
Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can try
out queries on a fraction of the data for testing and debugging purposes when the original data sets
are very large.
As the data files are equal sized parts, map-side joins are faster on bucketed tables than on
non-bucketed tables. In a map-side join, a mapper processing a bucket of the left table knows that the
matching rows in the right table are in its corresponding bucket, so it only retrieves that
bucket (which is a small fraction of all the data stored in the right table).
Similar to partitioning, bucketed tables provide faster query responses than non-bucketed
tables.
Bucketing also provides the flexibility to keep the records in each bucket sorted by
one or more columns. This makes map-side joins even more efficient, since the join of each
bucket becomes an efficient merge-sort.
Limitations
Specifying bucketing doesn’t ensure that the table is properly populated. Loading data into
buckets needs to be handled by ourselves.

Creating bucketed Table
We can create bucketed tables with the help of the mandatory CLUSTERED BY clause and the optional
SORTED BY clause in the CREATE TABLE statement. We can also create a bucketed table without using the
PARTITIONED BY clause. With the help of the HiveQL below we can create the bucket_order table for the
order data described above.
CREATE EXTERNAL TABLE bucket_order(
name STRING,
item STRING,
addres STRING,
city STRING,
state STRING,
zip STRING )
PARTITIONED BY (country STRING)
CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
STORED AS TEXTFILE;

Unlike partition columns (which are not included in the table column definition), bucketing
columns must be included in the table definition, as shown in the code above for the state and city
columns. For this reason the table is partitioned by country only, and state is kept as a regular column.
The INTO ... BUCKETS clause defines how many buckets will be created.
The CLUSTERED BY () clause defines the column on which the table is bucketed.
SORTED BY () is an optional clause. When it is given, the data in each bucket is sorted on the given column.

Inserting data Into Bucketed Tables

Unlike partitioned tables, bucketed tables cannot be populated correctly with the LOAD DATA (LOCAL)
INPATH command, because LOAD DATA only copies or moves files and does not distribute rows into
buckets. Instead we use INSERT OVERWRITE TABLE … SELECT … FROM another table. For this, we create a
staging table in Hive with all the columns of the input file, and from that table we copy into the target
bucketed table. Here we use the temp_order table created earlier; below is the HiveQL for
populating the bucketed table from it.
To populate the bucketed table, we need to set the property hive.enforce.bucketing = true, so
that Hive knows to create the number of buckets declared in the table definition.
set hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE bucket_order PARTITION (country)
SELECT name,
item,
addres,
city,
state,
zip,
country
FROM temp_order;
The property hive.enforce.bucketing = true plays a role similar to that of the
hive.exec.dynamic.partition = true property in partitioning. By setting this property we enable
dynamic bucketing while loading data into the Hive table.
Hive then automatically sets the number of reduce tasks equal to the number of buckets
given in the table definition (32 in our case) and automatically picks the
CLUSTERED BY column from the table definition.
If we do not set this property in the Hive session, we have to convey the same information to
Hive manually: the number of reduce tasks to run (in our case, by using set
mapred.reduce.tasks=32) and DISTRIBUTE BY state SORT BY city clauses at the end of the
INSERT … SELECT statement above.
Table Sampling in Hive
Table sampling in Hive is the extraction of a small fraction of data from the original large
data set. It is similar to the LIMIT operator in Hive, but there is a difference between LIMIT and
TABLESAMPLE:
In many cases a LIMIT clause executes the entire query and then returns only a limited number of
results, whereas sampling selects only a portion of the data to run the query on.
Now we will sample a bucketed table to see the performance difference between
bucketed and non-bucketed tables. Let’s pull the records present in the last bucket of
a bucketed_user table, a table of user records bucketed into 32 buckets on state in the same way as above.

hive> SELECT firstname, country, state, city FROM bucketed_user
> TABLESAMPLE(BUCKET 32 OUT OF 32 ON state);
OK
Carman CA NL St. Johns
Chuck CA NL St. Johns
Kristal CA NL Paradise
Micah CA NL St. Johns
..............................
Man UK Greater London Mildmay Ward
Lovetta UK Greater London High Barnet Ward
Evette UK Leicester Stoneygate Ward
Eulah UK Greater London Bunhill Ward
Selene UK Greater London West Wickham Ward
Kenda UK Greater London Custom House Ward
...........................
Abraham UK Greater London Aldborough Ward
Dustin UK Greater London Brockley Ward
Craig UK Greater London East Putney Ward
................
Lindsey US CA Ontario
Justine US CA Pomona
Tarra US CA San Francisco
Kiley US CA Los Angeles
........................
Dorothy US CA San Diego
Refugia US CA Hayward
Time taken: 0.89 seconds, Fetched: 129 row(s)
hive>
In the sample above we can see records from various countries, covering many
states and cities. If we instead use the LIMIT operator on a non-bucketed table, it returns the first
129 records it reads, which here all come from a single country, so we do not get a sample that is
evenly distributed across countries and states. This can be seen in the output below.

hive> SELECT firstname, country, state, city FROM temp_user LIMIT 129 ;
OK
first_name country state city
Rebbecca AU TA Leith
Stevie AU QL Proston
Mariko AU WA Hamel
Gerardo AU NS Talmalmo
Mayra AU NS Lane Cove
Idella AU WA Cartmeticup
Sherill AU WA Nyamup
Ena AU NS Bendick Murrell

Vince AU QL Purrawunda
Theron AU SA Blanchetown
Amira AU QL Rockside
...............
Louann AU QL Wyandra
William AU QL Goondi Hill
Time taken: 0.232 seconds, Fetched: 129 row(s)

We can also perform random sampling in Hive with the syntax below.

hive> SELECT firstname, country, state, city FROM bucketed_user TABLESAMPLE(1 PERCENT);

HiveQL: Queries
Before going deep into the topic, let’s consider a sample data set for this module. Here employees is a
partitioned managed table that holds various data about each employee, such as name, salary,
subordinates, tax deductions, and address.

CREATE TABLE employees (


name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> )
PARTITIONED BY (country STRING, state STRING);
The queries below show the data it contains.

SELECT … FROM Clauses


SELECT is the projection operator in SQL. The FROM clause identifies from which table, view, or nested
query we select records. For a given record, SELECT specifies the columns to keep, as well as the outputs
of function calls on one or more columns.
Here are queries of this table and the output they produce:
hive> SELECT name, salary FROM employees;
John Doe 100000.0
Mary Smith 80000.0
Todd Jones 70000.0
Bill King 60000.0

The following two queries are identical. The second version uses a table alias e, which is not very useful
in this query, but becomes necessary in queries with JOINs
hive> SELECT name, salary FROM employees;
hive> SELECT e.name, e.salary FROM employees e;

When you select columns that are one of the collection types, Hive uses JSON (JavaScript Object Notation)
syntax for the output. First, let’s select the subordinates, an ARRAY, where a comma-separated list
surrounded with […] is used. Note that STRING elements of the collection are quoted, while the primitive
STRING name column is not:
hive> SELECT name, subordinates FROM employees;
John Doe ["Mary Smith","Todd Jones"]
Mary Smith ["Bill King"]
Todd Jones []
Bill King []

The deductions is a MAP, where the JSON representation for maps is used, namely a comma-separated list
of key:value pairs, surrounded with {...}:
hive> SELECT name, deductions FROM employees;
John Doe {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Finally, the address is a STRUCT, which is also written using the JSON map format:
hive> SELECT name, address FROM employees;
John Doe {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Bill King {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}

ARRAY indexing is 0-based, as in Java. Here is a query that selects the first element of the subordinates
array:
hive> SELECT name, subordinates[0] FROM employees;
John Doe Mary Smith
Mary Smith Bill King
Todd Jones NULL
Bill King NULL

Note that referencing a nonexistent element returns NULL. Also, the extracted STRING values are no
longer quoted! To reference a MAP element, you also use ARRAY[...] syntax, but with key values instead of
integer indices:
hive> SELECT name, deductions["State Taxes"] FROM employees;
John Doe 0.05
Mary Smith 0.05
Todd Jones 0.03
Bill King 0.03

Finally, to reference an element in a STRUCT, you use “dot” notation, similar to the table_alias.column
mentioned above:
hive> SELECT name, address.city FROM employees;
John Doe Chicago
Mary Smith Chicago
Todd Jones Oak Park
Bill King Obscuria
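
The access rules above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not Hive code: the helper names array_get and map_get are made up for the example, and the row reuses the John Doe values from the outputs above.

```python
# Plain-Python model of Hive's collection access rules (not Hive code).
row = {
    "name": "John Doe",
    "subordinates": ["Mary Smith", "Todd Jones"],                                 # ARRAY<STRING>
    "deductions": {"Federal Taxes": 0.2, "State Taxes": 0.05, "Insurance": 0.1},  # MAP<STRING, FLOAT>
    "address": {"street": "1 Michigan Ave.", "city": "Chicago",
                "state": "IL", "zip": 60600},                                     # STRUCT
}


def array_get(arr, index):
    """Mimic subordinates[index]: 0-based, None (NULL) when the element does not exist."""
    return arr[index] if 0 <= index < len(arr) else None


def map_get(mapping, key):
    """Mimic deductions["key"]: None (NULL) when the key is absent."""
    return mapping.get(key)


first_sub = array_get(row["subordinates"], 0)      # subordinates[0]
missing_sub = array_get(row["subordinates"], 5)    # nonexistent element
state_tax = map_get(row["deductions"], "State Taxes")
city = row["address"]["city"]                      # address.city
```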

Specify Columns with Regular Expressions


We can even use regular expressions to select the columns we want.
hive> SELECT symbol, `price.*` FROM stocks;
AAPL 195.69 197.88 194.0 194.12 194.12
AAPL 192.63 196.0 190.85 195.46 195.46
AAPL 196.73 198.37 191.57 192.05 192.05
AAPL 195.17 200.2 194.42 199.23 199.23
AAPL 195.91 196.32 193.38 195.86 195.86

Column Aliases
When a table has long column names, or when a result column is computed from an expression, writing out
the full name or expression every time becomes tedious. So it’s sometimes useful to give those otherwise
anonymous columns a name, called a column alias.

SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
FROM employees LIMIT 2;
JOHN DOE 100000.0 0.2 80000
MARY SMITH 80000.0 0.2 64000

Nested SELECT Statements
The column alias feature is especially useful in nested select statements. Let’s use the previous example as
a nested query:
hive> FROM (
SELECT upper(name) AS name, salary, deductions["Federal Taxes"] as fed_taxes,
round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
FROM employees
) e
SELECT e.name, e.salary_minus_fed_taxes
WHERE e.salary_minus_fed_taxes > 70000;
JOHN DOE 80000

The previous result set is aliased as e, from which we perform a second query to select the name and the
salary_minus_fed_taxes, where the latter is greater than 70,000.

CASE … WHEN … THEN Statements


The CASE … WHEN … THEN clauses are like if statements for individual columns in query results. For
example:
hive> SELECT name, salary,
CASE
WHEN salary < 50000.0 THEN 'low'
WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
ELSE 'very high'
END AS bracket FROM employees;

John Doe 100000.0 very high
Mary Smith 80000.0 high
Todd Jones 70000.0 high
Bill King 60000.0 middle
Boss Man 200000.0 very high
Fred Finance 150000.0 very high
Stacy Accountant 60000.0 middle

When Hive Can Avoid MapReduce


Hive implements some kinds of queries without using MapReduce, in so-called local mode, for example:
SELECT * FROM employees;
In this case, Hive can simply read the records from employees and dump the formatted output to the
console. This even works for WHERE clauses that only filter on partition keys, with or without LIMIT
clauses:
SELECT * FROM employees WHERE country = 'US' AND state = 'CA' LIMIT 100;

Furthermore, Hive will attempt to run other operations in local mode if the hive.exec.mode.local.auto
property is set to true:
set hive.exec.mode.local.auto=true;
Otherwise, Hive uses MapReduce to run all other queries.
LIMIT Clause
The results of a typical query can return a large number of rows. The LIMIT clause puts an upper limit on
the number of rows returned:
hive> SELECT upper(name), salary, deductions["Federal Taxes"],
round(salary * (1 - deductions["Federal Taxes"])) FROM employees
LIMIT 2;
JOHN DOE 100000.0 0.2 80000
MARY SMITH 80000.0 0.2 64000

WHERE Clauses
While SELECT clauses select columns, WHERE clauses are filters; they select which records to return.
WHERE clauses use predicate expressions, applying predicate operators. Several predicate expressions
can be joined with AND and OR clauses. When the predicate expressions evaluate to true, the
corresponding rows are retained in the output.
SELECT * FROM employees WHERE country = 'US' AND state = 'CA';

Predicates can also contain expressions that involve some computation:
hive> SELECT name, salary, deductions["Federal Taxes"],
salary * (1 - deductions["Federal Taxes"])
FROM employees
WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;

John Doe 100000.0 0.2 80000.0

We can’t reference column aliases in the WHERE clause. Let’s rewrite the query above with an alias to see what happens:
hive> SELECT name, salary, deductions["Federal Taxes"],
salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
FROM employees
WHERE round(salary_minus_fed_taxes) > 70000;

FAILED: Error in semantic analysis: Line 4:13 Invalid table alias or
column reference 'salary_minus_fed_taxes': (possible column names are:
name, salary, subordinates, deductions, address)

However, we can use a nested SELECT statement:

hive> SELECT e.* FROM
(SELECT name, salary, deductions["Federal Taxes"] as ded,
salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
FROM employees) e
WHERE round(e.salary_minus_fed_taxes) > 70000;
John Doe 100000.0 0.2 80000.0
Boss Man 200000.0 0.3 140000.0
Fred Finance 150000.0 0.3 105000.0

GROUP BY Clauses
The GROUP BY statement is often used in conjunction with aggregate functions to group the result set by
one or more columns and then perform an aggregation over each group.
SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
GROUP BY year(ymd);

Note:
When using GROUP BY, make sure that every column in the SELECT list is either wrapped in an aggregate
function or listed in the GROUP BY clause.

HAVING Clauses
The HAVING clause lets you constrain the groups produced by GROUP BY in a way that could be
expressed with a subquery, using a syntax that’s easier to express.
Here’s the previous query with an additional HAVING clause that limits the results to years where the
average closing price was greater than $50.0:
hive> SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
GROUP BY year(ymd)
HAVING avg(price_close) > 50.0;

Without the HAVING clause, this query would require a nested SELECT statement:
hive> SELECT s2.year, s2.avg FROM
(SELECT year(ymd) AS year, avg(price_close) AS avg FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
GROUP BY year(ymd)) s2
WHERE s2.avg > 50.0;
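
To make the order of evaluation concrete, here is a small plain-Python model of GROUP BY followed by HAVING. It is a sketch only: the (ymd, price_close) rows are made up for illustration and are not real stock data.

```python
from collections import defaultdict

# Made-up (ymd, price_close) rows, for illustration only.
rows = [("2009-01-02", 90.0), ("2009-06-01", 110.0),
        ("2003-03-03", 10.0), ("2003-09-09", 12.0)]

# GROUP BY year(ymd): collect the closing prices for each year.
groups = defaultdict(list)
for ymd, price_close in rows:
    groups[int(ymd[:4])].append(price_close)

# avg(price_close) is computed once per group.
averages = {year: sum(prices) / len(prices) for year, prices in groups.items()}

# HAVING avg(price_close) > 50.0 filters whole groups, after aggregation.
having = {year: avg for year, avg in averages.items() if avg > 50.0}
```

The key point is that HAVING sees the aggregated value for each group, whereas WHERE is applied to individual rows before grouping.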

ORDER BY Clause
The ORDER BY clause performs a total ordering of the query result set. This means that all the data is
passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.
You can specify any columns you wish and specify whether or not the columns are ascending using the
ASC keyword (the default) or descending using the DESC keyword.
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
ORDER BY s.ymd ASC, s.symbol DESC;

Because ORDER BY can result in excessively long run times, Hive will require a LIMIT clause with ORDER
BY if the property hive.mapred.mode is set to strict. By default, it is set to nonstrict.

SORT BY Clause
SORT BY orders the data only within each reducer, thereby performing a local ordering, where each
reducer’s output will be sorted. Total ordering is traded for better performance.
You can specify any columns you wish and specify whether or not the columns are ascending using the
ASC keyword (the default) or descending using the DESC keyword.
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
SORT BY s.ymd ASC, s.symbol DESC;

As more than one reducer is invoked, the output will be sorted differently than ORDER BY. While each
reducer’s output files will be sorted, the data will probably overlap with the output of other reducers.
By default, MapReduce computes a hash on the keys output by mappers and tries to evenly distribute the
key-value pairs among the available reducers using the hash values. Unfortunately, this means that when
we use SORT BY, the contents of one reducer’s output will overlap significantly with the output of the
other reducers, as far as sorted order is concerned, even though the data is sorted within each reducer’s
output.
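
The difference between the two clauses can be modeled in plain Python. This is a sketch under stated assumptions, not Hive code: rows are assigned to reducers by a CRC32 hash of the sort key (a deterministic stand-in for the partitioner), and the (ymd, symbol) rows are made up.

```python
import zlib


def reducer_for(key, num_reducers):
    # Deterministic stand-in for the hash partitioner (an assumption of this sketch).
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def sort_by(rows, sort_key, num_reducers):
    """Model of SORT BY: rows are hashed to reducers, then each reducer sorts locally."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[reducer_for(sort_key(row), num_reducers)].append(row)
    return [sorted(part, key=sort_key) for part in reducers]


def order_by(rows, sort_key):
    """Model of ORDER BY: a single reducer sorts everything."""
    return sorted(rows, key=sort_key)


def by_ymd(row):
    return row[0]


# Made-up (ymd, symbol) rows, for illustration only.
rows = [("2010-01-07", "IBM"), ("2010-01-04", "AAPL"), ("2010-01-06", "GE"),
        ("2010-01-05", "AAPL"), ("2010-01-04", "IBM"), ("2010-01-05", "GE")]

local = sort_by(rows, by_ymd, num_reducers=3)   # one sorted list per reducer
total = order_by(rows, by_ymd)                  # one globally sorted list
```

Each list in local is sorted on its own, but the date ranges of different lists can overlap, which is exactly the behavior described above.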

DISTRIBUTE BY Clause
DISTRIBUTE BY controls how map output is divided among reducers. All data that flows through a
MapReduce job is organized into key-value pairs. Hive must use this feature internally when it converts
your queries to MapReduce jobs.
As described above, with SORT BY alone the output of one reducer will probably overlap with the output of
other reducers. We can use DISTRIBUTE BY first to ensure that rows with the same key from the mapper output
go to the same reducer, then use SORT BY to order the data the way we want.

SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC;

DISTRIBUTE BY works similar to GROUP BY in the sense that it controls how reducers receive rows for
processing, while SORT BY controls the sorting of data inside the reducer. Note that Hive requires that the
DISTRIBUTE BY clause come before the SORT BY clause.
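
A plain-Python model of DISTRIBUTE BY followed by SORT BY is sketched below. It is illustrative only: the CRC32 hash stands in for the partitioner, the function name distribute_sort is invented for the example, and the rows (including the prices) are made up.

```python
import zlib


def distribute_sort(rows, distribute_key, sort_key, num_reducers):
    """DISTRIBUTE BY picks the reducer for each row; SORT BY orders rows inside each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        idx = zlib.crc32(str(distribute_key(row)).encode("utf-8")) % num_reducers
        reducers[idx].append(row)
    return [sorted(part, key=sort_key) for part in reducers]


# Made-up (ymd, symbol, price_close) rows, for illustration only.
rows = [("2010-01-05", "IBM", 126.0), ("2010-01-04", "AAPL", 214.0),
        ("2010-01-04", "IBM", 127.0), ("2010-01-05", "AAPL", 215.0),
        ("2010-01-04", "GE", 15.0), ("2010-01-05", "GE", 16.0)]

# DISTRIBUTE BY s.symbol SORT BY s.symbol ASC, s.ymd ASC
outputs = distribute_sort(rows,
                          distribute_key=lambda r: r[1],
                          sort_key=lambda r: (r[1], r[0]),
                          num_reducers=2)
```

Because the reducer is chosen from the symbol alone, all rows for one symbol end up in the same list, ordered by date.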
CLUSTER BY Clause
In the previous example, the s.symbol column was used in the DISTRIBUTE BY clause, and the s.symbol
and the s.ymd columns in the SORT BY clause. Suppose that the same columns are used in both clauses and
all columns are sorted by ascending order (the default). In this case, the CLUSTER BY clause is a shorthand
way of expressing the same query.
For example, let’s modify the previous query to drop sorting by s.ymd and use CLUSTER BY on s.symbol:
hive> SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
CLUSTER BY s.symbol;

Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY clauses is a way to exploit the parallelism
of SORT BY, yet keep all rows for a given key together, in sorted order, within a single reducer’s output.
Casting
Here we discuss the cast() function that allows you to explicitly convert a value of one type to another.
Recall our employees table uses a FLOAT for the salary column. Now, imagine for a moment that STRING
was used for that column instead. How could we work with the values as FLOATS?
The following example casts the values to FLOAT before performing a comparison:
SELECT name, salary
FROM employees
WHERE cast (salary AS FLOAT) < 100000.0;

The syntax of the cast function is cast(value AS TYPE). What would happen in the example if a salary value
was not a valid string for a floating-point number? In this case, Hive returns NULL.
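
The NULL-on-failure behavior can be modeled as follows. This is a plain-Python sketch rather than Hive code: hive_cast_float and keep_row are names invented for the example, the "Bad Row" record is made up, and Python's float() parsing rules are only an approximation of Hive's.

```python
def hive_cast_float(value):
    """Model of cast(value AS FLOAT): return None (NULL) instead of raising on bad input."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def keep_row(salary_text):
    """Model of WHERE cast(salary AS FLOAT) < 100000.0: a NULL comparison drops the row."""
    salary = hive_cast_float(salary_text)
    return salary is not None and salary < 100000.0


# Salaries stored as strings; the last row is a made-up invalid value.
rows = [("John Doe", "100000.0"), ("Mary Smith", "80000.0"), ("Bad Row", "n/a")]
selected = [name for name, salary_text in rows if keep_row(salary_text)]
```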

Casting BINARY Values


The new BINARY type introduced in Hive v0.8.0 only supports casting BINARY to STRING. However, if you
know the value is a number, you can nest cast() invocations, as in this example where column b is a
BINARY column:

SELECT (2.0*cast(cast(b as string) as double)) from src;

You can also cast STRING to BINARY.

Block Sampling
Hive offers another syntax for sampling a percentage of blocks of an input path as an alternative to
sampling based on rows:
hive> SELECT * FROM numbersflat TABLESAMPLE(0.1 PERCENT) s;

The smallest unit of sampling is a single HDFS block. Hence, for tables smaller than the typical block size of 128
MB, all rows will be returned.

HIVE JOIN

JOIN is a clause that is used for combining specific fields from two tables by using values common to
each one. It is used to combine records from two or more tables in the database. It is more or less similar
to SQL JOIN.
Syntax
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference [join_condition]
| table_reference LEFT SEMI JOIN table_reference [join_condition]
| table_reference CROSS JOIN table_reference [join_condition]

We will use the following four tables interchangeably in this chapter. Consider the following table named
CUSTOMERS:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+

Consider another table ORDERS as follows:


+-----+---------------------+-------------+--------+
|OID | DATE | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+-----+---------------------+-------------+--------+
Consider the stocks table, which consists of the following columns:
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING, symbol STRING, ymd STRING, price_open FLOAT,
price_high FLOAT, price_low FLOAT, price_close FLOAT, volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Consider the dividends table, which has the following definition:
CREATE EXTERNAL TABLE IF NOT EXISTS dividends (
ymd STRING,
dividend FLOAT )
PARTITIONED BY (exchange STRING, symbol STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Hive supports only equi-joins, i.e. only equality (=) is supported in the JOIN condition.


The following 5 types of join in Hive are covered here:
1. INNER JOIN
2. LEFT OUTER JOIN
3. RIGHT OUTER JOIN
4. LEFT SEMI JOIN
5. FULL OUTER JOIN

1. INNER JOIN
In an inner JOIN, records are discarded unless the join criteria find matching records in every table being joined.
The following query executes a JOIN on the CUSTOMERS and ORDERS tables, and retrieves the records:

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
The ON clause specifies the conditions for joining records between the two tables.
We can use a WHERE clause to reduce the number of rows eligible for the join.
Standard SQL allows a non-equi-join on the join keys. But this is not valid in Hive, primarily because it is
difficult to implement these kinds of joins in MapReduce.
Hive does not currently support using OR between predicates in ON clauses.
We can place multiple conditions in the join condition using the AND operator. Consider the following example, in
which the stocks table is joined with dividends:
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';


2. LEFT OUTER JOIN


The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in the
right table. This means, if the ON clause matches 0 (zero) records in the right table, the JOIN still returns a
row in the result, but with NULL in each column from the right table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right table, or
NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+

3. RIGHT OUTER JOIN


The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches in
the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still returns a row in the
result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left table, or
NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables.
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+

4. LEFT SEMI-JOIN
A left semi-join returns records from the lefthand table if records are found in the righthand table that satisfy
the ON predicates. It’s a special, optimized case of the more general inner join. Most SQL dialects support an
IN ... EXISTS construct to do the same thing. For instance, the following query attempts to
return stock records only on the days of dividend payments, but it doesn’t work in Hive.

SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
WHERE s.ymd, s.symbol IN (
SELECT d.ymd, d.symbol FROM dividends d);

Instead, you use the following LEFT SEMI JOIN syntax:

hive> SELECT s.ymd, s.symbol, s.price_close
FROM stocks s LEFT SEMI JOIN dividends d
ON s.ymd = d.ymd AND s.symbol = d.symbol;

1962-11-05 IBM 361.5
1962-08-07 IBM 373.25
1962-05-08 IBM 459.5
1962-02-06 IBM 551.5

The reason semi-joins are more efficient than the more general inner join is as follows.
For a given record in the lefthand table, Hive can stop looking for matching records in the righthand table
as soon as any match is found. At that point, the selected columns from the lefthand table record can be
projected.
Right semi-joins are not supported in Hive.
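
The early-exit behavior described above can be sketched in plain Python. This is a model, not Hive's implementation: the function name left_semi_join is invented, the first two stock rows reuse values from the output above, and the 1962-11-06 row and the dividends rows are made up for illustration.

```python
def left_semi_join(left_rows, right_rows, left_key, right_key):
    """Emit each left row at most once, as soon as any matching right row is found."""
    result = []
    for left in left_rows:
        # any() stops scanning the right side at the first match.
        if any(left_key(left) == right_key(right) for right in right_rows):
            result.append(left)   # only lefthand columns are projected
    return result


stocks = [("1962-11-05", "IBM", 361.5),
          ("1962-11-06", "IBM", 360.0),     # made-up non-dividend day
          ("1962-08-07", "IBM", 373.25)]
dividends = [("1962-11-05", "IBM"),
             ("1962-11-05", "IBM"),         # duplicate: must not duplicate the output row
             ("1962-08-07", "IBM")]

semi = left_semi_join(stocks, dividends,
                      left_key=lambda s: (s[0], s[1]),
                      right_key=lambda d: (d[0], d[1]))
```

An inner join on the same data would return the 1962-11-05 row twice, once per matching dividends row, which is the difference the semi-join avoids.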

5. FULL OUTER JOIN


The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables that fulfil
the JOIN condition. The joined table contains either all the records from both the tables, or fills in NULL
values for missing matches on either side.
The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+------+----------+--------+---------------------+

Cartesian Product JOINs
A Cartesian product is a join where all the tuples in the left side of the join are paired with all the tuples of the
right table. If the left table has 5 rows and the right table has 6 rows, 30 rows of output will be produced:
SELECT * FROM stocks JOIN dividends;

Cartesian products create a lot of data. Unlike other join types, Cartesian products are not
executed in parallel, and they are not optimized in any way using MapReduce.
In Hive, a join written without an ON clause computes the full Cartesian product before applying any WHERE
clause, so it could take a very long time to finish. When the property hive.mapred.mode is set to strict, Hive
prevents users from inadvertently issuing a Cartesian product query.
Cartesian product queries can be useful. For example, suppose there is a table of user preferences, a table
of news articles, and an algorithm that predicts which articles a user would like to read. A Cartesian
product is required to generate the set of all users and all pages.
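
The row-count arithmetic is easy to check with a short Python sketch. The row labels below are placeholders invented for the example.

```python
from itertools import product

left = [f"left_{i}" for i in range(5)]     # 5 placeholder rows
right = [f"right_{j}" for j in range(6)]   # 6 placeholder rows

# Every left row is paired with every right row.
pairs = list(product(left, right))
row_count = len(pairs)                     # 5 * 6
```

The output size is the product of the two input sizes, which is why a Cartesian product of two large tables grows so quickly.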

Join Optimization
Before looking at how to optimize joins, we need to understand how Hive carries out a join
internally.
HIVE Join
Each time we run a join query, Hive internally generates a MapReduce job. By default, one join produces
one MapReduce job. If more than two tables are involved in a join statement, more than one MapReduce
job may be generated. For now, let’s consider a join statement where only two tables are
involved, so Hive generates one MapReduce job.
Like any other MapReduce job, it starts with the map phase. Each mapper reads data from a table
(the table data is physically stored in data files, which are logically divided into input splits, and
each mapper reads one input split) and emits <Key, Value> pairs. Here the Key is the join key
(the column in the join condition) and the Value is the entire tuple.
The mapper output then goes through the shuffle and sort phase, in which all tuples with the same
join key are sent to the same reducer.
The reduce phase is where the actual join happens. Each reducer takes the sorted results as input
and joins the records that have the same join key.
This process is called a shuffle join or common join. It is also called a reduce-side join, because
the actual join is done in the reduce phase.
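
The three phases can be modeled in plain Python to make the data flow concrete. This is a single-process sketch of the idea, not how Hive is implemented: the function names and the "c"/"o" tags are invented for the example, and the rows are a subset of the columns of the CUSTOMERS and ORDERS tables shown earlier.

```python
from collections import defaultdict

customers = [(1, "Ramesh"), (2, "Khilan"), (3, "kaushik"), (4, "Chaitali"),
             (5, "Hardik"), (6, "Komal"), (7, "Muffy")]            # (ID, NAME)
orders = [(102, 3, 3000), (100, 3, 1500), (101, 2, 1560), (103, 4, 2060)]  # (OID, CUSTOMER_ID, AMOUNT)


def map_phase():
    """Emit <join key, tagged tuple> pairs; the tag records which table a row came from."""
    for row in customers:
        yield row[0], ("c", row)
    for row in orders:
        yield row[1], ("o", row)


def shuffle(pairs):
    """Group all tuples that share a join key, as the shuffle and sort phase does."""
    grouped = defaultdict(list)
    for key, tagged in pairs:
        grouped[key].append(tagged)
    return grouped


def reduce_phase(grouped):
    """For each join key, pair every customer row with every order row."""
    joined = []
    for key in sorted(grouped):
        cust_rows = [row for tag, row in grouped[key] if tag == "c"]
        order_rows = [row for tag, row in grouped[key] if tag == "o"]
        for c in cust_rows:
            for o in order_rows:
                joined.append((c[0], c[1], o[2]))   # (ID, NAME, AMOUNT)
    return joined


result = reduce_phase(shuffle(map_phase()))
```

Note that every row of both tables passes through shuffle(), including customers that have no orders. That is the cost the optimizations below try to avoid.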

Need of Optimization
Although the shuffle join (or common join) works, it has one problem. For a huge dataset, the mappers
emit a huge number of <Key, Value> pairs. All of these pairs have to go through the shuffle and sort
phase, which increases bandwidth usage and network traffic. To handle this situation, Hive supports
join optimization.
Optimizing Hive Join
Hive offers two main optimizations for joins:
a) the map-side join, which does the join in the map phase and avoids the shuffle and reduce stages when one table is small, and
b) the sort-merge bucket (SMB) join, which handles large tables that are bucketed and sorted on the join key.
MAP-SIDE join
The motivation of map join is to skip the shuffle and reduce stages and do the join work only in the map
stage. When one of the join tables is small enough to fit into memory, all the mappers can
hold the data from the smaller table in memory and do the join work there while reading the other table from
file (i.e. the second table is streamed). So all the join operations can be finished in the map stage.
However, there are some scaling problems with this type of map join. To get the small table into memory, every
mapper has to read the smaller file (table) first. When thousands of mappers read the small join table
from the Hadoop Distributed File System (HDFS) into memory at the same time, the join table easily
becomes the performance bottleneck, causing the mappers to time out during the read operations.
Using the Distributed Cache
Hive solves this scaling problem. The basic idea of optimization is to create a new MapReduce local task
just before the original join MapReduce task. This new task reads the small table data from HDFS to an in-
memory hash table. After reading, it serializes the in-memory hash table into a hashtable file.
In the next stage, when the MapReduce task is launching, it uploads this hashtable file to the Hadoop
distributed cache, which populates these files to each mapper's local disk. So all the mappers can load this
persistent hashtable file back into memory and do the join work as before. The execution flow of the
optimized map join is shown in the figure below.
After optimization, the small table needs to be read just once. Also if multiple mappers are running on the
same machine, the distributed cache only needs to push one copy of the hashtable file to this machine.
Since map join is faster than the common join, it's better to run the map join whenever possible.
Previously, Hive users needed to give a hint in the query to specify the small table.
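
The core of a map join, building a hash table from the small table once and then streaming the big table past it, can be sketched in plain Python. This is a single-process model, not Hive's implementation: the function names are invented, and the cities and orders rows are hypothetical.

```python
def build_hash_table(small_rows, key_index):
    """Load the small table into an in-memory hash table keyed on the join column."""
    table = {}
    for row in small_rows:
        table.setdefault(row[key_index], []).append(row)
    return table


def map_join(big_rows, big_key_index, hash_table):
    """Stream the big table; each row is joined by a hash lookup, with no shuffle or reduce."""
    for row in big_rows:
        for match in hash_table.get(row[big_key_index], []):
            yield row + match


cities = [(1, "Delhi"), (2, "Mumbai")]                 # small table: (id, name), hypothetical
orders = [(100, 2), (101, 1), (102, 3), (103, 2)]      # big table: (order_id, city_id), hypothetical

hash_table = build_hash_table(cities, key_index=0)     # done once, like the local task
joined = list(map_join(orders, big_key_index=1, hash_table=hash_table))
```

In Hive the hash table built in the first step is what gets serialized and shipped through the distributed cache, so each mapper only has to load it rather than rebuild it from HDFS.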

There are two ways to enable map join in Hive.
1. The first is to use a hint, which looks like /*+ MAPJOIN(aliasname), MAPJOIN(anothertable) */. This C-style
comment should be placed immediately after SELECT. It directs Hive to load aliasname (which is a
table or an alias in the query) into memory. Before using the hint, make sure the parameter below is set to
false (the default is true in Hive 0.13).
set hive.ignore.mapjoin.hint=false;
Then:
SELECT /*+ MAPJOIN(c) */ * FROM orders o JOIN cities c ON (o.city_id = c.id);

This isn't a good user experience, because the user may give the wrong hint or may not give any
hint at all.
2. Another (better, in my opinion) way to turn on map joins is to let Hive do it automatically. Simply set
hive.auto.convert.join to true in your config, and Hive will automatically use map joins for any tables
smaller than hive.mapjoin.smalltable.filesize (default is 25MB). Both properties can be set manually
from the Hive terminal using the SET command.
set hive.auto.convert.join=true;

When hive.auto.convert.join.noconditionaltask=true, if the estimated size of the small table(s) is
less than or equal to hive.auto.convert.join.noconditionaltask.size (default 10MB), then the
common join is converted to a map join automatically.
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;

Limitation Of Map Join


Mapjoins have a limitation in that the same table or alias cannot be used to join on different columns in the
same query. (This makes sense because presumably Hive uses a HashMap keyed on the column(s) used in
the join, and such a HashMap would be of no use for a join on different keys).
Map join is best suited to small tables. It is fast and needs only a single scan through the larger table, but it
does not work if the small table does not fit in memory. So when we join two tables, one of them must be smaller
than the available RAM. If both tables are larger than RAM, we should use a sort-merge bucket join instead.

SMB (sort-merge bucket join) / Bucket-map join
If a table is large (its size is greater than the size set in the hive.mapjoin.smalltable.filesize parameter),
you cannot do an ordinary map-side join.
Because large tables are involved, each table should be bucketed, which offers faster
query processing. Hive provides a way to optimize the join of such large bucketed tables. This process is
called a sort-merge bucket join.
Sort-merge bucket joins are used wherever the tables are sorted and bucketed on the same column. The join
boils down to just merging the already sorted tables, allowing this operation to be faster than an ordinary
map join.
However, if the tables are partitioned, there could be a slowdown, as each mapper would need to get a
very small chunk of a partition which has a single key.
For SMB, you must set hive.optimize.bucketmapjoin.sortedmerge = true; then you can
still do a map-side join on large tables. (Of course, you still need set hive.optimize.bucketmapjoin =
true.)
Make sure that your tables are truly bucketed and sorted on the same column. It's easy to make
mistakes. To get a bucketed and sorted table, you need to set the following parameters.
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
Along with setting the above parameters, we have to use SORTED BY and INTO ... BUCKETS when
creating the tables.
In a sort-merge bucket join, each mapper reads the matching buckets of the two tables, which are
already sorted and bucketed on the join key, and simply merges them, so the full data set does not
have to be shuffled and sorted again.
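
The merge step on one pair of matching buckets can be sketched in plain Python. This is a model of the general sort-merge technique under the assumption that both inputs are already sorted on the join key; it is not Hive's code, and the function name sort_merge_join is invented. The rows reuse values from the CUSTOMERS and ORDERS tables.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1                      # no partner for this left row
        elif left_key > right_key:
            j += 1                      # no partner for this right row
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    joined.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return joined


# One bucket from each table, both sorted on the join key (customer id).
customer_bucket = [(2, "Khilan"), (3, "kaushik"), (4, "Chaitali"), (5, "Hardik")]
order_bucket = [(2, 1560), (3, 1500), (3, 3000), (4, 2060)]

merged = sort_merge_join(customer_bucket, order_bucket)
```

Each input is read once from front to back and no hash table is held in memory, which is why this approach still works when neither table fits in RAM.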

Hive Performance Tuning:
The following are 10 ways to optimize the performance of Hive.
1. Enable Compression in Hive
By enabling compression at various phases (i.e. on final output and intermediate data), we can improve the
performance of Hive queries. For further details on how to enable compression in Hive, refer to
the post Compression in Hive.
2. Optimize Joins
We can improve the performance of joins by enabling Auto Convert Map Joins and enabling optimization
of skew joins.
Auto Map Joins
Auto Map-Join is a very useful feature when joining a big table with a small table. If we enable this feature,
the small table will be saved in the local cache on each node, and then joined with the big table in the Map
phase. Enabling Auto Map Join provides two advantages. First, loading a small table into cache will save
read time on each data node. Second, it avoids skew joins in the Hive query, since the join operation has
already been done in the Map phase for each block of data.
To enable the Auto Map-Join feature, we need to set below properties.
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
<description>Whether Hive enables the optimization about converting common join
into mapjoin based on the input file size
</description>
</property>
<property>
<name>hive.auto.convert.join.noconditionaltask</name>
<value>true</value>
<description>
Whether Hive enables the optimization about converting common join into
mapjoin based on the input file size. If this parameter is on, and the sum of
size for n-1 of the tables/partitions for a n-way join is smaller than the
specified size, the join is directly converted to a mapjoin (there is no
conditional task).
</description>
</property>
<property>
<name>hive.auto.convert.join.noconditionaltask.size</name>
<value>10000000</value>
<description>
If hive.auto.convert.join.noconditionaltask is off, this parameter does not
take effect. However, if it is on, and the sum of size for n-1 of the
tables/partitions for a n-way join is smaller than this size, the join
is directly converted to a mapjoin(there is no conditional task). The default
is 10MB
</description>
</property>
<property>
<name>hive.auto.convert.join.use.nonstaged</name>
<value>false</value>
<description>
For conditional joins, if input stream from a small alias can be directly
applied to join operator without filtering or projection, the alias need not
to be pre-staged in distributed cache via mapred local task.
Currently, this is not working with vectorization or tez execution engine.
</description>
</property>
Skew Joins
We can enable the optimization of skew joins, i.e. imbalanced joins, by setting the hive.optimize.skewjoin
property to true, either via the SET command in the Hive shell or in the hive-site.xml file. Below is the list of
properties that can be fine-tuned to better optimize skew joins.
<property>
<name>hive.optimize.skewjoin</name>
<value>true</value>
<description>
Whether to enable skew join optimization. The algorithm is as follows: At
runtime, detect the keys with a large skew. Instead of processing those keys,
store them temporarily in an HDFS directory. In a follow-up map-reduce job,
process those skewed keys. The same key need not be skewed for all the
tables, and so, the follow-up map-reduce job (for the skewed keys) would be
much faster, since it would be a
map-join.
</description>
</property>
<property>
<name>hive.skewjoin.key</name>
<value>100000</value>
<description>
Determine if we get a skew key in join. If we see more than the specified
number of rows with the same key in join operator,
we think the key as a skew join key.
</description>
</property>
<property>
<name>hive.skewjoin.mapjoin.map.tasks</name>
<value>10000</value>
<description>
Determine the number of map task used in the follow up map join job for a
skew join. It should be used together with hive.skewjoin.mapjoin.min.split to
perform a fine grained control.
</description>
</property>
<property>
<name>hive.skewjoin.mapjoin.min.split</name>
<value>33554432</value>
<description>
Determine the number of map task at most used in the follow up map join job
for a skew join by specifying the minimum split size. It should be used
together with hive.skewjoin.mapjoin.map.tasks to perform a fine grained
control.
</description>
</property>

Enable Bucketed Map Joins


If tables are bucketed by a particular column and these tables are used in joins, then we can enable
bucketed map join to improve the performance. To do this, we can set the properties below in hive-site.xml
or in the Hive shell.

<property>
<name>hive.optimize.bucketmapjoin</name>
<value>true</value>
<description>Whether to try bucket mapjoin</description>
</property>

<property>
<name>hive.optimize.bucketmapjoin.sortedmerge</name>
<value>true</value>
<description>Whether to try sorted bucket merge map join</description>
</property>

3. Avoid Global Sorting in Hive


Global sorting can be achieved in Hive with the ORDER BY clause, but this comes with a drawback.
ORDER BY produces its result by setting the number of reducers to one, making it very inefficient for
large datasets.
When a globally sorted result is not required, we can use the SORT BY clause. SORT BY produces a
sorted file per reducer. If we need to control which reducer a particular row goes to, we can use the
DISTRIBUTE BY clause, for example,
SELECT id, name, salary, dept FROM employee
DISTRIBUTE BY dept
SORT BY id ASC, name DESC;

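All rows of a given dept are sent to the same reducer, and each reducer sorts the rows it receives by the id
and name fields. A reducer may receive more than one dept, so if the rows of each dept must stay together
in the output, add dept as the first SORT BY key.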
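The behaviour of DISTRIBUTE BY combined with SORT BY can be simulated in plain Python. The sketch below is only a model: the helper reducer_for is a hypothetical, deterministic stand-in for Hive's hash partitioner, and the sample rows are invented for the illustration.

```python
from collections import defaultdict


def reducer_for(dept, num_reducers):
    # Deterministic stand-in for Hive's hash partitioner (an assumption).
    return sum(ord(c) for c in dept) % num_reducers


def distribute_and_sort(rows, num_reducers):
    """Simulate DISTRIBUTE BY dept SORT BY id ASC, name DESC."""
    reducers = defaultdict(list)
    for row in rows:
        # DISTRIBUTE BY: rows with the same dept always land on the same reducer.
        reducers[reducer_for(row["dept"], num_reducers)].append(row)
    for bucket in reducers.values():
        # SORT BY: each reducer sorts only its own rows.
        bucket.sort(key=lambda r: r["name"], reverse=True)  # secondary key: name DESC
        bucket.sort(key=lambda r: r["id"])  # primary key: id ASC (stable sort)
    return dict(reducers)


rows = [
    {"id": 3, "name": "Cy", "dept": "hr"},
    {"id": 1, "name": "Al", "dept": "ops"},
    {"id": 2, "name": "Bo", "dept": "it"},
    {"id": 1, "name": "Zed", "dept": "hr"},
]
out = distribute_and_sort(rows, num_reducers=2)
for reducer_id, bucket in sorted(out.items()):
    print(reducer_id, [(r["id"], r["name"], r["dept"]) for r in bucket])
```

In this run the hr and ops rows share one reducer and are interleaved by id, which shows that the sort order is per reducer rather than per dept.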
4. Enable Tez Execution Engine
Instead of running Hive queries on the venerable MapReduce engine, we can run them on the Tez execution
engine, which can improve the performance of Hive queries by roughly 100% to 300%. We can enable the Tez
engine with the property below from the Hive shell.
hive> set hive.execution.engine=tez;

5. Optimize LIMIT operator


By default, the LIMIT operator still executes the entire query and only then returns a limited number of
results. Because this behavior is generally wasteful, it can be avoided by setting the properties below.
<property>
<name>hive.limit.optimize.enable</name>
<value>true</value>
<description>Whether to enable to optimization to trying a smaller subset of
data for simple LIMIT first. </description>
</property>
<property>
<name>hive.limit.row.max.size</name>
<value>100000</value>
<description>When trying a smaller subset of data for simple LIMIT, how much
size we need to guarantee each row to have at least.</description>
</property>
<property>
<name>hive.limit.optimize.limit.file</name>
<value>10</value>
<description>When trying a smaller subset of data for simple LIMIT, maximum
number of files we can sample.</description>
</property>
<property>
<name>hive.limit.optimize.fetch.max</name>
<value>50000</value>
<description>
Maximum number of rows allowed for a smaller subset of data for simple LIMIT,
if it is a fetch query.Insert queries are not restricted by this limit.
</description>
</property>

6. Enable Parallel Execution
Hive converts a query into one or more stages. A stage can be a MapReduce stage, a sampling stage, a
merge stage or a limit stage. By default, Hive executes these stages one at a time. A particular job may
consist of some stages that do not depend on each other and could be executed in
parallel, possibly allowing the overall job to complete more quickly. Parallel execution can be enabled by
setting the properties below.
<property>
<name>hive.exec.parallel</name>
<value>true</value>
<description>Whether to execute jobs in parallel</description>
</property>
<property>
<name>hive.exec.parallel.thread.number</name>
<value>8</value>
<description>How many jobs at most can be executed in parallel</description>
</property>

7. Enable MapReduce Strict Mode
We can enable MapReduce strict mode by setting the property below to strict.
<property>
<name>hive.mapred.mode</name>
<value>strict</value>
<description>
The mode in which the Hive operations are being performed.
In strict mode, some risky queries are not allowed to run. They include:
Cartesian Product.
No partition being picked up for a query.
Comparing bigints and strings.
Comparing bigints and doubles.
Orderby without limit.
</description>
</property>

8. Single Reduce for Multi Group BY


By enabling single reducer task for multi group by operations, we can combine multiple GROUP BY
operations in a query into a single MapReduce job.
<property>
<name>hive.multigroupby.singlereducer</name>
<value>true</value>
<description>
Whether to optimize multi group by query to generate single M/R job plan. If
the multi group by query has
common group by keys, it will be optimized to generate single M/R job.
</description>
</property>

9. Enable Vectorization
By default, Hive processes rows one by one. Each row of data goes through all operators before
the next one is processed. This approach is very inefficient in terms of CPU usage.
To improve the efficiency of CPU instructions and cache usage, Hive (version 0.13.0 and later) uses
vectorization. This is a parallel processing technique in which an operation is applied to a block of 1024
rows at a time rather than to a single row. Each column in the block is represented by a vector of a
primitive data type. The inner loop of execution effectively scans these vectors, avoiding method calls,
deserialization, and unnecessary if-then-else instructions.
Vectorization only works with columnar formats, such as ORC and Parquet.
We can enable vectorized query execution by setting the three properties below in either the Hive shell or
the hive-site.xml file.
hive> set hive.vectorized.execution.enabled = true;
hive> set hive.vectorized.execution.reduce.enabled = true;
hive> set hive.vectorized.execution.reduce.groupby.enabled = true;

If possible, Hive will apply operations to vectors. Otherwise, it will execute the query with vectorization
turned off.
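The difference between row-at-a-time and vectorized processing can be illustrated with a small Python model. It only mirrors the control flow described above (one column handled in batches of 1024 values); the function names and the salary filter are assumptions for the example, and Python itself gains none of the CPU-level benefits that Hive's vectorized operators obtain.

```python
BATCH_SIZE = 1024  # block size mentioned in the text


def filter_sum_row_at_a_time(rows):
    """Each row passes through the filter and the aggregation before the next one."""
    total = 0
    for row in rows:
        if row["salary"] > 1000:
            total += row["salary"]
    return total


def batches(column, size=BATCH_SIZE):
    """Yield successive vectors (slices) of a single column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def filter_sum_vectorized(salary_column):
    """Apply the filter and the aggregation to one vector of values at a time."""
    total = 0
    for vector in batches(salary_column):
        total += sum(v for v in vector if v > 1000)
    return total


salaries = list(range(3000))
rows = [{"salary": s} for s in salaries]
print(filter_sum_row_at_a_time(rows), filter_sum_vectorized(salaries))
```

Both functions return the same result; the vectorized variant touches only the column it needs and works on three batches instead of 3000 individual rows.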
10. Control Parallel Reduce Tasks
We can control the number of parallel reduce tasks that run for a given Hive query with the properties
below.
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>256000000</value>
<description>Size per reducer. The default is 256 MB, i.e. if the input size is
1 GB, it will use 4 reducers.</description>
</property>
<property>
<name>hive.exec.reducers.max</name>
<value>1009</value>
<description>
max number of reducers will be used. If the one specified in the
configuration parameter mapred.reduce.tasks is negative, Hive will use this
one as the max number of reducers when automatically determine number of
reducers.
</description>
</property>
We can also set the number of reduce tasks to a fixed value with the property below.
hive> set mapred.reduce.tasks=32;
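The arithmetic behind these settings can be checked with a short Python sketch. It is a simplified model built only from the property descriptions above (input size divided by bytes per reducer, capped by the maximum, overridden by a fixed positive value); the function name estimate_reducers is an assumption, and Hive's real estimate may take further factors into account.

```python
import math


def estimate_reducers(input_bytes,
                      bytes_per_reducer=256_000_000,  # hive.exec.reducers.bytes.per.reducer
                      max_reducers=1009,              # hive.exec.reducers.max
                      fixed=-1):                      # mapred.reduce.tasks
    """Simplified model of how the number of reducers is chosen."""
    if fixed > 0:
        # A positive mapred.reduce.tasks overrides the automatic estimate.
        return fixed
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


print(estimate_reducers(1_000_000_000))            # 1 GB of input
print(estimate_reducers(10**12))                   # 1 TB of input, capped by the maximum
print(estimate_reducers(1_000_000_000, fixed=32))  # fixed number of reducers
```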

11. Enable Cost Based Optimization


Recent Hive releases provide cost based optimization, with which one can achieve further
optimizations based on query cost, resulting in potentially different decisions: how to order joins, which
type of join to perform, the degree of parallelism and others.
Cost based optimization can be enabled by setting the properties below in the hive-site.xml file.
<property>
<name>hive.cbo.enable</name>
<value>true</value>
<description>Flag to control enabling Cost Based Optimizations using Calcite
framework.</description>
</property>
<property>
<name>hive.compute.query.using.stats</name>
<value>true</value>
<description>
When set to true Hive will answer a few queries like count(1) purely using
stats stored in metastore. For basic stats collection turn on the config
hive.stats.autogather to true.
For more advanced stats collection need to run analyze table queries.
</description>
</property>
<property>
<name>hive.stats.fetch.partition.stats</name>
<value>true</value>
<description>
Annotation of operator tree with statistics information requires partition
level basic statistics like number of rows, data size and file size.

Partition statistics are fetched from metastore. Fetching partition
statistics for each needed partition can be expensive when the number of
partitions is high. This flag can be used to disable fetching of partition
statistics from metastore. When this flag is disabled, Hive will make calls to
filesystem to get file sizes and will estimate the number of rows from row
schema.
</description>
</property>
<property>
<name>hive.stats.fetch.column.stats</name>
<value>true</value>
<description>
Annotation of operator tree with statistics information requires column
statistics. Column statistics are fetched from metastore. Fetching column
statistics for each needed column can be expensive when the number of columns
is high. This flag can be used to disable fetching
of column statistics from metastore.
</description>
</property>
<property>
<name>hive.stats.autogather</name>
<value>true</value>
<description>A flag to gather statistics automatically during the INSERT
OVERWRITE command.</description>
</property>
<property>
<name>hive.stats.dbclass</name>
<value>fs</value>
<description>
Expects one of the pattern in [jdbc(:.*), hbase, counter, custom, fs].
The storage that stores temporary Hive statistics. In filesystem based
statistics collection ('fs'), each task writes statistics it has collected in
a file on the filesystem, which will be aggregated after the job has
finished. Supported values are fs (filesystem), jdbc:database (where database
can be derby, mysql, etc.), hbase, counter, and custom as defined in
StatsSetupConst.java.
</description>
</property>
We can gather column statistics for all columns of the employee table, or only for selected columns, with the commands below in the Hive shell.
hive> ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS;
hive> ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS id, dept;

12. Use ORC File Format


Using the ORC (Optimized Row Columnar) file format, we can improve the performance of Hive queries
very effectively.
[Figure: comparison of the ORC file format with other file formats]

Debugging Tool in Hive
DESCRIBE
The DESCRIBE statement displays metadata about a database or table or column, such as the column
names and their data types. You can use the abbreviation DESC for the DESCRIBE statement.
Syntax [For Database]
DESCRIBE DATABASE [EXTENDED] [IF EXISTS] db_name;
Both EXTENDED and IF EXISTS are optional.
When the DESCRIBE command is entered without the EXTENDED keyword, Hive shows only the database name
and the HDFS location of the database.
When the DESCRIBE command is entered with the EXTENDED keyword, Hive shows the database name, database
location and database properties. [We can take the EXTENDED keyword as a way of telling Hive to show
extensive information about the database.]
When IF EXISTS is used, Hive suppresses the error message if the database does not exist.
Syntax [For Table]
DESCRIBE [EXTENDED|FORMATTED] [IF EXISTS] [db_name.]table_name;
Here too, EXTENDED, FORMATTED and IF EXISTS are all optional.
When the DESCRIBE command is entered without the EXTENDED or FORMATTED keyword, Hive shows the column
names and their data types.
When the DESCRIBE command is entered with the EXTENDED keyword, Hive shows the table name, table location,
column names and types, and other table properties.
When the DESCRIBE command is entered with the FORMATTED keyword, Hive also shows all the table properties,
but in a more formatted and readable way.
When IF EXISTS is used, Hive suppresses the error message if the table does not exist.
db_name. is optional, but it helps when the table is not in the current database.

Syntax [For Column]


DESCRIBE [IF EXISTS] [db_name.]table_name.col_name;
If you only want to see the schema for a particular column, append the column name to the table name. Here,
EXTENDED adds no additional output.

EXPLAIN
Hive provides an EXPLAIN command that shows the logical and physical execution plan for a query. The syntax
for this statement is as follows:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
AUTHORIZATION is supported from HIVE 0.14.0 via HIVE-5961.
The use of EXTENDED in the EXPLAIN statement produces extra information about the operators in the plan.
This is typically physical information like file names.
A Hive query gets converted into a sequence (more precisely, a Directed Acyclic Graph) of stages. These stages
may be map/reduce stages, or they may be stages that do metastore or file system operations like move and
rename. The explain output has three parts:
• The Abstract Syntax Tree for the query
• The dependencies between the different stages of the plan
• The description of each of the stages
The description of the stages itself shows a sequence of operators with the metadata associated with the
operators. The metadata may comprise things like filter expressions for the FilterOperator, the select
expressions for the SelectOperator, or the output file names for the FileSinkOperator.

Indexing in Hive
What is an Index?
An index acts as a reference to the records. Instead of searching all the records, we can refer to the index
to find a particular record. Because the index maintains references to the records, a record can be found
with minimum overhead, which speeds up searching of data.

Why use indexing in Hive?

Hive is a data warehousing tool on top of Hadoop which provides an SQL-like interface to
perform queries on large data sets. Since Hive deals with Big Data, the files are naturally large and
can span terabytes or petabytes. Any operation or query on this huge
amount of data takes a large amount of time.
A Hive table has many rows and columns. If we query only
some columns without indexing, the query still takes a large amount of time because it is executed on all
the columns present in the table.
The major advantage of indexing is that whenever we perform a query on a table that has an index,
there is no need for the query to scan all the rows in the table. Instead, it checks the index first and then
goes to the particular column and performs the operation.
So if we maintain indexes, it is easier for a Hive query to look into the indexes first and then perform
the needed operations in less time.
Eventually, time is the factor that everyone focuses on.

When to use Indexing?


Indexing can be used under the following circumstances:
If the dataset is very large.
If the query execution takes more time than you expected.
If a speedy query execution is required.
When building a data model.

Indexes are maintained in a separate table in Hive, so they do not affect the data inside the table being
indexed. Another major advantage of indexing in Hive is that indexes can also be partitioned,
depending on the size of the data we have.
Types of Indexes in Hive
Compact Indexing
Bitmap Indexing
Bitmap indexing was introduced in Hive 0.8 and is commonly used for columns with few distinct values.

Differences between Compact and Bitmap Indexing


The main difference lies in how the mapping from values to rows in the different blocks is stored. When the
data of a Hive table is stored in HDFS, it is distributed across the nodes of a cluster.
The data therefore needs proper identification, as in block indexing: the index must be able
to identify which row is present in which block, so that when a query is triggered it can go directly to
that block. So, while performing a query, Hive first checks the index and then goes directly to that block.
Compact indexing stores pairs of the indexed column's value and its block id.
Bitmap indexing stores the combination of the indexed column's value and the list of rows as a bitmap.
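The two layouts can be compared with a toy model in Python. This is a simplified sketch of the idea only: the block layout, the function names and the sample ages are assumptions, and the real Hive index tables store more detail (such as file names and offsets) than is modelled here.

```python
from collections import defaultdict


def build_compact_index(blocks, column):
    """Map each value to the sorted list of block ids that contain it."""
    index = defaultdict(set)
    for block_id, rows in enumerate(blocks):
        for row in rows:
            index[row[column]].add(block_id)
    return {value: sorted(ids) for value, ids in index.items()}


def build_bitmap_index(blocks, column):
    """Map each (value, block id) to an int whose bit i is set when row i of the block matches."""
    index = defaultdict(int)
    for block_id, rows in enumerate(blocks):
        for offset, row in enumerate(rows):
            index[(row[column], block_id)] |= 1 << offset
    return dict(index)


# Two hypothetical blocks of a table with a single indexed column.
blocks = [
    [{"age": 23}, {"age": 31}, {"age": 23}],
    [{"age": 31}, {"age": 40}],
]
compact = build_compact_index(blocks, "age")
bitmap = build_bitmap_index(blocks, "age")
print(compact)
print({key: bin(bits) for key, bits in bitmap.items()})
```

The compact index only tells a query which blocks to read, while the bitmap additionally records which rows inside each block hold the value.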

Understanding what a bitmap is

In this context, a bitmap is an array of bits with one bit per row: a bit is set to 1 when the corresponding
row contains the indexed value and to 0 otherwise. It should not be confused with the bitmap image file
format. With this meaning of bitmap, we can restate bitmap indexing as given below.
"A bitmap index stores the combination of a value and the list of rows containing it as an array of bits."

The following are the different operations that can be performed on Hive indexes:
Creating index
Showing index
Altering index
Dropping index

Creating Index in Hive


Syntax for creating a compact index in Hive is as follows:
CREATE INDEX index_name
ON TABLE table_name (columns,....)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
Here, in place of index_name we can give any name of our choice, which will be the name of the index.
In the ON TABLE line, we give the table_name for which we are creating the index and, in parentheses, the
names of the columns to be indexed. We can specify only columns that exist in the table.
The 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' line specifies that the
built-in CompactIndexHandler will act on the created index, which means we are creating a compact
index for the table.
The WITH DEFERRED REBUILD clause creates the index empty; it is populated later with an
ALTER INDEX ... REBUILD statement.
This syntax creates an index for our table, but to complete the creation we need to rebuild the index
with one more ALTER statement, given below. A MapReduce job is launched, and when it finishes the index
creation is complete.
ALTER INDEX index_name ON table_name REBUILD;

This ALTER statement completes the creation of the index for the table.

Examples – Creating Index


In this section we first execute a Hive query on a non-indexed table and note the time
taken by the query to fetch the result.
In the second part, we run the same query on the indexed table and compare the
time taken with the earlier case.
We demonstrate this difference in time with practical examples.
In the first scenario we perform operations on a non-indexed table.
Let's first create a normal managed table to hold the olympic dataset.
create table olympic(
athelete STRING,age INT,country STRING,year STRING,
closing STRING,sport STRING,gold INT,silver INT,
bronze INT,total INT)
row format delimited
fields terminated by '\t'
stored as textfile;
Here we are creating a table named 'olympic'. The schema of the table is as specified, and the fields
in the input file are delimited by tabs.
At the end of the statement, we have specified 'stored as textfile', which means we are using the TEXTFILE format.
You can check the schema of the created table using the command 'describe olympic;'
We can load data into the created table as follows:
load data local inpath 'path of your file' into table olympic;

We have successfully loaded the input file data into the table which is in the TEXTFILE format.
Let’s perform an Average operation on this ‘olympic’ data. Let’s calculate the average age of the athletes
using the following command:

SELECT AVG(age) from olympic;

The query returns an average age of 26.405433646812956, and the time taken to perform
this operation is 21.08 seconds.
Now, let’s create the index for this table:

CREATE INDEX olympic_index
ON TABLE olympic (age)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX olympic_index ON olympic REBUILD;

Here we have created an index for the ‘olympic’ table on the age column. We can view the indexes created
for the table by using the below command:

show formatted index on olympic;

The output of this command lists the indexes available for the 'olympic' table.
Now, let’s perform the same Average operation on the same table.

We again get the average age as 26.405433646812956, which is the same as above, but now the
time taken to perform this operation is 17.26 seconds, which is less than in the earlier case.
In this run, using an index reduced the time taken to perform the query.

Can we have different indexes for the same table?
Yes! We can have any number of indexes for a particular table and any type of indexes as well.
Let’s create a Bitmap index for the same table:

CREATE INDEX olympic_index_bitmap
ON TABLE olympic (age)
AS 'BITMAP'
WITH DEFERRED REBUILD;

ALTER INDEX olympic_index_bitmap ON olympic REBUILD;

Here, AS 'BITMAP' defines the type of the index as BITMAP.

We have successfully created the Bitmap index for the table.


We can check the available indexes for the table using the below command:
show formatted index on olympic;

We can now see that we have two indexes available for our table.
Average Operation with Two Indexes
Now, let’s perform the same Average operation having the two indexes.

This time, we got the same result in 17.614 seconds, which is about the same as in the case of the compact
index alone.
Note: With different types (compact, bitmap) of indexes on the same columns of the same table, the
index which is created first is taken as the index for that table on the specified columns.
Now let’s delete one index using the following command:
DROP INDEX IF EXISTS olympic_index ON olympic;
We can check the available indexes on the table to verify whether the index is deleted or not.

We have successfully deleted one index i.e., olympic_index ,which is a compact index.
We now have only one index available for our table, which is a bitmap index.

Average Operation with Bitmap Index


Let’s perform the same Average age operation on the same table with bitmap index.

We got the average age as 26.405433646812956, which is the same as in the above cases, but the
operation was done in just 16.47 seconds, which is less than in the above two cases.
The above examples show the following:
• Indexes decrease the time for executing the query.
• We can have any number of indexes on the same table.
• We can choose the type of index depending on the data we have.
• In some cases, Bitmap indexes work faster than Compact indexes and vice versa.

When not to use indexing?


It is essential to know when and where indexing should not be used. Keep the following points in mind:
Build indexes only on the columns on which you frequently perform operations.
Building too many indexes also degrades the performance of your queries.
Identify the type of index before creating it (if your data requires a bitmap index, you
should not create a compact index). The wrong type increases the time taken to execute your query.

Functions in Hive
Hive supports two types of functions: built-in functions and user-defined functions.

Hive - Built-in Functions


Return Type | Signature | Description
BIGINT | floor(double a) | Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT | ceil(double a) | Returns the minimum BIGINT value that is equal to or greater than the double.
double | rand(), rand(int seed) | Returns a random number that changes from row to row.
string | concat(string A, string B, ...) | Returns the string resulting from concatenating B after A.
string | CONCAT_WS(string delimiter, string str1, str2, ...) | Concatenates the strings, separated by the delimiter. Accepts only strings and columns of type string.
string | substr(string A, int start) | Returns the substring of A starting from the start position till the end of string A.
string | substr(string A, int start, int length) | Returns the substring of A starting from the start position with the given length.
string | upper(string A) | Returns the string resulting from converting all characters of A to upper case.
string | ucase(string A) | Same as upper.
int | FIND_IN_SET(string search_string, string source_string_list) | Searches for search_string in source_string_list and returns the position of the first occurrence. source_string_list must be a comma-delimited list.
string | lower(string A) | Returns the string resulting from converting all characters of A to lower case.
string | lcase(string A) | Same as lower.
string | trim(string A) | Returns the string resulting from trimming spaces from both ends of A.
string | ltrim(string A) | Returns the string resulting from trimming spaces from the beginning (left hand side) of A.
string | rtrim(string A) | Returns the string resulting from trimming spaces from the end (right hand side) of A.
string | regexp_replace(string A, string B, string C) | Returns the string resulting from replacing all substrings in A that match the Java regular expression B with C.
int | size(Map<K,V>) | Returns the number of elements in the map type.
int | size(Array<T>) | Returns the number of elements in the array type.
value of <type> | cast(<expr> as <type>) | Converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. A NULL is returned if the conversion does not succeed.
string | from_unixtime(int unixtime) | Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
string | to_date(string timestamp) | Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01"
int | year(string date) | Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970
BIGINT | round(double a) | Returns the rounded BIGINT value of the double.
int | length(string A) | Returns the number of characters in the string.
int | month(string date) | Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11
int | day(string date) | Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
string | get_json_object(string json_string, string path) | Extracts a json object from a json string based on the json path specified, and returns the json string of the extracted json object. Returns NULL if the input json string is invalid.
string | LPAD(string str, int len, string pad) | Returns the string with a length of len characters, left-padded with pad.
string | RPAD(string str, int len, string pad) | Returns the string with a length of len characters, right-padded with pad.

Example
The following queries demonstrate some built-in functions:
round() function
hive> SELECT round(2.6) from temp;
On successful execution of query, you get to see the following response:
3.0

floor() function
hive> SELECT floor(2.6) from temp;
On successful execution of the query, you get to see the following response:
2.0

ceil() function
hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0

CONCAT_WS(string delimiter, string str1,str2……) function


hive> select CONCAT_WS('+',name,location) from Tri100;
rahul+Hyderabad
Mohit+Banglore
Rohan+Banglore
Ajay+Bangladesh
srujay+Srilanka

hive> select CONCAT_WS(' ',name,'from',location) from Tri100;


rahul from Hyderabad
Mohit from Banglore
Rohan from Banglore
Ajay from Bangladesh
srujay from Srilanka

FIND_IN_SET(string search_string ,string source_string_list)


hive> select FIND_IN_SET('ha','ho,hi,ha,bye') from Tri100 where sal=22000;
3

hive> select FIND_IN_SET('rahul',name) from Tri100;


1
0
0
0
0

LPAD(string str,int len,string pad)


hive> select LPAD(name,6,'#') from Tri100;
#rahul
#Mohit
#Rohan
##Ajay
srujay

hive> select LPAD('India',6,'#') from Tri100;


#India
#India
#India
#India
#India

RPAD(string str,int len,string pad)

hive> select RPAD(name,6,'#') from Tri100;
rahul#
Mohit#
Rohan#
Ajay##
srujay
hive> select RPAD('India',6,'#') from Tri100;
India#
India#
India#
India#
India#
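The padding rules of LPAD and RPAD can be mirrored in Python to check the expected results by hand. The functions below are an emulation written for this illustration, not Hive code; they assume a non-empty pad string and assume that a string longer than len is cut down to len characters.

```python
def lpad(s, length, pad):
    """Left-pad s with pad up to `length` characters (assumes pad is non-empty)."""
    if length <= len(s):
        return s[:length]  # a longer string is cut down to `length`
    fill = (pad * length)[: length - len(s)]
    return fill + s


def rpad(s, length, pad):
    """Right-pad s with pad up to `length` characters (assumes pad is non-empty)."""
    if length <= len(s):
        return s[:length]
    fill = (pad * length)[: length - len(s)]
    return s + fill


for name in ["rahul", "Mohit", "Rohan", "Ajay", "srujay"]:
    print(lpad(name, 6, "#"), rpad(name, 6, "#"))
```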

REPEAT and REVERSE
The REPEAT function repeats the string n times.
hive> select REPEAT(name,2) from Tri100;
rahulrahul
MohitMohit
RohanRohan
AjayAjay
srujaysrujay

The REVERSE function returns the reversed string.


hive> select REVERSE(name) from Tri100;
luhar
tihoM
nahoR
yajA
yajurs

SPACE :
SPACE function returns the specified number of spaces.
hive> select space(10),name from Tri100;
rahul
Mohit
Rohan
Ajay
srujay
SPLIT :
Syntax: split(string str, string pat)
The split function splits the string str around matches of the pattern pat and returns an array of strings.
hive> select split('hadoop:hive',':') from Tri100 where sal=22000;
["hadoop","hive"]

Format :
Syntax: “FORMAT_NUMBER(number X,int D)”
Formats the number X to a format like #,###,###.##, rounded to D decimal places, and returns the result as a
string. If D=0, the result has no decimal point and no fractional part.
hive> select name,format_number(Hike,2) from Tri100;
rahul 40,000.00
Mohit 25,000.00
Rohan 40,000.00
Ajay 45,000.00
srujay 30,000.00

hive> select name,Format_number(Hike,0) from Tri100;


rahul 40,000
Mohit 25,000
Rohan 40,000
Ajay 45,000
srujay 30,000

INSTR :
Syntax: instr(string str, string substr)
Returns the position of the first occurrence of substr in str. Returns null if either of the arguments is null and
returns 0 if substr could not be found in str. Be aware that this is not zero based. The first character in str has
index 1.
hive> select instr('rahul','ul') from Tri100 where sal=22000;
4

Locate :
Syntax: “Locate(string substring, string str[,int pos])”
Returns the position of the first occurrence of substr in str at or after position pos, or 0 if there is none.
hive> select locate('ul','rahul',2) from Tri100 where sal=22000;
4
hive> select locate('ul','rahul',5) from Tri100 where sal=22000;
0
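The one-based positions returned by instr and locate are easy to get wrong, so a small Python emulation can help to check them. These helpers are written only for this illustration and are built on str.find; NULL handling is not modelled.

```python
def instr(s, substr):
    """One-based position of the first occurrence of substr in s, or 0 if absent."""
    return s.find(substr) + 1  # str.find returns -1 when nothing is found


def locate(substr, s, pos=1):
    """One-based position of the first occurrence of substr in s at or after pos."""
    if pos < 1:
        return 0
    return s.find(substr, pos - 1) + 1


print(instr("rahul", "ul"))
print(locate("ul", "rahul", 2))
print(locate("ul", "rahul", 5))
```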

N-Grams :
Syntax: “ngrams(array<array<string>>, int N, int K, int pf)”
Returns the top-K N-grams from a set of tokenized sentences, such as those returned by the sentences() function. The pf argument is an optional precision factor.
hive> select ngrams(sentences(name),1,5) from Tri100;
[{"ngram":["Ajay"],"estfrequency":1.0},{"ngram":["Mohit"],"estfrequency":1.0},{"ngram":["Rohan"],"estfrequency":1.0},{"ngram":["rahul"],"estfrequency":1.0},{"ngram":["srujay"],"estfrequency":1.0}]

Parse URL :
Syntax: “parse_url(string urlString, string partToExtract [, string keyToExtract])”
Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF,
PROTOCOL, AUTHORITY, FILE, and USERINFO.
hive> select
parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1','HOST') from Tri100 where
sal=22000;
facebook.com
Printf :
Syntax: “printf(String format, Obj… args)”
Returns the input formatted according to printf-style format strings.
hive> select printf("color %s, number1 %d, float %f",'red',89,3.14) from Tri100
where sal=22000;
color red, number1 89, float 3.140000

Regexp_Extract :
Syntax: “regexp_extract(string subject, string pattern, int index)”
Returns the string extracted using the pattern.
hive> select regexp_extract('foothebar','foo(.*?)(bar)',2) from Tri100 where
sal=22000;
bar
hive> select regexp_extract('foothebar','foo(.*?)(bar)',1) from Tri100 where
sal=22000;
the
hive> select regexp_extract('foothebar','foo(.*?)(bar)',0) from Tri100 where
sal=22000;
foothebar
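
Because the pattern uses Java regular expression syntax, the three results above can be reproduced with java.util.regex. The sketch below is illustrative only: the class name is made up, and returning an empty string when nothing matches is a choice of this sketch rather than a documented Hive behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Shows how the index argument selects a capture group; not Hive code.
class RegexpExtractSketch {
    static String regexpExtract(String subject, String pattern, int index) {
        Matcher m = Pattern.compile(pattern).matcher(subject);
        // index 0 is the whole match, 1 and up are the parenthesised groups
        return m.find() ? m.group(index) : "";
    }

    public static void main(String[] args) {
        String p = "foo(.*?)(bar)";
        System.out.println(regexpExtract("foothebar", p, 0)); // foothebar
        System.out.println(regexpExtract("foothebar", p, 1)); // the
        System.out.println(regexpExtract("foothebar", p, 2)); // bar
    }
}
```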

Regexp_Replace :
Syntax: “regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)”
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular
expression defined in PATTERN with instances of REPLACEMENT.
hive> select regexp_replace('foothebar','oo|ba','') from Tri100 where sal=22000;
fther

Sentences :
Syntax: “sentences(string str, string lang, string locale)”
Tokenizes a string of natural language text into words and sentences, where each sentence is broken at the
appropriate sentence boundary and returned as an array of words. The ‘lang’ and ‘locale’ are optional
arguments.

hive> select sentences('hello there!, how are you!, what you doing?') from Tri100
where sal=22000;
[["hello","there"],["how","are","you"],["what","you","doing"]]

Str_to_map :
Syntax: “str_to_map(text[, delimiter1, delimiter2])”
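Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2
splits each K-V pair. Default delimiters are ‘,’ for delimiter1 and ‘:’ for delimiter2. The query below relies on the
defaults, and since the text contains neither ‘,’ nor ‘:’, each row becomes a single key with a NULL value.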
hive> select str_to_map(concat('Names=',name,'&','Hike=',Hike)) from Tri100;
{"Names=rahul&Hike=40000":null}
{"Names=Mohit&Hike=25000":null}
{"Names=Rohan&Hike=40000":null}
{"Names=Ajay&Hike=45000":null}
{"Names=srujay&Hike=30000":null}
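
The effect of the two delimiters is easier to see with '&' and '=' passed explicitly. The plain-Java sketch below mimics that splitting; the class name is hypothetical, and the delimiters are treated as literal text, which is a simplification of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrates the two-level split behind str_to_map; not Hive code.
class StrToMapSketch {
    static Map<String, String> strToMap(String text, String d1, String d2) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : text.split(Pattern.quote(d1))) {
            // Split each pair once; a pair without d2 gets a null value.
            String[] kv = pair.split(Pattern.quote(d2), 2);
            out.put(kv[0], kv.length == 2 ? kv[1] : null);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two entries: Names -> rahul and Hike -> 40000
        System.out.println(strToMap("Names=rahul&Hike=40000", "&", "="));
        // One entry whose key is the whole text and whose value is null
        System.out.println(strToMap("Names=rahul&Hike=40000", ",", ":"));
    }
}
```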

Translate :
Syntax: “translate(string|char|varchar input, string|char|varchar from, string|char|varchar to)”
Translates the input string by replacing the characters present in the from string with the corresponding
characters in the to string. If any of the parameters to this UDF are NULL, the result is NULL as well.
hive> select translate('hello','hello','hi') from Tri100 where sal=22000;
hi
hive> select translate('Make sure u knew that code','e','o') from Tri100 where
sal=22000;
Mako suro u know that codo
Aggregate Functions and Normal Queries:
Let's consider that the Tri100 table has the following data:
hive> select * from Tri100;
OK
1 rahul Hyderabad 30000 40000
2 Mohit Banglore 22000 25000
3 Rohan Banglore 33000 40000
4 Ajay Bangladesh 40000 45000
5 srujay Srilanka 25000 30000
Time taken: 0.184 seconds, Fetched: 5 row(s)

SUM
Returns the sum of the elements in the group or sum of the distinct values of the column in the group.
hive> select sum(sal) from Tri100;
OK
150000
Time taken: 17.909 seconds, Fetched: 1 row(s)
hive> select Sum(sal) from Tri100 where location='Banglore';
OK
55000
Time taken: 18.324 seconds, Fetched: 1 row(s)

Count
count(*) – Returns the total number of retrieved rows, including rows containing NULL values;
count(expr) – Returns the number of rows for which the supplied expression is non-NULL;
count(DISTINCT expr[, expr]) – Returns the number of rows for which the supplied expression(s) are unique
and non-NULL;
hive> select count(*) from Tri100;
OK
5
Time taken: 16.307 seconds, Fetched: 1 row(s)

hive> select count(distinct location) from Tri100;
OK
4
Time taken: 17.37 seconds, Fetched: 1 row(s)
hive> select count(*) from Tri100 where sal>30000;
OK
2
Time taken: 18.36 seconds, Fetched: 1 row(s)
hive> select count(location) from Tri100;
OK
5
Time taken: 17.338 seconds, Fetched: 1 row(s)

Average
Returns the average of the elements in the group or the average of the distinct values of the column in the group.
hive> select avg(sal) from Tri100 where location='Banglore';
OK
27500.0
Time taken: 17.276 seconds, Fetched: 1 row(s)
hive> select avg(distinct sal) from Tri100;
OK
30000.0
Time taken: 17.276 seconds, Fetched: 1 row(s)

Minimum
Returns the minimum of the column in the group.
hive> select min(sal) from Tri100;
OK
22000
Time taken: 17.368 seconds, Fetched: 1 row(s)

Maximum
Returns the maximum of the column in the group.
hive> select max(sal) from Tri100;
OK
40000
Time taken: 17.267 seconds, Fetched: 1 row(s)

Variance
Returns the variance of a numeric column in the group.
hive> select variance(sal) from Tri100;
OK
3.96E7
Time taken: 17.223 seconds, Fetched: 1 row(s)

hive> select var_pop(sal) from Tri100;


OK
3.96E7
Time taken: 17.195 seconds, Fetched: 1 row(s)
Returns the unbiased sample variance of a numeric column in the group.
hive> select var_samp(sal) from Tri100;
OK
4.95E7
Time taken: 17.245 seconds, Fetched: 1 row(s)

Standard Deviation
Returns the Standard Deviation of a numeric column in the group.
hive> select stddev_pop(sal) from Tri100;
OK
6292.8530890209095
Time taken: 18.63 seconds, Fetched: 1 row(s)
Returns the unbiased sample Standard Deviation of a numeric column in the group.
hive> select stddev_samp(sal) from Tri100;
OK
7035.623639735144
Time taken: 17.299 seconds, Fetched: 1 row(s)
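
The difference between the population and sample variants is only the divisor (n versus n - 1). The plain-Java sketch below recomputes the figures above from the five sal values; the class and method names are illustrative and not Hive code.

```java
// Recomputes var_pop, var_samp and the standard deviations; not Hive code.
class VarianceSketch {
    static double sumSquaredDeviations(double[] v) {
        double mean = 0;
        for (double x : v) mean += x;
        mean /= v.length;
        double sum = 0;
        for (double x : v) sum += (x - mean) * (x - mean);
        return sum;
    }

    // Population variance divides by n.
    static double varPop(double[] v) {
        return sumSquaredDeviations(v) / v.length;
    }

    // Sample variance divides by n - 1.
    static double varSamp(double[] v) {
        return sumSquaredDeviations(v) / (v.length - 1);
    }

    public static void main(String[] args) {
        double[] sal = {30000, 22000, 33000, 40000, 25000};
        System.out.println(varPop(sal));             // 3.96E7
        System.out.println(varSamp(sal));            // 4.95E7
        System.out.println(Math.sqrt(varPop(sal)));  // about 6292.85
        System.out.println(Math.sqrt(varSamp(sal))); // about 7035.62
    }
}
```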

Covariance
Returns the population covariance of a pair of numeric columns in the group.
hive> select covar_pop(sal,Hike) from Tri100;
OK
4.4E7
Time taken: 18.888 seconds, Fetched: 1 row(s)
Returns the sample covariance of a pair of numeric columns in the group.
hive> select covar_samp(sal,Hike) from Tri100;
OK
5.5E7
Time taken: 18.302 seconds, Fetched: 1 row(s)

Correlation
Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.
hive> select corr(sal,Hike) from Tri100;
OK
0.9514987095307504
Time taken: 17.514 seconds, Fetched: 1 row(s)

Percentile
Returns the exact pth percentile of a column in the group (does not work with floating point types). P must be
between 0 and 1. NOTE: A true percentile “percentile(BIGINT col, P)” can only be computed for integer
values. Use PERCENTILE_APPROX if your input is non-integral.
hive> select percentile(sal,0) from Tri100; -- P = 0 (0%) returns the lowest value in the column
OK
22000.0
Time taken: 17.321 seconds, Fetched: 1 row(s)
hive> select percentile(sal,1) from Tri100; -- P = 1 (100%) returns the highest value in the column
OK
40000.0
Time taken: 17.966 seconds, Fetched: 1 row(s)

hive> select percentile(sal,0.5) from Tri100;


OK
30000.0
Time taken: 17.368 seconds, Fetched: 1 row(s)
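
One way to picture an exact percentile is to sort the values, locate the position (n - 1) * P, and interpolate linearly when that position falls between two rows. The sketch below follows that rule in plain Java; the class name is illustrative, and the interpolation rule is stated as this sketch's assumption rather than taken from the text above.

```java
import java.util.Arrays;

// Exact percentile with linear interpolation between neighbours; not Hive code.
class PercentileSketch {
    static double percentile(long[] values, double p) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        double pos = (sorted.length - 1) * p; // fractional 0-based position
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        if (lo == hi) {
            return sorted[lo];
        }
        return (hi - pos) * sorted[lo] + (pos - lo) * sorted[hi];
    }

    public static void main(String[] args) {
        long[] sal = {30000, 22000, 33000, 40000, 25000};
        System.out.println(percentile(sal, 0));   // 22000.0
        System.out.println(percentile(sal, 0.5)); // 30000.0
        System.out.println(percentile(sal, 1));   // 40000.0
    }
}
```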

Histogram
Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an
array of size b of double-valued (x,y) coordinates that represent the bin centers and heights.
Syntax: “histogram_numeric(col, b)”

hive> select histogram_numeric(sal,5) from Tri100;
OK
[{"x":22000.0,"y":1.0},{"x":25000.0,"y":1.0},{"x":30000.0,"y":1.0},{"x":33000.0,"y":1.0},{"x":40000.0,"y":1.0}]
Time taken: 17.534 seconds, Fetched: 1 row(s)

Collections
Returns a set of objects with duplicate elements eliminated.
hive> select collect_set(Hike) from Tri100;
OK
[45000,40000,25000,30000]
Time taken: 18.29 seconds, Fetched: 1 row(s)
Returns a list of objects with duplicates (as of Hive 0.13.0).
hive> select collect_list(Hike) from Tri100;
OK
[40000,25000,40000,45000,30000]
Time taken: 17.217 seconds, Fetched: 1 row(s)

NTILE
This function divides an ordered partition into x groups called buckets and assigns a bucket number to each
row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common
summary statistics. (As of Hive 0.11.0.).

hive> select name,Hike,NTILE(3) over (order by sal DESC) from Tri100;


OK
Ajay 45000 1
Rohan 40000 1
rahul 40000 2
srujay 30000 2
Mohit 25000 3
Time taken: 17.217 seconds, Fetched: 5 row(s)
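
The bucket numbers above follow a simple rule: rows are split as evenly as possible, and when the row count does not divide evenly the earlier buckets take one extra row each. A minimal plain-Java sketch of that assignment, with an illustrative class name, could look like this:

```java
import java.util.Arrays;

// Assigns NTILE-style bucket numbers to n already ordered rows; not Hive code.
class NtileSketch {
    static int[] ntile(int n, int buckets) {
        int[] out = new int[n];
        int base = n / buckets;  // minimum rows per bucket
        int extra = n % buckets; // the first 'extra' buckets get one more row
        int row = 0;
        for (int b = 1; b <= buckets && row < n; b++) {
            int size = base + (b <= extra ? 1 : 0);
            for (int i = 0; i < size; i++) {
                out[row++] = b;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(ntile(5, 3))); // [1, 1, 2, 2, 3]
    }
}
```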

HIVE Date Functions


from_unixtime:
This function converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a STRING that
represents the TIMESTAMP of that moment in the current system time zone in the format of “1970-01-01
00:00:00”. The following example returns the current date including the time.
hive> SELECT FROM_UNIXTIME(UNIX_TIMESTAMP());
OK
2015-05-18 05:43:37
Time taken: 0.153 seconds, Fetched: 1 row(s)

from_utc_timestamp
This function assumes that the string in the first expression is UTC and then converts that string to the time
zone of the second expression. This function and the to_utc_timestamp function do timezone conversions. In
the following example, the first argument is a string.
hive> SELECT from_utc_timestamp('1970-01-01 07:00:00', 'JST');
OK
1970-01-01 16:00:00
Time taken: 0.148 seconds, Fetched: 1 row(s)

to_utc_timestamp:

This function assumes that the string in the first expression is in the timezone that is specified in the second
expression, and then converts the value to UTC format. This function and the from_utc_timestamp function
do timezone conversions.
hive> SELECT to_utc_timestamp ('1970-01-01 00:00:00','America/Denver');
OK
1970-01-01 07:00:00
Time taken: 0.153 seconds, Fetched: 1 row(s)

unix_timestamp :
This function parses the date string using the specified pattern and returns the number of seconds between the
specified date and the Unix epoch. If it fails, then it returns 0. The following example returns the value
1237487400.
hive> SELECT unix_timestamp ('2009-03-20', 'yyyy-MM-dd');
OK
1237487400
Time taken: 0.156 seconds, Fetched: 1 row(s)

unix_timestamp() :
This function returns the current time as the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) using the default
time zone. The example wraps the call in FROM_UNIXTIME so that the result is readable.
hive> select FROM_UNIXTIME( UNIX_TIMESTAMP() );
OK
2015-06-23 17:27:39
Time taken: 0.143 seconds, Fetched: 1 row(s)

unix_timestamp( string date ) :


This function converts the date in format ‘yyyy-MM-dd HH:mm:ss’ into a Unix timestamp. This will return the
number of seconds between the specified date and the Unix epoch. If it fails, then it returns 0.
hive> select UNIX_TIMESTAMP('2000-01-01 00:00:00');
OK
946665000
Time taken: 0.147 seconds, Fetched: 1 row(s)

unix_timestamp( string date, string pattern ) :


This function parses the date using the specified pattern and returns the number of seconds between the
specified date and the Unix epoch. If it fails, then it returns 0.
hive> select UNIX_TIMESTAMP('2000-01-01 10:20:30','yyyy-MM-dd');
OK
946665000
Time taken: 0.148 seconds, Fetched: 1 row(s)
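
The value 946665000 is midnight of 2000-01-01 in a zone that is 5 hours 30 minutes ahead of UTC, so the sketch below assumes the session ran in the Asia/Kolkata time zone. That zone, the class name and the method names are assumptions made for illustration; the code uses java.text.SimpleDateFormat and is not Hive's implementation.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Plain-Java illustration of unix_timestamp and from_unixtime; not Hive code.
class UnixTimestampSketch {
    // Assumption: the zone behind the sample outputs is Asia/Kolkata.
    static final String ZONE = "Asia/Kolkata";

    static long unixTimestamp(String date, String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern);
        f.setTimeZone(TimeZone.getTimeZone(ZONE));
        try {
            // parse() reads only as much text as the pattern needs
            return f.parse(date).getTime() / 1000;
        } catch (ParseException e) {
            return 0; // mirrors the "returns 0 on failure" rule described above
        }
    }

    static String fromUnixtime(long seconds) {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        f.setTimeZone(TimeZone.getTimeZone(ZONE));
        return f.format(new Date(seconds * 1000L));
    }

    public static void main(String[] args) {
        System.out.println(unixTimestamp("2000-01-01 10:20:30", "yyyy-MM-dd")); // 946665000
        System.out.println(fromUnixtime(946665000L)); // 2000-01-01 00:00:00
    }
}
```

With the pattern ‘yyyy-MM-dd’ the time part of the input is ignored, which is why '2000-01-01 10:20:30' and '2000-01-01 00:00:00' give the same number.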

from_unixtime( bigint number_of_seconds [, string format] ) :


The FROM_UNIXTIME function converts the specified number of seconds from the Unix epoch and returns the date in
the format ‘yyyy-MM-dd HH:mm:ss’, or in the optional format if one is given.
hive> SELECT FROM_UNIXTIME(UNIX_TIMESTAMP());
OK
2015-05-18 05:43:37
Time taken: 0.153 seconds, Fetched: 1 row(s)

To_Date( string timestamp ) :


The TO_DATE function returns the date part of the timestamp in the format ‘yyyy-MM-dd’.
hive> select TO_DATE('2000-01-01 10:20:30');
OK
2000-01-01
Time taken: 0.17 seconds, Fetched: 1 row(s)
YEAR( string date ) :
The YEAR function returns the year part of the date.
hive> select YEAR('2000-01-01 10:20:30');
OK
2000
Time taken: 0.144 seconds, Fetched: 1 row(s)

MONTH( string date ) :


The MONTH function returns the month part of the date.
hive> select MONTH('2000-01-01 10:20:30');
OK
1
Time taken: 0.144 seconds, Fetched: 1 row(s)

DAY( string date ), DAYOFMONTH( date ) :


The DAY or DAYOFMONTH function returns the day part of the date.
hive> SELECT DAY('2000-03-01 10:20:30');
OK
1
Time taken: 0.178 seconds, Fetched: 1 row(s)

HOUR( string date ) :


The HOUR function returns the hour part of the date.
hive> SELECT HOUR('2000-03-01 10:20:30');
OK
10
Time taken: 0.144 seconds, Fetched: 1 row(s)
MINUTE( string date ) :
The MINUTE function returns the minute part of the timestamp.
hive> SELECT MINUTE('2000-03-01 10:20:30');
OK
20
Time taken: 0.144 seconds, Fetched: 1 row(s)

SECOND( string date )


The SECOND function returns the second part of the timestamp.
hive> SELECT SECOND('2000-03-01 10:20:30');
OK
30
Time taken: 0.16 seconds, Fetched: 1 row(s)

WEEKOFYEAR( string date )


The WEEKOFYEAR function returns the week number of the date.
hive> SELECT WEEKOFYEAR('2000-03-01 10:20:30');
OK
9
Time taken: 0.144 seconds, Fetched: 1 row(s)

DATEDIFF( string date1, string date2 )


The DATEDIFF function returns the number of days between the two given dates.
hive> SELECT DATEDIFF('2000-03-01', '2000-01-10');
OK
51
Time taken: 0.156 seconds, Fetched: 1 row(s)

DATE_ADD( string date, int days )


The DATE_ADD function adds the given number of days to the specified date.
hive> SELECT DATE_ADD('2000-03-01', 5);
OK
2000-03-06
Time taken: 0.15 seconds, Fetched: 1 row(s)

DATE_SUB( string date, int days )


The DATE_SUB function subtracts the given number of days from the specified date.
hive> SELECT DATE_SUB('2000-03-01', 5);
OK
2000-02-25
Time taken: 0.15 seconds, Fetched: 1 row(s)
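
The three date arithmetic results above can be cross-checked with java.time. The sketch below is plain Java with illustrative names; note that 2000 is a leap year, which is why the day counts include 29 February.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Cross-checks DATEDIFF, DATE_ADD and DATE_SUB with java.time; not Hive code.
class DateArithmeticSketch {
    // Number of days from start to end (end minus start).
    static long dateDiff(String end, String start) {
        return ChronoUnit.DAYS.between(LocalDate.parse(start), LocalDate.parse(end));
    }

    static String dateAdd(String date, int days) {
        return LocalDate.parse(date).plusDays(days).toString();
    }

    static String dateSub(String date, int days) {
        return LocalDate.parse(date).minusDays(days).toString();
    }

    public static void main(String[] args) {
        System.out.println(dateDiff("2000-03-01", "2000-01-10")); // 51
        System.out.println(dateAdd("2000-03-01", 5));             // 2000-03-06
        System.out.println(dateSub("2000-03-01", 5));             // 2000-02-25
    }
}
```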

DATE CONVERSIONS :
Convert MMddyyyy Format to Unixtime
Note: M must always be capital in the MMddyyyy format, because lowercase m stands for minutes.
create table sample(rn int, dt string) row format delimited fields terminated by ',';
load data local inpath '/home/user/Desktop/sample.txt' into table sample;
select * from sample;
02111993
03121994
03131995
04141996
select cast(substring(from_unixtime(unix_timestamp(dt, 'MMddyyyy')),1,10) as date)
from sample;
OK
1993-02-11
1994-03-12
1995-03-13
1996-04-14
Time taken: 0.112 seconds, Fetched: 4 row(s)
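
The same pattern letters are used by Java's date formatters, so the conversion can be tried outside Hive. The sketch below uses java.time with illustrative names; the input 'Oct 16, 2014' in the second call is a made-up sample value, since the contents of data.txt are not shown in this document.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Parses a date string with a custom pattern and prints it as yyyy-MM-dd; not Hive code.
class DateConversionSketch {
    static String toIsoDate(String dt, String pattern) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH);
        return LocalDate.parse(dt, f).toString();
    }

    public static void main(String[] args) {
        System.out.println(toIsoDate("02111993", "MMddyyyy"));         // 1993-02-11
        System.out.println(toIsoDate("Oct 16, 2014", "MMM dd, yyyy")); // 2014-10-16
    }
}
```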

Convert MMM dd, yyyy Format to Unixtime


create table A (rn int, dt string) row format delimited fields terminated by ',';
load data local inpath '/home/user/Desktop/data.txt' into table A;
select cast(substring(from_unixtime(unix_timestamp(dt, 'MMM dd, yyyy')),1,10) as
date) from A;
OK
2014-10-16
2013-11-13
2012-09-14

Convert yyyy-MM-dd to Unix_timestamp


create table B (dt string);
LOAD DATA LOCAL INPATH '/home/user/Desktop/data.txt' INTO TABLE B;
SELECT UNIX_TIMESTAMP(dt,'yyyy-MM-dd') from B;
OK
946665000
981052200
1015093800

Hive User Defined Functions


Apache Hive comes with a lot of built-in functions, but what happens when you need a function that is
not provided by Hive? In this scenario you need to develop it on your own. Such a function is called a User
Defined Function (UDF).
Hive supports user defined functions developed in Java or Python.
User Defined Functions (UDFs) in Hive are used to plug our own logic, in the form of code, into Hive when
we are not able to get the desired result from Hive's built-in functions. We can invoke the UDFs from a Hive
query.
Based on the input it takes and the output it returns, there are 3 kinds of UDFs in Hive:
1. Regular UDF,
2. User Defined Aggregate Function (UDAF),
3. User Defined Table-Generating Function (UDTF).

1. Regular UDF:
UDFs work on a single row in a table and produce a single row as output. It's a one-to-one relationship
between the input and output of a function, e.g. Hive's built-in TRIM() function.
Hive allows us to define our own UDFs as well. Let's take an example of a student record.
Problem Statement: Find the maximum marks obtained by a student out of four subjects.
There are two different interfaces you can use for writing UDFs for Apache Hive. One is really simple, the
other… not so much.
 Simple API - org.apache.hadoop.hive.ql.exec.UDF
 Complex API - org.apache.hadoop.hive.ql.udf.generic.GenericUDF
The simple API (org.apache.hadoop.hive.ql.exec.UDF) can be used so long as your function reads
and returns primitive types. By this I mean basic Hadoop & Hive writable types - Text, IntWritable,
LongWritable, DoubleWritable, etc.
However, if you plan on writing a UDF that can manipulate embedded data structures, such as Map, List,
and Set, then you’re stuck using org.apache.hadoop.hive.ql.udf.generic.GenericUDF, which is
a little more involved.
I’m going to walk through an example of building a UDF in each interface. I will provide code and tests for
everything I do.
Simple                                            Generic
------------------------------------------------  ------------------------------------------------
Reduced performance due to use of reflection:     Optimal performance: no reflective call, and
each call of the evaluate method is reflective.   arguments are parsed lazily.
Furthermore, all arguments are evaluated and
parsed.

Limited handling of complex types. Arrays are     All complex parameters are supported (even
handled but suffer from type erasure              nested ones like array<array>).
limitations.

Variable number of arguments are not supported.   Variable number of arguments are supported.

Very easy to write.                               Not very difficult, but not well documented.

The Simple API


Building a UDF with the simpler UDF API involves little more than writing a class with one function (evaluate).
Here is an example:
The UDF below takes a Text value and returns “Hello ” followed by the input string.
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class SimpleUDFExample extends UDF
{
    public Text evaluate(Text input)
    {
        // A NULL column value arrives as null, so pass it through unchanged
        if (input == null) return null;
        return new Text("Hello " + input.toString());
    }
}

The Complex API

The org.apache.hadoop.hive.ql.udf.generic.GenericUDF API provides a way to write code for
objects that are not writable types, for example struct, map and array types.
This API requires you to manually manage object inspectors for the function arguments, and to verify the
number and types of the arguments you receive. An object inspector provides a consistent interface for
underlying object types so that different object implementations can all be accessed in a consistent way from
within Hive (e.g. you could implement a struct as a Map so long as you provide a corresponding object
inspector).
A generic UDF can be written by extending the GenericUDF class, in which we have to implement 3 methods:
 public ObjectInspector initialize(ObjectInspector[] args) throws
UDFArgumentException;
 public Object evaluate(DeferredObject[] args) throws HiveException;
 public String getDisplayString(String[] args);
A key concept when working with generic UDFs and UDAFs is the ObjectInspector.
In generic UDFs, all objects are passed around using the Object type. Hive is structured this way so that all
code handling records and cells is generic, and to avoid the costs of instantiating and deserializing objects
when it's not needed.
Therefore, all interaction with the data passed in to UDFs is done via ObjectInspectors. They allow you to
read values from a UDF parameter, and to write output values.
Object Inspectors belong to one of the following categories:
 Primitive, for primitive types (all numerical types, string, boolean, …)
 List, for Hive arrays
 Map, for Hive maps
 Struct, for Hive structs

You can pass multiple arguments to the UDF. Whatever arguments you pass to the UDF, they are not
presented to it as is. Rather, the initialize() method receives an array of ObjectInspector objects, one
ObjectInspector per argument, and the evaluate() method receives a matching array of DeferredObject
values. So arguments[0] in initialize() represents the inspector for the first argument you
passed to the UDF, arguments[1] represents the inspector for the 2nd argument, and so on.
ObjectInspectors are helpful for looking into the internal structure of an object.

getDisplayString()
The getDisplayString method is really helpful to the developer, since it can return meaningful
troubleshooting information. Instead of returning a general error message, Hive calls this method whenever
there is an error executing the UDF. The UDF developer can compile useful information here that can be
instrumental in troubleshooting the runtime error or exception. When a problem is detected while executing the
UDF, Hive throws a HiveException and appends the information returned by the getDisplayString method to
the exception. In the ComplexUDFExample class shown later in this section, the method simply returns the
fixed string "arrayContainsExample()".

Why Hive Provides The ObjectInspectors


As the name suggests, an ObjectInspector helps us to inspect the argument we are going to receive in the
UDF. Since Hive has a variety of data types and can go to a very complex level of custom data type
definition, Hive UDFs can be passed very basic data types (primitives like long, double, boolean) as well
as very complex data types (like an Array of Map of String key and Struct value, where the Struct contains
Name, Age, Salary and Location, i.e. ARRAY<MAP<STRING, STRUCT<STRING, FLOAT, INT, STRING>>>).
Since UDFs can be called on tables within the query, it is possible that columns with really complex data
types can be passed to UDFs.
It is because of this possible complexity of the data types that can be passed to a generic UDF (which is flexible
about types until runtime) that Hive passes an ObjectInspector instead of the object itself, since the UDF code
must understand the structure of the object and then process it. Similarly, the processed output can be equally
complex. Therefore, an ObjectInspector for the output value is required, which Hive will use when you
return the processed output.
ObjectInspectors are of great use within a generic UDF, and we access the values of the parameters passed
using them. There are ObjectInspectors for practically all types, and they are categorized among
PrimitiveObjectInspector, ListObjectInspector, MapObjectInspector and
StructObjectInspector.
All the specialized ObjectInspectors are derived from these four, e.g. LazyDoubleObjectInspector,
which helps us in dealing with a DoubleWritable data type, is actually extended from a class that implements
PrimitiveObjectInspector. An ObjectInspector of a complex object can return ObjectInspectors of
underlying objects, e.g. myArrayObjInsp.getListElementObjectInspector() returns an inspector
that can be typecast to a StandardMapObjectInspector, if the Array contains Map objects in the input
to the UDF.

initialize()
When a UDF is used in a query, Hive loads the UDF in memory. The initialize() method is called first,
when the UDF is invoked. The purpose of the call to this method is to check the types of the arguments that will be
passed to the UDF. For each value that will be passed to the UDF, the evaluate() method will be called. So if
there are 10 rows for which the UDF is going to be called, evaluate() will be called 10 times. However, Hive
first calls the initialize() method of the generic UDF before any call to evaluate(). The goals for
initialize() are to
 validate the input arguments and complain if the input is not as expected
 save the Object Inspectors of input arguments for later use during evaluate()
 provide an Object Inspector to Hive for the return type
You can validate the input in various ways, like checking the arguments array for size, checking the category of the input
type (remember PrimitiveObjectInspector, MapObjectInspector etc.?), or checking the size of
underlying objects (in case of a Map or Struct etc.). Validation can go to any extent that you choose,
including traversing the entire object hierarchy and validating every object. When the validation fails, we can
throw a UDFArgumentException or one of its subtypes to indicate the error.
The Object Inspector for the return type should be constructed within the initialize() method and
returned. We can use the factory methods of the ObjectInspectorFactory class. For example, if the UDF is
going to return a MAP type, then we can use the getStandardMapObjectInspector() method, which
accepts information about how the Map will be constructed (e.g. the key type and the value type of the
Map).
The saved Object Inspectors are instrumental when we try to obtain the input value in the evaluate()
method.
evaluate()
SELECT GenericDouble(bonus) FROM emp;
Suppose the emp table has 10 rows in it. Then the evaluate() method will be called 10 times, once for the column
value in each of the 10 rows. All the values passed to evaluate(), however, are serialized bytes. Hive delays the instantiation
of objects until a request for the object is made, hence the name DeferredObject. Based on what type of
value was passed to the UDF, the DeferredObject could represent lazily initialized objects. In the above
example, it could be an instance of the LazyDouble class. When the value is requested, as in
LazyDouble.getWritableObject(), the bytes are deserialized into an object and returned.
However, if the same GenericUDF is called with a value provided at the command line (instead of as a result of
IO), it could be a DoubleWritable object in the first place and doesn’t need deserialization. Based on the
type of object we get in the input, we need to use its data accordingly and process it.
Finally, based on the type of input we received, we want to return the same type of output, since we just
doubled the input and returned it. The convertIfNecessary() method helps us in this and turns the output
type into the same type as the input, based on the Object Inspector we pass to it.
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF
{
    ListObjectInspector listOI;
    StringObjectInspector elementOI;

    @Override
    public String getDisplayString(String[] arg0) {
        return "arrayContainsExample()"; // this should probably be better
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException
    {
        if (arguments.length != 2)
        {
            throw new UDFArgumentLengthException(
                "arrayContainsExample only takes 2 arguments: List<T>, T");
        }
        // 1. Check we received the right object types.
        ObjectInspector a = arguments[0];
        ObjectInspector b = arguments[1];
        if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector))
        {
            throw new UDFArgumentException(
                "first argument must be a list / array, second argument must be a string");
        }
        this.listOI = (ListObjectInspector) a;
        this.elementOI = (StringObjectInspector) b;

        // 2. Check that the list contains strings
        if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector))
        {
            throw new UDFArgumentException("first argument must be a list of strings");
        }

        // the return type of our function is a boolean, so we provide the correct object inspector
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException
    {
        // get the list and string from the deferred objects using the object inspectors
        List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
        String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());

        // check for nulls
        if (list == null || arg == null)
        {
            return null;
        }

        // see if our list contains the value we need
        for (String s : list)
        {
            if (arg.equals(s)) return Boolean.TRUE;
        }
        return Boolean.FALSE;
    }
}

User Defined Aggregate Function (UDAF)

User-Defined Aggregation Functions (UDAFs) are an exceptional way to integrate advanced data-processing
into Hive. Aggregate functions perform a calculation on a set of values and return a single value.

An aggregate function is more difficult to write than a regular UDF. Values are aggregated in chunks
(potentially across many tasks), so the implementation has to be capable of combining partial aggregations
into a final result.

We will start our discussion with the source code given below, which finds the largest integer in
the input file.

To use it, we need to build a jar file from the source code below and then use that jar file while
executing the Hive scripts shown in the upcoming section.

UDAF to find the largest Integer in the table.


package com.hive.udaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

public class Max extends UDAF
{
    public static class MaxIntUDAFEvaluator implements UDAFEvaluator
    {
        private IntWritable output;

        public void init()
        {
            output = null;
        }

        public boolean iterate(IntWritable maxvalue) // Process input table
        {
            if (maxvalue == null)
            {
                return true;
            }
            if (output == null)
            {
                output = new IntWritable(maxvalue.get());
            }
            else
            {
                output.set(Math.max(output.get(), maxvalue.get()));
            }
            return true;
        }

        public IntWritable terminatePartial()
        {
            return output;
        }

        public boolean merge(IntWritable other)
        {
            return iterate(other);
        }

        public IntWritable terminate() // final result
        {
            return output;
        }
    }
}
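
To see how the partial aggregations are combined, the evaluator's five methods can be imitated without any Hive classes. The sketch below is a stand-alone simulation: Integer stands in for IntWritable, each inner array plays the role of the rows seen by one task, and all names are illustrative.

```java
// Stand-alone imitation of the init/iterate/terminatePartial/merge/terminate cycle; not Hive code.
class MaxEvaluatorSketch {
    private Integer output; // stays null until a non-null value is seen

    void init() { output = null; }

    boolean iterate(Integer value) {
        if (value != null) {
            output = (output == null) ? value : Math.max(output, value);
        }
        return true;
    }

    Integer terminatePartial() { return output; }

    // Merging a partial maximum is the same operation as seeing one more value.
    boolean merge(Integer other) { return iterate(other); }

    Integer terminate() { return output; }

    // Each chunk is aggregated on its own, then the partial results are merged.
    static Integer maxOverChunks(Integer[][] chunks) {
        MaxEvaluatorSketch finalEval = new MaxEvaluatorSketch();
        finalEval.init();
        for (Integer[] chunk : chunks) {
            MaxEvaluatorSketch partial = new MaxEvaluatorSketch();
            partial.init();
            for (Integer v : chunk) {
                partial.iterate(v);
            }
            finalEval.merge(partial.terminatePartial());
        }
        return finalEval.terminate();
    }

    public static void main(String[] args) {
        Integer[][] chunks = { {3, 99, null}, {42}, {} };
        System.out.println(maxOverChunks(chunks)); // 99
    }
}
```

An empty chunk yields a null partial result, which merge() skips in the same way that iterate() skips NULL rows.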
Let’s now see the steps for UDAF execution.

Creating a new Input Dataset


We need an input dataset to execute the above example. The dataset that will be used for demonstration is
Numbers_List. It has one column, which contains a list of integer values.

Create a new table and load the input dataset


We create a new table Num_list with only one field (column), Num.
Next, we load the contents of the input dataset Numbers_List into the table Num_list.

Display the contents of table Num_list to check whether the input file has been loaded successfully

By using a select statement we can see whether the contents of the dataset Numbers_List have been
loaded into the table Num_list.

Add the jar file in Hive with its complete path (the jar file made from the source code needs to be added)

Here the jar file added to Hive is h-udaf.jar.

Create a temporary function


The reason for creating a function is that calling a function inside Hive is much easier than using the jar
multiple times during analysis.
Let us create a temporary function max for the newly created UDAF.

Use the select statement to find the largest number from the table Num_list
After successfully following the above steps, we can use the select statement to find the
largest number in the table.
In this example, the largest number in the table Num_list is 99.

User Defined Table-Generating Function (UDTF)


A user defined table-generating function works on one row as input and returns multiple rows as output. So here the
relation is one to many, e.g. Hive's built-in EXPLODE() function.
Differences Between UDF, UDAF and UDTF:
UDF:
 A UDF is a user-defined function that takes a single input value and produces a single output value. When
used in a query, it is called once for each row in the result set.
 Example:
 input.toString().toUpperCase();
 input.toString().toLowerCase();
 The above methods convert a string to uppercase and to lowercase, respectively.

UDAF:
 A UDAF is a user-defined aggregate function that accepts a group of values and returns a single
value. Users can implement UDAFs to summarize and condense sets of rows in the same style as the built-in
COUNT(), MAX(), SUM(), and AVG() functions.
UDTF:
 A UDTF is a User Defined Table-Generating Function that operates on a single row and produces multiple
rows, that is, a table, as output.

Problem Statement:
Let’s look at how a Hive UDTF works with the help of the example below. Here, the input holds a unique key
followed by one or multiple entries, and we will produce one value per row for that key.
Data Set Used as Input in the Example:

You can refer to the below screenshot to see what the expected output will be.

Data Set Description:
1. Unique id of a local resident.
2. First phone number of that resident.
3. Second phone number of that resident.
Source Code:
We can create a custom Hive UDTF by extending the GenericUDTF abstract class and then implementing
the initialize, process, and possibly close methods.

package com.Myhiveudtf;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class Myudtf extends GenericUDTF
{
    private PrimitiveObjectInspector stringOI = null;

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException
    {
        if (args.length != 1)
        {
            throw new UDFArgumentException("NameParserGenericUDTF() takes exactly one argument");
        }

        // Reject any argument that is not a primitive string. The || short-circuits,
        // so the cast is only evaluated when the argument is known to be primitive.
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
                || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory()
                        != PrimitiveObjectInspector.PrimitiveCategory.STRING)
        {
            throw new UDFArgumentException("NameParserGenericUDTF() takes a string as a parameter");
        }

        // input inspectors
        stringOI = (PrimitiveObjectInspector) args[0];
        // output inspectors -- a struct with two fields: id and phone_number
        List<String> fieldNames = new ArrayList<String>(2);
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
        fieldNames.add("id");
        fieldNames.add("phone_number");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    public ArrayList<Object[]> processInputRecord(String id)
    {
        ArrayList<Object[]> result = new ArrayList<Object[]>();
        // ignoring null or empty input
        if (id == null || id.isEmpty())
        {
            return result;
        }

        // tokens[0] is the id; the remaining tokens are phone numbers
        String[] tokens = id.split("\\s+");

        if (tokens.length == 2)
        {
            result.add(new Object[] { tokens[0], tokens[1] });
        }
        else if (tokens.length == 3)
        {
            result.add(new Object[] { tokens[0], tokens[1] });
            result.add(new Object[] { tokens[0], tokens[2] });
        }
        return result;
    }

Initialize()
Hive calls the initialize() method to tell the UDTF which argument types to expect. The UDTF must
then return an object inspector corresponding to the row objects that it will generate.

    @Override
    public void process(Object[] record) throws HiveException
    {
        // Guard against NULL column values before calling toString()
        final Object value = stringOI.getPrimitiveJavaObject(record[0]);
        final String id = (value == null) ? null : value.toString();
        ArrayList<Object[]> results = processInputRecord(id);
        Iterator<Object[]> it = results.iterator();
        while (it.hasNext())
        {
            Object[] r = it.next();
            forward(r);
        }
    }

Process()
Once the initialize() method has been called, Hive passes rows to the UDTF through the process() method.
Inside process(), the UDTF can produce rows and forward them to other operators by calling the
forward() method.
    @Override
    public void close() throws HiveException
    {
        // do nothing
    }
}

Close()
Finally, Hive calls the close() method once all the rows have been passed to the UDTF. This method
allows for any cleanup that is necessary before returning from the User Defined Table Generating
Function. The GenericUDTF contract also permits additional forward() calls from close(), although this
example emits all of its rows from process().
In our example, there is no data that needs to be cleaned up, so close() is left empty.
We can now execute the example program.
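
Before deploying to a cluster, the splitting logic can be dry-run without any Hive classes. The sketch below is a stand-alone copy of the logic in processInputRecord, under the assumption that each input line is "id phone1 [phone2]" separated by whitespace. The class name PhoneRowSplitter and the sample rows are hypothetical; they do not come from the original data set.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone dry run of the splitting logic; no Hive dependencies are required.
final class PhoneRowSplitter {

    // Mirrors processInputRecord: returns one {id, phone} pair per phone number.
    // Lines with fewer than 2 or more than 3 tokens produce no rows.
    static List<String[]> split(String line) {
        List<String[]> rows = new ArrayList<>();
        if (line == null || line.isEmpty()) {
            return rows; // null or empty input is ignored
        }
        String[] tokens = line.split("\\s+");
        if (tokens.length == 2) {
            rows.add(new String[] { tokens[0], tokens[1] });
        } else if (tokens.length == 3) {
            rows.add(new String[] { tokens[0], tokens[1] });
            rows.add(new String[] { tokens[0], tokens[2] });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, for illustration only.
        String[] sample = { "101 9000000001 9000000002", "102 9000000003" };
        for (String line : sample) {
            for (String[] row : split(line)) {
                System.out.println(row[0] + "\t" + row[1]);
            }
        }
    }
}
```

With these sample rows, the first line yields two output rows and the second yields one, which is the one-to-many behaviour the UDTF is meant to show inside Hive.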
Steps for Executing Hive UDTF:
Step 1: After writing the above code in Eclipse, add the jar files mentioned below to the project and then
export the program to the Hadoop environment as a jar file.

Step 2: Create a table named ‘phone’ with a single column named ‘id’.

Step 3: Load the contents of the input data set phn_num into the table phone.

Step 4: Check whether the data has been loaded, using a SELECT statement.

Step 5: Add the jar file, using the complete path of the jar created above.

Step 6: Create a temporary function as shown below.

Step 7: Use a SELECT statement with the new function to split each string in the table into rows keyed by its id.

From the above screenshot, we can see that a single column holding multiple values has been expanded into
one row per value, each paired with its id.
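
Since the screenshots for Steps 2 to 7 are not reproduced here, the following sketch assembles statements of the kind those steps would run, as Java strings that could be pasted into the Hive shell or sent through any Hive client. It rests on several assumptions: the jar path, the data file path, and the function name are hypothetical placeholders supplied by the caller; the column type STRING and the use of LOAD DATA LOCAL INPATH are guesses consistent with the UDTF above, not values taken from the original. Only the table name phone, the column name id, and the class name com.Myhiveudtf.Myudtf come from the text.

```java
import java.util.Arrays;
import java.util.List;

// Builds the HiveQL for Steps 2-7 as plain strings; it does not connect to Hive.
final class UdtfSessionScript {

    // jarPath, dataPath and functionName are caller-supplied placeholders (assumptions).
    static List<String> statements(String jarPath, String dataPath, String functionName) {
        return Arrays.asList(
            // Step 2: single-column table (STRING type is an assumption)
            "CREATE TABLE phone (id STRING)",
            // Step 3: load the input file (LOCAL is an assumption)
            "LOAD DATA LOCAL INPATH '" + dataPath + "' INTO TABLE phone",
            // Step 4: check that the data was loaded
            "SELECT * FROM phone",
            // Step 5: register the exported jar
            "ADD JAR " + jarPath,
            // Step 6: bind a temporary function name to the UDTF class
            "CREATE TEMPORARY FUNCTION " + functionName + " AS 'com.Myhiveudtf.Myudtf'",
            // Step 7: expand each input row into (id, phone_number) rows
            "SELECT " + functionName + "(id) AS (id, phone_number) FROM phone");
    }

    public static void main(String[] args) {
        // Hypothetical paths and function name, for illustration only.
        for (String s : statements("/path/to/udtf.jar", "/path/to/phn_num", "my_udtf")) {
            System.out.println(s + ";");
        }
    }
}
```

Keeping the statements in one list makes the order of the steps explicit: the jar must be added before the temporary function is created, and the function must exist before the final SELECT runs.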
