Apache HIVE
What Is It?
Apache Hive is an open source data warehouse system built on top of Hadoop. It is used for querying and
analyzing large datasets stored in Hadoop files.
Hive provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL) for
querying data stored in a Hadoop cluster.
SQL knowledge is widespread for a reason; it is an effective, reasonably intuitive model for organizing and
using data. Mapping these familiar data operations to the low-level MapReduce Java API can be daunting,
even for experienced Java developers. Hive does this dirty work for you, so you can focus on the query
itself. Hive translates most queries to MapReduce jobs, thereby exploiting the scalability of Hadoop, while
presenting a familiar SQL abstraction.
Hive is most suited for data warehouse applications, where relatively static data is analyzed, fast
response times are not required, and when the data is not changing rapidly.
A little history: why do we need it?
As the size of data on the internet grew beyond petabytes and exabytes, the entire IT industry started
facing problems in processing such huge amounts of data, which led to the birth of the term Big Data.
This data includes structured, unstructured and semi-structured data coming from various data sources
such as databases, servers, sensors, etc.
In 2005 Doug Cutting and Mike Cafarella created Hadoop, a distributed processing framework that
uses MapReduce to process huge amounts of data, to support distribution for the Nutch search engine
project. Hadoop was then donated to Apache and is now an Apache project sponsored by the Apache
Software Foundation.
However, a challenge remains; how do you move an existing data infrastructure to Hadoop, when that
infrastructure is based on traditional relational databases and the Structured Query Language (SQL)?
What about the large base of SQL users, both expert database designers and administrators, as well as
casual users who use SQL to extract information from their data warehouses?
This is where Hive comes in. Hive was developed at Facebook and later donated to Apache; it is now an
Apache project sponsored by the Apache Software Foundation.
Characteristics of Hive
In Hive, tables and databases are created first and then data is loaded into these tables.
Hive, as a data warehouse, is designed for managing and querying only structured data that is stored in tables.
While dealing with structured data, MapReduce does not have optimization and usability features such as
UDFs, but the Hive framework does. Query optimization refers to an effective way of executing a query in
terms of performance.
Hive's SQL-inspired language separates the user from the complexity of Map Reduce programming. It
reuses familiar concepts from the relational database world, such as tables, rows, columns and schema,
etc. for ease of learning.
Hadoop's programming works on flat files. So, Hive can use directory structures to "partition" data to
improve performance on certain queries.
An important component of Hive is the Metastore, which is used for storing schema information. This
Metastore typically resides in a relational database.
HIVE vs RDBMS
Hive enforces schema on read: it does not verify the data when it is loaded, but rather when a query is
issued. This is called schema on read. In an RDBMS, a table's schema is enforced at data load time; if the
data being loaded does not conform to the schema, it is rejected. This design is called schema on write.
Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and
serialized to disk in the database's internal format. The load operation is just a file copy or move.
Hive is based on the notion of write once, read many times, whereas an RDBMS is designed for read and
write many times.
Hive does not provide support for record-level updates, insertions and deletes, as it stores data in HDFS
and HDFS does not allow changing the contents of the files it holds. In an RDBMS, record-level updates,
insertions and deletes, transactions and indexes are all possible.
Hive can process hundreds of petabytes of data very easily, whereas in an RDBMS the maximum data size
allowed is typically in the tens of terabytes.
As Hadoop is a batch-oriented system, Hive does not support OLTP (Online Transaction Processing); it is
closer to OLAP (Online Analytical Processing), but not ideal, since there is significant latency between
issuing a query and receiving a reply, due to the overhead of MapReduce jobs and the size of the data sets
Hadoop was designed to serve. An RDBMS is best suited for dynamic data analysis where fast responses
are expected, while Hive is suited for data warehouse applications, where relatively static data is analyzed,
fast response times are not required, and the data is not changing rapidly.
Limitation of Hive
Hive is not a full database. So it cannot replace SQL completely.
The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do. The
biggest limitation is that Hive does not provide record-level update, insert, nor delete. You can generate
new tables from queries or output query results to files.
Also, because Hadoop is a batch-oriented system, Hive queries have higher latency, due to the start-up
overhead for MapReduce jobs. Queries that would finish in seconds for a traditional relational database
take longer for Hive, even for relatively small data sets.
Finally, Hive does not provide transactions. So, Hive doesn’t provide crucial features required for OLTP,
Online Transaction Processing. It’s closer to being an OLAP tool, Online Analytic Processing, but as we’ll
see, Hive isn’t ideal for satisfying the “online” part of OLAP, at least today, since there can be significant
latency between issuing a query and receiving a reply, both due to the overhead of Hadoop and due to the
size of the data sets Hadoop was designed to serve.
If you need OLTP features for large-scale data, you should consider using a NoSQL database. Examples
include HBase, a NoSQL database integrated with Hadoop.
Limited number of built-in functions.
Not all standard SQL is supported.
When to use hive
If you have large (think terabytes/petabytes) datasets to query: Hive is designed specifically for
analytics on large datasets and works well for a range of complex queries. Hive is the most approachable
way to (relatively) quickly query and inspect datasets already stored in Hadoop.
If extensibility is important: Hive has a range of user-defined function APIs that can be used to build
custom behavior into the query engine.
Hive Architecture
Hive Clients – Apache Hive supports applications written in languages like C++, Java, Python, etc. using
JDBC, Thrift and ODBC drivers. Thus, one can easily write a Hive client application in a language of
their choice.
Hive Services – Hive provides various services like web Interface, CLI etc. to perform queries.
Processing framework and Resource Management – Hive internally uses Hadoop MapReduce
framework to execute the queries.
Distributed Storage – As seen above that Hive is built on the top of Hadoop, so it uses the underlying
HDFS for the distributed storage.
Hive Clients
Hive provides different drivers for communication with different types of applications, and thus supports
different types of client applications for performing queries. These clients are categorized into 3 types:
Thrift Clients – As the Apache Hive server is based on Thrift, it can serve requests from all those
languages that support Thrift, so a Thrift client can be used for communication.
JDBC Clients – Apache Hive allows Java applications to connect to it using the JDBC driver. It is defined in the
class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC Clients – The ODBC driver allows applications that support the ODBC protocol to connect to Hive. Like
the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
Hive Services
Client interaction with Hive is performed through the Hive services: if a client wants to perform any
query-related operation on Hive, it has to communicate through the Hive services. Hive clients
communicate with the Hive server, and the Hive server communicates with the main driver. The driver
processes the requests coming from different applications, consulting the metastore and the file system
for further processing.
Apache Hive provides the following services; let us look at each in detail:
a) CLI(Command Line Interface) – This is the default shell that Hive provides, in which you can execute
your Hive queries and command directly.
b) Web Interface – Hive also provides web based GUI for executing Hive queries and commands.
c) Hive Server – It is built on Apache Thrift and thus is also called as Thrift server. It allows different clients
to submit requests to Hive and retrieve the final result.
d) Hive Driver – The driver is responsible for receiving the queries submitted through the Thrift, JDBC, ODBC, CLI, or
Web UI interface by a Hive client. The Hive driver contains the following components:
I. Compiler – The driver passes the query to the compiler, where parsing, type checking, and
semantic analysis take place with the help of the schema present in the metastore.
II. Optimizer – It generates the optimized logical plan in the form of a DAG (Directed Acyclic Graph) of
MapReduce and HDFS tasks.
III. Executor – Once compilation and optimization are complete, the execution engine executes these tasks in the
order of their dependencies using Hadoop. Hive supports 3 types of execution engine, i.e.
MapReduce, Tez and Spark. Only one execution engine can be set at a time. The execution engine can be
set using the hive.execution.engine parameter in the hive-site.xml file (see the example after this list).
e) Metastore – Metastore is the central repository of Apache Hive metadata in the Hive Architecture. It
stores metadata for Hive tables (like their schema and location) and partitions in a relational database. It
provides client access to this information by using the metastore service API. The default metastore for
Hive is Derby, but it can be reconfigured to use MySQL. The Hive metastore consists of two fundamental units:
A service that provides metastore access to other Apache Hive services.
Disk storage for the Hive metadata which is separate from HDFS storage.
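The execution engine for the current session can also be switched from the Hive CLI with a SET statement. A
minimal sketch; mr, tez and spark are the standard values, but whether Tez or Spark can actually be used
depends on what is installed on the cluster, which is an assumption here:
-- Show the currently configured execution engine
SET hive.execution.engine;
-- Switch the current session to Tez (or spark, or mr)
SET hive.execution.engine=tez;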
Step1 : Execute Query -The Hive interface such as Command Line or Web UI sends query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.
Step 2: Get Plan - The driver takes the help of the query compiler, which parses the query to check the
syntax and build the query plan, based on the requirements of the query.
Step3 : Get Metadata -The compiler sends metadata request to Metastore (any database).
Step4: Send Metadata - Metastore sends metadata as a response to the compiler.
Step 5: Send Plan - The compiler checks the requirement and resends the plan to the driver. Up to here,
the parsing and compiling of a query is complete.
Step 6: Execute Plan - The driver sends the execute plan to the execution engine.
Step 7: Execute Job - Internally, the execution of the plan is a MapReduce job. The execution engine
sends the job to the JobTracker, which is on the Name node, and it assigns this job to TaskTrackers, which
are on Data nodes. Here, the query executes as a MapReduce job.
Step7.1: Metadata Ops - Meanwhile in execution, the execution engine can execute metadata operations
with Metastore.
Step 8: Fetch Result - The execution engine receives the results from Data nodes.
Step 9: Send Results - The execution engine sends those resultant values to the driver.
Step10: Send Results- The driver sends the results to Hive Interfaces.
More on Step 7
The execution engine in turn communicates with Hadoop daemons such as the NameNode, DataNodes and
JobTracker to execute the query on top of the Hadoop file system.
The execution engine first contacts the NameNode to get the location of the desired tables' data residing in
the DataNodes (i.e. metadata info).
The actual data is stored in the DataNodes only, so the execution engine fetches the actual data from the
DataNodes.
At the same time, the execution engine communicates bidirectionally with the metastore present in Hive to
perform DDL operations.
The metastore stores only information such as database names, table names, column names, column
properties and table properties.
Different modes of Hive
Hive can operate in two modes depending on the size of data nodes in Hadoop. These modes are,
1. Local mode
2. Map reduce mode
When to use Local mode:
If Hadoop is installed in pseudo-distributed mode with a single data node, we use Hive in this mode.
If the data size is small enough to be limited to a single local machine, we can use this mode.
Processing will be very fast on smaller data sets present in the local machine.
When to use Map reduce mode:
If Hadoop has multiple data nodes and the data is distributed across different nodes, we use Hive in this
mode.
It will perform well on large data sets, and queries are executed in a parallel way.
Processing of large data sets with better performance can be achieved through this mode.
In Hive, we can set a property to control which mode Hive works in. By default, it works in MapReduce
mode; to use local mode you can apply the following setting (see the example below):
To make Hive work in local mode, set SET mapred.job.tracker=local;
From Hive version 0.7 onward, Hive also supports a mode that runs MapReduce jobs in local mode automatically.
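A minimal sketch of the relevant settings from the Hive CLI; hive.exec.mode.local.auto is the property behind
the automatic local mode mentioned above (available since Hive 0.7), and the job-size thresholds that trigger
it depend on your installation, which is an assumption here:
-- Force jobs in the current session to run locally instead of on the cluster
SET mapred.job.tracker=local;
-- Let Hive decide automatically to run small jobs in local mode (Hive 0.7 and later)
SET hive.exec.mode.local.auto=true;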
Special Points
As a part of metadata management, Hive stores information about tables, columns in tables, schemas and
partition information in a structured format in a relational database.
The default metadata store (i.e. metastore) in Hive is Derby, which is itself a database. You can change the
configuration to save all your Hive metadata information into any JDBC-supported database.
The most popular databases for storing the metadata are MySQL and PostgreSQL.
The purpose of storing this metadata in a relational database is performance: if we stored this information
at the file level, the performance of Hive would suffer, since loading files takes more time than reading from
relational storage. That is why this information is stored in a separate database.
Schema-on-Read vs Schema-on-Write
Schema on Write
In a traditional database, before any data is written into a database table, the structure of that data is
strictly defined during table creation and the metadata of the table is stored and tracked. That metadata is
called the schema. When data is inserted into the table, the structure of the data is strictly checked against
the schema of the table; data types, lengths and positions are all delineated, and if the structure of the data
is found inconsistent with the structure of the table, the data is rejected. This process of checking the
structure (or schema) of data against the schema of the table during a write operation is called schema on write.
In schema on write, the speed of query processing and the structure of data matter more than the time
required for loading data.
Embedded Metastore
In Hive, by default, the metastore service runs in the same JVM as the Hive service. In this mode it uses an
embedded Derby database stored on the local file system. Thus both the metastore service and the Hive service
run in the same JVM using the embedded Derby database. But this mode also has a limitation: as only one
embedded Derby database can access the database files on disk at any one time, only one Hive session
can be open at a time.
Local Metastore
Hive is a data-warehousing framework, so a single session is rarely enough. To overcome this limitation
of the embedded metastore, the local metastore was introduced. This mode allows us to have many Hive
sessions, i.e. many users can use the metastore at the same time. We achieve this by using any JDBC-compliant
database like MySQL, which runs in a separate JVM or on a different machine than the Hive service and
metastore service, which still run together in the same JVM.
This configuration is called a local metastore because the metastore service still runs in the same process as
Hive, but it connects to a database running in a separate process, either on the same machine or on a remote
machine. Before starting the Apache Hive client, add the JDBC/ODBC driver libraries to the Hive lib folder.
MySQL is a popular choice for the standalone metastore. In this case, the javax.jdo.option.ConnectionURL
property is set to jdbc:mysql://host/dbname? createDatabaseIfNotExist=true, and
javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. The JDBC driver JAR file for
MySQL (Connector/J) must be on Hive’s classpath, which is achieved by placing it in Hive’s lib directory.
Remote Metastore
Moving further, there is another metastore configuration called the remote metastore. In this mode, the
metastore runs in its own separate JVM, not in the Hive service JVM. If other processes want to communicate
with the metastore server, they can communicate using Thrift network APIs. We can also run additional
metastore servers in this case to provide more availability. This also brings better manageability/security
because the database tier can be completely firewalled off, and clients no longer need to share database
credentials with each Hive user to access the metastore database.
To use this remote metastore, you should configure Hive service by setting hive.metastore.uris to the
metastore server URI(s). Metastore server URIs are of the form thrift://host:port, where the port
corresponds to the one set by METASTORE_PORT when starting the metastore server.
The first line printed by the CLI is the local filesystem location where the CLI writes log data about the
commands and queries you execute. If a command or query is successful, the first line of output will be OK,
followed by the output, and finished by the line showing the amount of time taken to run the command or
query.
Service List: There are several services available, including the CLI.
cli – Command-line interface. Used to define tables, run queries, etc. It is the default service if no
other service is specified.
hiveserver – Hive Server. A daemon that listens for Thrift connections from other processes.
hwi – Hive Web Interface. A simple web interface for running queries and other commands
without logging into a cluster machine and using the CLI.
The --auxpath option lets you specify a colon-separated list of “auxiliary” Java archive (JAR) files that
contain custom extensions, etc., that you might require.
The --config directory is mostly useful if you have to override the default configuration properties in
$HIVE_HOME/conf in a new directory.
Command Line Interface(CLI): The Hive Shell
The command-line interface or CLI is the most common way to interact with Hive. Using the CLI, you can
create tables, inspect schema and query tables, etc.
CLI Options
The following command shows a brief list of the options for the CLI.
$ hive --help --service cli
usage: hive
-d,--define <key=value> Variable substitution to apply to hive
commands. e.g. -d A=B or --define A=B
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
-h <hostname> connecting to Hive Server on remote host
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable substitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-p <port> connecting to Hive Server on port number
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the console)
Hive Variables and Properties
Hive provides 4 namespaces for variables and properties: hivevar, hiveconf, system, and env.
hivevar Read/Write (v0.8.0 and later) User-defined custom variables.
hiveconf Read/Write Hive-specific configuration properties.
system Read/Write Configuration properties defined by Java.
env Read only Environment variables defined by the shell environment (e.g., bash).
--hivevar
Syntax to define a variable:
--define <key>=<value>
Or
--hivevar <key>=<value>
The --define key=value option is effectively equivalent to the --hivevar key=value option. Both let you
define on the command line custom variables that you can reference in Hive scripts to customize
execution. This feature is only supported in Hive v0.8.0 and later versions.When you use this feature, Hive
puts the key-value pair in the hivevar “namespace”.
Hive’s variables are internally stored as Java Strings. You can reference variables in queries; Hive replaces
the reference with the variable’s value before sending the query to the query processor.
Inside the CLI, variables are displayed and changed using the SET command.
$ ./hive
hive> set env:HOME;
env:HOME=/home/himanshu
Without the -v flag, set prints all the variables in the namespaces hivevar, hiveconf, system, and env.
With the -v option, it also prints all the properties defined by Hadoop, such as properties controlling
HDFS and MapReduce.
The set command is also used to set new values for variables.
$ ./hive --define name=himanshu
hive> set name;
name =himanshu;
hive> set hivevar:name;
hivevar:name=himanshu;
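A short sketch of referencing the variable inside a query; the table and column names (mytable, owner) are
made up purely for illustration:
-- The reference is replaced with the variable's value before the query is compiled
SELECT * FROM mytable WHERE owner = '${hivevar:name}';
-- Variables can also be (re)defined from within the CLI
SET hivevar:name=paul;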
--hiveconf
It is used for all properties that configure Hive behavior. We’ll use it with a property hive.cli.print.current.db
that was added in Hive v0.8.0. It turns on printing of the current working database name in the CLI prompt.
The default database is named default. This property is false by default:
$ hive --hiveconf hive.cli.print.current.db=true
hive (default)> set hive.cli.print.current.db;
hive.cli.print.current.db=true
system and env
Unlike hivevar variables, you have to use the system: or env: prefix with system properties and environment
variables. The env namespace is useful as an alternative way to pass variable definitions to Hive.
$ YEAR=2012 hive -e "SELECT * FROM mytable WHERE year = ${env:YEAR}";
.hiverc
You can add all your ADD JAR statements (and other settings) to a .hiverc file in your home / Hive config
directory, so that they take effect on hive CLI launch (see the sample below).
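A sketch of what a .hiverc might contain; the jar path below is a made-up placeholder:
-- Sample $HOME/.hiverc: each line runs automatically when the CLI starts
ADD JAR /path/to/custom-udfs.jar;
SET hive.cli.print.current.db=true;
SET hive.exec.mode.local.auto=true;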
-i option
The -i file option lets you specify a file of commands for the CLI to run as it starts, before showing you the
prompt. Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands
it contains, if any.
If the CLI is invoked without the -i option, then Hive will attempt to load $HIVE_HOME/bin/.hiverc and
$HOME/.hiverc as initialization files
Example:
$ hive -i /home/user/hive-init.sql
Autocomplete
If you start typing and hit the Tab key, the CLI will autocomplete possible keywords and function names. For
example, if you type SELE and then the Tab key, the CLI will complete the word SELECT. If you type the Tab
key at the prompt, you’ll get this reply:
hive> Display all 407 possibilities? (y or n)
If you enter y, you’ll get a long list of all the keywords and built-in functions.
Command History
You can use the up and down arrow keys to scroll through previous commands. Actually, each previous line
of input is shown separately; the CLI does not combine multiline commands and queries into a single history
entry. Hive saves the last 100,000 lines into a file $HOME/.hivehistory.
Shell Execution
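The Hive CLI can run shell commands without leaving the prompt by prefixing them with an exclamation mark
and ending with a semicolon (this is the feature used later with ! ls), for example:
hive> ! ls /tmp/;
hive> ! echo "listing done";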
DataTypes
Hive data types fall into two broad groups: primitive types and collection types.
Primitive types:
Numeric – Integral: TINYINT, SMALLINT, INT, BIGINT; Floating: FLOAT, DOUBLE, DECIMAL
String: STRING, VARCHAR
Date/Time: TIMESTAMP, DATE, INTERVAL
Miscellaneous: BOOLEAN, BINARY
Collection types: ARRAY, MAP, STRUCT
Numeric Types
TINYINT (1-byte signed integer, from -128 to 127)
SMALLINT (2-byte signed integer, from -32,768 to 32,767)
INT/INTEGER (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
FLOAT (4-byte single precision floating point number)
DOUBLE (8-byte double precision floating point number)
DOUBLE PRECISION (alias for DOUBLE, only available starting with Hive 2.2.0)
DECIMAL
Introduced in Hive 0.11.0 with a precision of 38 digits
Hive 0.13.0 introduced user-definable precision and scale
NUMERIC (same as DECIMAL, starting with Hive 3.0.0)
Date/Time Types
TIMESTAMP
DATE
INTERVAL
Collection Types
MAP – A collection of key-value tuples, where the fields are accessed using array notation (e.g. ['key']).
For example, if a column name is of type MAP with key→value pairs 'first'→'John' and 'last'→'Doe', then
the last name can be referenced using name['last']. Literal example: map('first', 'John', 'last', 'Doe').
ARRAY – An ordered sequence of elements of the same type that are indexable using zero-based integers.
For example, if a column name is of type ARRAY of strings with the value ['John', 'Doe'], then the second
element can be referenced using name[1]. Literal example: array('John', 'Doe').
STRUCT – An object with named fields that may be of different types; fields are accessed using dot
notation, e.g. address.city.
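A tiny sketch of the access notation in a query. The one-row helper table dual_row is an assumption used only
to have something to select against (INSERT ... VALUES needs Hive 0.14 or later):
CREATE TABLE dual_row (x INT);
INSERT INTO TABLE dual_row VALUES (1);
-- Both expressions below return 'Doe'
SELECT map('first', 'John', 'last', 'Doe')['last'],
       array('John', 'Doe')[1]
FROM dual_row;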
Most relational databases don’t support such collection types, because using them tends to break normal
form.
A practical problem with breaking normal form is the greater risk of data duplication, leading to
unnecessary disk space consumption and potential data inconsistencies, as duplicate copies can grow out
of sync as changes are made.
However, in Big Data systems, a benefit of sacrificing normal form is higher processing throughput.
Scanning data off hard disks with minimal “head seeks” is essential when processing terabytes to
petabytes of data. Embedding collections in records makes retrieval faster with minimal seeks.
Navigating each foreign key relationship requires seeking across the disk, with significant performance
overhead.
File Format
A file format is a way in which information is stored or encoded in a computer file. In Hive it refers to how
records are stored inside the file. As we are dealing with structured data, each record has its own
structure, and how records are encoded in a file defines the file format. File formats mainly differ in
data encoding, compression rate, usage of space and disk I/O.
Hive does not verify whether the data that you are loading matches the schema for the table.
However, it does verify that the file format matches the table definition.
By default Hive supports the following file formats:
TEXTFILE
SEQUENCEFILE
RCFILE
ORCFILE
Parquet (Hive 0.13.0)
TEXTFILE
TEXTFILE is the most common input/output format used in Hadoop. In Hive, if we define a table as
TEXTFILE it can load data from CSV (Comma Separated Values) files, files delimited by tabs or spaces, and
JSON data. This means fields in each record should be separated by a comma, space or tab, or the record may
be JSON (JavaScript Object Notation) data. By default, if we use TEXTFILE format, each line is
considered as a record.
create table olympic
(athelete STRING, age INT, country STRING, year STRING, closing STRING, sport STRING, gold INT,
silver INT, bronze INT,total INT)
row format delimited
fields terminated by '\t'
stored as TEXTFILE;
At the end, we need to specify the type of file format. If we do not specify anything it will consider the file format as
TEXTFILE format.
SEQUENCEFILE
We know that Hadoop’s performance is drawn out when we work with a small number of files with big
size rather than a large number of files with small size. If the size of a file is smaller than the typical block
size in Hadoop, we consider it as a small file. Due to this, a number of metadata increases which will
become an overhead to the NameNode. To solve this problem sequence files are introduced in Hadoop.
Sequence files act as a container to store the small files.
Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to
MapReduce jobs, it decides on the appropriate key-value pairs to be used for a given record. Sequence
files are in the binary format which can be split and the main use of these files is to club two or more
smaller files and make them as a one sequence file.
In Hive we can create a sequence file by specifying STORED AS SEQUENCEFILE in the end of a CREATE
TABLE statement. There are three types of sequence files:
• Uncompressed key/value records.
• Record compressed key/value records – only ‘values’ are compressed here
• Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and
compressed. The size of the ‘block’ is configurable.
Hive has its own SEQUENCEFILE reader and SEQUENCEFILE writer libraries for reading and writing
through sequence files.
CREATE TABLE olympic_sequencefile (athelete STRING, age INT, country STRING, year STRING)
row format delimited
fields terminated by '\t'
stored as sequencefile;
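Create Database
The notes jump next into the clauses of the CREATE DATABASE statement; its general syntax, sketched from
the clauses explained below, is:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];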
Explanation
(DATABASE|SCHEMA) – The uses of SCHEMA and DATABASE are interchangeable; they mean the same
thing, so either one can be used.
IF NOT EXISTS – Optional. While normally you might like to be warned if a database of
the same name already exists, the IF NOT EXISTS clause is useful for scripts that
should create a database on the fly, if necessary, before proceeding. If it is
mentioned and a database with the same name already exists, Hive will
simply skip the step.
COMMENT – Used to add a description for the database. It is optional. Whatever is
mentioned with this parameter will be displayed by the DESCRIBE command.
Example
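The CREATE DATABASE statement this example discusses is not reproduced in the notes; a sketch consistent
with the description below would be:
CREATE DATABASE IF NOT EXISTS financials
COMMENT 'Holds all financial tables'
LOCATION '/etl/hive/data/financials.db';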
The above create database statement will create a directory named "financials.db" at the location
'/etl/hive/data' in HDFS. If we don't mention a location, then Hive will create a directory named
"financials.db" under the path specified by the property hive.metastore.warehouse.dir.
Hive will create a directory for each database. Tables in that database will be stored in subdirectories of
the database directory. The exception is tables in the default database, which doesn’t have its own
directory.
When you don't create any database but create a table, by default that table will be stored under the
default database.
By default, Hive always creates the table’s directory under the directory for the enclosing database. The
exception is the default database. It doesn’t have a directory under /user/hive/warehouse, so a
table in the default database will have its directory created directly in /user/hive/warehouse (unless
explicitly overridden).
Describe database
After creating database, you can see the various properties associated with database using DESCRIBE
command
Syntax
hive> (DESCRIBE|DESC) (DATABASE|SCHEMA) [EXTENDED] database_name
Either DESCRIBE or DESC can be used; one of them must be specified.
Either DATABASE or SCHEMA can be used; one of them must be specified.
EXTENDED is optional. It is used to retrieve more information about the database.
Example
hive >DESCRIBE DATABASE financials;
financials hdfs:/etl/hive/data/financials.db
Use Databases
We can set the database on which we need to work with USE command in hive. It sets the current
database to be used for further hive operations.
As, by default, we enter into default database in Hive CLI, we need to change our database if we need to
point to our custom database.
The USE command sets a database as your working database, analogous to changing working directories
in a filesystem:
hive> USE financials;
OK
Time taken: 1.051 second
Show Databases
Let’s verify the creation of these databases in Hive CLI with show databases command. It will list down the
databases in hive.
Syntax
SHOW (DATABASES|SCHEMAS) [LIKE identifier_with_wildcards];
By default, SHOW DATABASES lists all of the databases defined in the metastore.
LIKE – It is optional. But it allows us to filter the database names using a regular expression.
Wildcards in the regular expression can only be '*' for any character(s) or '|' for a choice.
Examples are 'employees', 'emp*', 'emp*|*ees' (emp* or *ees), all of which will match the database named
'employees'.
Examples
Below is the sample output of show databases command after execution above two creation commands.
hive> show databases;
OK
default
test_db
test_db2
Time taken: 0.072 seconds, Fetched: 3 row(s)
hive> SHOW DATABASES LIKE '*db*';
OK
test_db
test_db2
Time taken: 0.014 seconds, Fetched: 2 row(s)
hive>
Example
Let's add a new property 'Modified by' to the above created database test_db; we can then see the result with
'describe extended'.
hive> ALTER SCHEMA test_db SET DBPROPERTIES ('Modified by' = 'Sekhar');
OK
Time taken: 0.414 seconds
Drop Database
Finally, you can drop a database:
Syntax
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Either DATABASE or SCHEMA can be used; one of them must be specified.
IF EXISTS – It is optional but used to suppresses warnings if database_name doesn’t exist.
RESTRICT – This is optional and even if it is used, it is same as default hive behavior, i.e. it will not allow
database to be dropped until all the tables inside it are dropped.
CASCADE – This argument allows to drop the non-empty databases with single command. DROP with
CASCADE is equivalent to dropping all the tables separately and dropping the database finally in
cascading manner.
Example
hive> DROP DATABASE IF EXISTS financials;
By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables
first or append the CASCADE keyword to the command, which will cause the Hive to drop the tables in
the database first:
Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where existing
tables must be dropped before dropping the database.
When a database is dropped, its directory is also deleted.
Hive Tables
Introduction to Hive Tables
An internal table is also called a managed table, meaning it is "managed" by Hive: when you drop an
internal table, both the table schema (or definition) in the metastore AND the physical data (the table's
file structure) in the Hadoop Distributed File System (HDFS) are dropped (similar to a truncate operation).
An external table is not "managed" by Hive: when you drop an external table, the schema/table definition
is deleted from the metastore, but the data/rows associated with it in HDFS are left alone, i.e. the table's
rows are not deleted.
For a managed table the LOCATION clause is not mandatory; if LOCATION is not mentioned, Hive creates
the table directory structure inside the warehouse directory path given by the
hive.metastore.warehouse.dir parameter. For an external table the LOCATION clause is mandatory;
otherwise the table will be managed by Hive even if we create it with the EXTERNAL keyword.
With no Trash facility, when an INTERNAL table is deleted, the data along with the table is deleted forever
and there is no chance of getting the data or table back. When an EXTERNAL table is deleted, only the
table schema is removed whereas the underlying data remains untouched, so we can recreate the table.
Use a managed table when:
The data is temporary.
You want Hive to manage the table data completely, not allowing any external source to use the table.
You don't want the data after deletion.
Use an external table when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing
program that doesn't lock the files.
Hive should not own the data and control settings, directories, etc.; you have another program or process
that will do those things.
You are not creating the table based on an existing table (AS SELECT).
You want to be able to recreate the table with the same schema later and point it at the location of the data.
Creating Table
Complex Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO
num_buckets BUCKETS]
[ SKEWED BY (col_name, ...) ON ([(col_value, ...), ...|col_value, ...])
[STORED AS DIRECTORIES] ]
[ [ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];
Explanation
TEMPORARY – Specified for creation of temporary tables (Hive 0.14.0 and later)
EXTERNAL – Specified when you want to make any table external
IF NOT EXISTS – it is optional. Suppresses error messages when a table already exists with same name
and ignores creation of table again even if there is a schema difference between existing table and new
table.
db_name – This is also optional but can be used to specify the table under a particular target database, if
we are not already working under it.
COMMENT – This is also optional. Similar to CREATE DATABASE statement comments, we can add
comments to table as well as to columns (strings within single quotes) to provide descriptive information
to users.
PARTITIONED BY – This clause is useful to partition the tables based on particular columns. Detailed
discussion on Partitioning is deferred to another individual post Partitioning and Clustering tables in
Hive.
SKEWED BY – This clause is useful to create skewed tables.
Example
Consider an employee dataset having 5 different columns: name, salary, subordinate, deduction and
address (a sketch of the corresponding table follows this list).
name is of STRING type, which holds the name of the employee.
salary is of FLOAT type, which holds the salary of the employee.
subordinate is of ARRAY of STRING type, which holds the names of the subordinates of the corresponding
employee.
deduction is of MAP type, which holds the deduction name as key and the deduction percentage as value.
address is of STRUCT type, which holds the address of the employee: street, city, state, zip.
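A sketch of a CREATE TABLE matching that description; the table name employees and the field names inside
the STRUCT are assumptions, while the complex type syntax is standard HiveQL:
CREATE TABLE IF NOT EXISTS employees (
  name        STRING,
  salary      FLOAT,
  subordinate ARRAY<STRING>,
  deduction   MAP<STRING, FLOAT>,
  address     STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- Collection fields are then queried with the notation described earlier, e.g.:
-- SELECT name, subordinate[0], deduction['tax'], address.city FROM employees;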
If the OVERWRITE keyword is used, then Hive will replace the previous data if it exists.
Here the LOCATION clause is mandatory in order to tell Hive the location of the data.
An EXTERNAL table can also be created by copying the schema of another existing table using the LIKE
keyword:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';
The storage information Hive records for a table can be inspected with DESCRIBE FORMATTED <table>;
truncated sample output for a simple two-column table follows:
col1 string
col2 int
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.022 seconds, Fetched: 31 row(s)
hive>
Skewed Tables
These were introduced to improve the performance of tables with one or more
columns having skewed (very frequently repeated) values.
Hive will split the records with the skewed values into separate files, with the rest of the values going to
some other file, and this split is taken into account at the time of querying the table, so that Hive can skip
(or include) a whole file based on the input criteria (see the sketch below).
These are not separate table types, but can be managed or external.
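A sketch of the syntax, using a hypothetical orders table in which a handful of customer ids dominate the data:
CREATE TABLE orders_skewed (
  order_id    BIGINT,
  customer_id STRING,
  amount      DOUBLE)
SKEWED BY (customer_id) ON ('C001', 'C002')
STORED AS DIRECTORIES;  -- optional: keep each skewed value in its own directory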
Show Tables
The SHOW TABLES command lists the tables. With no additional arguments, it shows the tables in the
current working database. Let’s assume we have already created a few other tables, table1 and
table2, and we did so in the mydb database:
hive> USE mydb;
hive> SHOW TABLES;
OK
Time taken: 18.482 seconds
employees
table1
table2
If we aren’t in the same database, we can still list the tables in that database:
hive> USE default;
hive> SHOW TABLES IN mydb;
OK
Time taken: 18.482 seconds
employees
table1
table2
If we have a lot of tables, we can limit the ones listed using a regular expression:
hive> USE mydb;
hive> SHOW TABLES 'empl.*';
employees
The regular expression in the single quote looks for all tables with names starting with empl and ending with
any other characters (the .* part).
Alter Table
Most table properties can be altered with ALTER TABLE statements, which change metadata about the
table but not the data itself.
ALTER TABLE modifies table metadata only. The data for the table is untouched
Rename Table
ALTER TABLE table_name RENAME TO new_table_name;
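Changing Columns
The paragraph below refers to an ALTER TABLE ... CHANGE COLUMN example that is not reproduced in the
notes; a sketch in the spirit of the log_messages table used later in this section:
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;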
You have to specify the old name, a new name, and the type, even if the name or type is not changing.
The keyword COLUMN is optional as is the COMMENT clause.
If you aren’t moving the column, the AFTER other_column clause is not necessary. In the example shown,
we move the column after the severity column. If you want to move the column to the first position, use
FIRST instead of AFTER other_column. As always, this command changes metadata only. If you are
moving columns, the data must already match the new schema or you must change it to match by some
other means.
Adding Columns
You can add new columns to the end of the existing columns, before any partition columns.
ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id BIGINT COMMENT 'The current session id');
The COMMENT clauses are optional, as usual. If any of the new columns are in the wrong position, use an
ALTER TABLE ... CHANGE COLUMN statement for each one to move it to the correct position.
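Deleting or Replacing Columns
The paragraph below describes the effect of an ALTER TABLE ... REPLACE COLUMNS statement that is not
reproduced in the notes; a sketch consistent with that description (the retained column names are assumptions):
ALTER TABLE log_messages REPLACE COLUMNS (
  hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
  severity        STRING COMMENT 'Log level',
  message         STRING COMMENT 'The rest of the message');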
This statement effectively renames the original hms column and removes the server and process_id
columns from the original schema definition. As for all ALTER statements, only the table metadata is
changed.
The REPLACE statement can only be used with tables that use one of the native SerDe modules:
DynamicSerDe or MetadataTypedColumnsetSerDe. Recall that the SerDe determines how records are
parsed into columns (deserialization) and how a record’s columns are written to storage (serialization).
Dropping Tables
The familiar DROP TABLE command from SQL is supported:
DROP TABLE IF EXISTS employees;
The IF EXISTS keywords are optional. If not used and the table doesn’t exist, Hive returns an error.
For managed tables, the table metadata and data are deleted.
For external tables, the metadata is deleted but the data is not.
Actually, if you enable the Hadoop Trash feature, which is not on by default, the data is moved to the .Trash
directory in the distributed filesystem for the user, which in HDFS is /user/$USER/.Trash. To enable this
feature, set the property fs.trash.interval to a reasonable positive number.
SequenceFile.
One benefit of sequence files is that they support block-level compression, so you can compress the
contents of the file while also maintaining the ability to split the file into segments for multiple map tasks.
In Hive we can create a sequence file by specifying STORED AS SEQUENCEFILE in the end of a CREATE
TABLE statement.
Creating SEQUENCEFILE
CREATE TABLE olympic_sequencefile(
athelete STRING,
age INT,
country STRING,
year STRING,
sport STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;
When we specify STORED AS SEQUENCEFILE, Hive internally sets the Java classes for the
InputFormat, OutputFormat, and SerDe:
For InputFormat it assigns org.apache.hadoop.mapred.SequenceFileInputFormat
For OutputFormat it assigns org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
For ROW FORMAT SERDE it assigns org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Note :
Loading data into this table is somewhat different from loading into a table created using TEXTFILE
format. You need to insert the data from another table, because SEQUENCEFILE is a binary format:
Hive compresses the data and then stores it in the table. Loading a plain text file directly, as with
TEXTFILE format, is not possible, because plain files cannot simply be placed into the compressed,
binary layout.
RCFile
RCFILE stands for Record Columnar File, which is another type of binary file format that offers a high
compression rate on top of the rows.
RCFILE is used when we want to perform operations on multiple rows at a time.
RCFILEs are flat files consisting of binary key/value pairs, which share many similarities with
SEQUENCEFILE. RCFILE stores the columns of a table in a columnar manner: it first
partitions rows horizontally into row splits and then vertically partitions each row split in a
columnar way. RCFILE stores the metadata of a row split as the key part of a record, and all the
data of the row split as the value part. This means that RCFILE encourages column-oriented storage
rather than row-oriented storage.
Column-oriented organization is a good storage option for certain types of data and applications. For
example, if a given table has hundreds of columns but most queries use only a few of the columns, it is
wasteful to scan entire rows then discard most of the data. However, if the data is stored by column
instead of by row, then only the data for the desired columns has to be read, improving performance.
This column-oriented storage is very useful when performing analytics: it is easier to perform analytics
when we have a column-oriented storage type.
Facebook uses RCFILE as its default file format for storing of data in their data warehouse as they
perform different types of analytics using Hive.
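A sketch of creating an RCFile-backed table, reusing the olympic column layout from the earlier examples:
CREATE TABLE olympic_rcfile (
  athelete STRING,
  age      INT,
  country  STRING,
  year     STRING,
  closing  STRING,
  sport    STRING)
STORED AS RCFILE;
-- As with SEQUENCEFILE, data is normally loaded with an INSERT ... SELECT
-- from a text-format staging table rather than with a direct LOAD of a text file.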
Avro
If we do not have an Avro schema file, we can still create an Avro schema while creating the table.
This process starts with the creation of a JSON-based schema, so that data is serialized in a format that has
the schema built in. Avro has its own parser that returns the provided schema as an object, and the created
object allows us to create records with that schema.
We can declare the schema inside the table properties while creating a Hive table, as
TBLPROPERTIES ('avro.schema.literal'='{json schema here}');
Now, let's create an Avro-format table for the olympic data (a sketch follows below).
Inside the TBLPROPERTIES you can see the schema of the data. Every record inside the schema in
TBLPROPERTIES becomes a column of the olympic_avro table: 'name' defines the column name and 'type'
defines the datatype of the particular column.
If you are using Hive 0.14.0 or later, you don't even need to mention ROW FORMAT SERDE, INPUTFORMAT,
and OUTPUTFORMAT.
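The CREATE TABLE itself is not reproduced in the notes; a sketch using the Avro SerDe classes shipped with
Hive (the schema literal mirrors the Olympic_txt columns used below):
CREATE TABLE olympic_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "name": "olympic", "type": "record",
  "fields": [
    {"name": "athelete", "type": "string"},
    {"name": "age",      "type": "int"},
    {"name": "country",  "type": "string"},
    {"name": "year",     "type": "string"},
    {"name": "closing",  "type": "string"},
    {"name": "sport",    "type": "string"} ]}');
-- On Hive 0.14.0 and later the same table can be declared simply as:
-- CREATE TABLE olympic_avro (athelete STRING, age INT, country STRING,
--                            year STRING, closing STRING, sport STRING) STORED AS AVRO;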
Data Insertion into Avro Table:
There are 2 methods by which the data can be inserted into an Avro table:
1. If we have a file with the extension '.avro' and the schema of the file is the same as what you specified,
then you can directly import the file using a command of the form
LOAD DATA LOCAL INPATH '<path of the .avro file>' INTO TABLE olympic_avro;
2. You can copy the contents of a previously created table into the newly created Avro table. Let’s take a
look at the second type of data insertion technique to import data into an Avro table. We will begin by
creating a table which is delimited by tab space and stored as textfile
Text File Table Creation:
CREATE TABLE Olympic_txt(
athelete STRING,
age INT,
country STRING,
year STRING,
closing STRING,
sport STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Now text-file data can simply be loaded into this table using the LOAD command.
The data in this Olympic_txt table can then be loaded into the above olympic_avro table using a simple
INSERT command:
INSERT OVERWRITE TABLE olympic_avro SELECT * FROM Olympic_txt;
Parquet
Parquet is an open source file format for Hadoop. Parquet stores nested data structures in a flat
columnar format. Compared to a traditional approach where data is stored row by row,
Parquet is more efficient in terms of storage and performance.
Parquet stores binary data in a column-oriented way, where the values of each column are organized
so that they are all adjacent, enabling better compression. It is especially good for queries which read
particular columns from a "wide" table (one with many columns), since only the needed columns are read
and I/O is minimized.
When we are processing big data, the cost required to store such data is high (Hadoop stores data
redundantly, i.e. 3 copies of each file, to achieve fault tolerance), and along with the storage cost there are
CPU, network I/O, etc. costs for processing the data. As the data grows, the cost of processing and storage
increases. Parquet is a good choice for big data because it serves both needs: efficiency and performance
in both storage and processing.
Advantages of using Parquet
There are several advantages to columnar formats.
Organizing by column allows for better compression, as data is more homogeneous. The space
savings are very noticeable at the scale of a Hadoop cluster.
I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data.
Better compression also reduces the bandwidth required to read the input.
As we store data of the same type in each column, we can use encoding better suited to the modern
processors’ pipeline by making instruction branching more predictable.
Creating table in hive to store parquet format:
To use Parquet with Hive 0.10 – 0.12 you must download the Parquet Hive package from the Parquet
project. You want the parquet-hive-bundle jar in Maven Central.
From Hive 0.13 Native Parquet support was added.
CREATE TABLE Olympic_parquet(
athelete STRING,
age INT,
country STRING,
year STRING,
closing STRING,
sport STRING)
STORED AS PARQUET;
We cannot load data directly into a Parquet table. We should first create an alternate table to store the
text file and use an INSERT OVERWRITE command to write the data in Parquet format.
Let's use the Olympic_txt table and load its data into Olympic_parquet (a sketch follows):
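A sketch of that copy, assuming Olympic_txt has already been populated with LOAD DATA:
INSERT OVERWRITE TABLE Olympic_parquet
SELECT athelete, age, country, year, closing, sport
FROM Olympic_txt;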
Loading Data into Tables
To load data from the local file system we use the LOAD command with the LOCAL keyword, for example:
LOAD DATA LOCAL INPATH '/home/local/filesystem/path' INTO TABLE employee;
Here LOCAL tells Hive that the data is present in the local file system; the data is copied into the final location.
To load data from an HDFS path we use the LOAD command without LOCAL, as follows:
LOAD DATA INPATH '/home/hdfs/filesystem/path' INTO TABLE employee;
If LOCAL is omitted, the path is assumed to be in the distributed filesystem. In this case, the data is moved
from the path to the final location. The rationale for this inconsistency is the assumption that you usually
don’t want duplicate copies of your data files in the distributed filesystem.
Also, because files are moved in this case, Hive requires the source and target files and directories to be in
the same file system. For example, you can’t use LOAD DATA to load (move) data from one HDFS cluster to
another.
Hive does not verify that the data you are loading matches the schema for the table. However, it will
verify that the file format matches the table definition.
Inserting Data into Tables from Queries
If you specify the OVERWRITE keyword, any data already present in the target directory will be deleted
first. Without the keyword, the new files are simply added to the target directory. However, if files
already exist in the target directory that match filenames being loaded, the old files are overwritten.
LOAD DATA LOCAL INPATH '/home/local/filesystem/path' OVERWRITE INTO TABLE employees;
With OVERWRITE, any previous contents of the partition (or whole table if not partitioned) are replaced.
If you drop the keyword OVERWRITE or replace it with INTO, Hive appends the data rather than replacing
it. This feature is only available in Hive v0.8.0 or later. A sketch of such an insert follows.
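A sketch of inserting query results into a table; the staged_sales and sales_partitioned table names and their
columns are assumptions used only for illustration:
INSERT OVERWRITE TABLE sales_partitioned
PARTITION (country = 'US', state = 'OR')
SELECT id, amount FROM staged_sales s
WHERE s.cnty = 'US' AND s.st = 'OR';
-- Replacing OVERWRITE with INTO appends to the partition instead of replacing it.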
Creating Tables and Loading Them in One Query
You can also create a table and insert query results into it in one statement:
CREATE TABLE ca_employees
AS SELECT name, salary, address FROM employees se
WHERE se.state = 'CA';
This table contains just the name, salary, and address columns from the employee table records for
employees in California. The schema for the new table is taken from the SELECT clause.This feature can’t
be used with external tables.
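Exporting Data
The next paragraph refers to writing query results out to a local directory; the statement itself is not
reproduced in the notes, so here is a sketch:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address FROM employees se
WHERE se.state = 'CA';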
OVERWRITE and LOCAL have the same interpretations as before and paths are interpreted following the
usual rules. One or more files will be written to /tmp/ca_employees, depending on the number of
reducers invoked.
We can look at the results from within the hive CLI:
hive> ! ls /tmp/ca_employees;
000000_0
Partitioning in Hive
Table partitioning means dividing the table data into parts based on the values of particular columns,
like date or country, segregating the input records into different files/directories based on those values.
Partitioning can be done on more than one column, which imposes a multi-dimensional structure on the
directory storage. For example, in addition to partitioning log records by a date column, we can also
subdivide a single day's records into country-wise separate files by including the country column in the
partitioning. We will see more about this in the examples.
Partitions are defined at the time of table creation using the PARTITIONED BY clause, with a list of
column definitions for partitioning.
Syntax
CREATE [EXTERNAL] TABLE table_name (col_name_1 data_type_1, ....)
PARTITIONED BY (col_name_n data_type_n [COMMENT col_comment], ...);
Advantages
Partitioning is used for distributing execution load horizontally.
As the data is stored as slices/parts, query response time is faster to process the small part of the data
instead of looking for a search in the entire data set.
For example, In a large user table where the table is partitioned by country, then selecting users of
country ‘IN’ will just scan one directory ‘country=IN’ instead of all the directories.
Limitations
Having too many partitions in a table creates a large number of files and directories in HDFS, which is an
overhead for the NameNode, since it must keep all metadata for the file system in memory.
Partitions may optimize some queries based on WHERE clauses, but may be less responsive for other
important queries on grouping clauses.
In MapReduce processing, a huge number of partitions will lead to a huge number of tasks (each running
in a separate JVM) in each MapReduce job, which creates a lot of overhead in maintaining JVM start up
and tear down. For small files, a separate task will be used for each file. In the worst case, the overhead of
JVM start up and tear down can exceed the actual processing time.
Creation of Partition Table
Managed Partitioned Table
Below is a sketch of the HiveQL to create a managed partitioned table as per the above requirements.
The partitioned columns country and state can be used in query WHERE clauses and can be treated as
regular column names, even though there is no actual such column inside the input file data.
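The original statement is not reproduced in the notes; a sketch that matches the external table and LOAD
statements used below (the name mn_partitioned_order and the column list mirror those examples):
CREATE TABLE mn_partitioned_order (
  name   STRING,
  item   STRING,
  addres STRING,
  city   STRING,
  zip    STRING)
PARTITIONED BY (country STRING, state STRING)
STORED AS TEXTFILE;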
External Partitioned Tables
We can create external partitioned tables as well, just by using the EXTERNAL keyword in the CREATE
statement, but for creation of External Partitioned Tables, we do not need to mention LOCATION clause
as we will mention locations of each partitions separately while inserting data into table.
CREATE EXTERNAL TABLE ex_partitioned_order(
name STRING,
item STRING,
addres STRING,
city STRING,
zip STRING )
PARTITIONED BY (country STRING, state STRING)
STORED AS TEXTFILE;
Now we will load data into the above managed partitioned table:
LOAD DATA LOCAL INPATH '/home/himanshu/inputdir/Ind/MH/staticinput.txt'
INTO TABLE mn_partitioned_order
PARTITION (country = 'Ind', state = 'MH');
This will create separate directory under the default warehouse directory in HDFS.
/user/hive/warehouse/partitioned_order/country=Ind/state=MH/
Similarly we have to add other partitions, which will create corresponding directories in HDFS.
Or else we can load the entire directory into the Hive table with a single command and add partitions for each file with the ALTER command.
LOAD DATA LOCAL INPATH '/home/himanshu/inputdir' INTO TABLE mn_partitioned_order;
ALTER TABLE mn_partitioned_order ADD IF NOT EXISTS
PARTITION (country = 'Ind', state = 'OD')
LOCATION '/user/hive/warehouse/partitioned_order/country=Ind/state=OD/'
PARTITION (country = 'Ind', state = 'KA')
LOCATION '/user/hive/warehouse/partitioned_order/country=Ind/state=KA/'
PARTITION (country = 'Ind', state = 'TN')
LOCATION '/user/hive/warehouse/partitioned_order/country=Ind/state=TN/';
This will create separate directories under the default warehouse directory in HDFS. Multiple partitions can be added in the same query when using Hive v0.8.0 and later.
/user/hive/warehouse/partitioned_order/country=Ind/state=OD/
/user/hive/warehouse/partitioned_order/country=Ind/state=KA/
/user/hive/warehouse/partitioned_order/country=Ind/state=TN/
Loading Data into Managed Partitioned Table from Other Table
Consider we have another table name temp_order as follows
CREATE TABLE temp_order(
userid INT,
name STRING,
item STRING,
address STRING,
city STRING,
state STRING,
zip STRING,
country STRING )
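The INSERT statement itself appears to be missing from this copy; a minimal sketch using static partition values and the column order of the tables defined above might look like:
INSERT OVERWRITE TABLE mn_partitioned_order
PARTITION (country = 'Ind', state = 'MH')
SELECT name, item, address, city, zip
FROM temp_order
WHERE country = 'Ind' AND state = 'MH';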
Loading Data into External Partitioned Table From HDFS using Static Partition
There is an alternative for bulk loading of partitions into a Hive table. As the data is already present in HDFS and should only be made accessible to Hive, we will just register the locations of the HDFS files for each partition.
If our files are on Local FS, they can be moved to a directory in HDFS using –put or -copyFromLocal and we
can add partition for each file in that directory with commands similar to below.
hive> ALTER TABLE ex_partitioned_order
ADD PARTITION (country = 'US', state = 'CA')
LOCATION '/hive/external/tables/order/country=US/state=CA/';
As this is an external table, we are not loading data into the table; rather we are linking the location of the file to a partition of the given table. Similarly, we need to repeat the above ALTER command for every partition file in the directory, so that a metadata entry mapping the partition to the table is created in the metastore.
2. Dynamic Partitioning in Hive
Instead of loading each partition with a separate SQL statement as shown above (which would mean writing a lot of SQL statements for a huge number of partitions), Hive supports dynamic partitioning, with which we can add any number of partitions with a single SQL execution. Hive automatically splits our data into separate partition files based on the values of the partition keys present in the input data.
This gives the advantages of easier coding and no need for manual identification of partitions. Dynamic partitioning suits the user-records requirement described above well.
Dynamic partitioning is suitable when you have a large amount of data stored in an unpartitioned table.
Usually dynamic partitioning loads the data from a non-partitioned table.
Dynamic partitioning takes more time to load data than static partitioning.
Dynamic partitioning is also suitable when you want to partition on a column but do not know in advance how many partition values it will have.
Before using dynamic partitioning, we have to consider the following configuration parameters.
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
<description>Whether or not to allow dynamic partitions in DML/DDL. </description>
</property>
By default hive.exec.dynamic.partition is set to false. Setting it to true allows dynamic partitioning to run.
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
<description>
In strict mode, the user must specify at least one static partition in case
The user accidentally overwrites all partitions. In nonstrict mode all
partitions are allowed to be dynamic.
</description>
</property>
By default hive.exec.dynamic.partition.mode is set to strict. In strict mode, the user must specify at least one static partition column, which guards against accidentally overwriting all partitions. You can set the mode to nonstrict, as above, to allow all partition columns to be dynamic.
<property>
<name>hive.exec.max.dynamic.partitions</name>
<value>1000</value>
<description>Maximum number of dynamic partitions allowed to be created in
total.
</description>
</property>
This parameter sets the maximum number of dynamic partitions that can be created; by default it is 1000. The practical upper limit depends on the cluster's hardware configuration.
hive.exec.max.dynamic.partitions (default value 1000) is the total number of dynamic partitions that can be created by one DML statement. If each mapper/reducer does not exceed its own limit but the total number of dynamic partitions does, an exception is raised at the end of the job, before the intermediate data is moved to the final destination.
<property>
<name>hive.exec.max.dynamic.partitions.pernode</name>
<value>1000</value>
<description> Maximum number of dynamic partitions allowed to be created in
each mapper/reducer node.
</description>
</property>
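With these properties set, a single statement can create all the partitions dynamically. A minimal sketch, again assuming the temp_order staging table above (the dynamic partition columns must come last in the SELECT list, in partition order):
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT OVERWRITE TABLE mn_partitioned_order
PARTITION (country, state)
SELECT name, item, address, city, zip, country, state
FROM temp_order;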
If we have a lot of partitions and want to see partitions for particular partition keys, we can further
restrict the command with an optional PARTITION clause that specifies one or more of the partitions with
specific values.
hive> SHOW PARTITIONS partitioned_user PARTITION(country='US');
Describe partitions
As we already know how to see the descriptions of tables, Now we can see the descriptions of each partition
with commands similar to below.
hive> DESCRIBE FORMATTED partitioned_user PARTITION(country='US', state='CA');
Alter Partitions
We can alter/change partitions (add/change/drop) with the help of below commands.
Adding Partitions
We can add partitions to an existing table with ADD PARTITION clause as shown below.
ALTER TABLE partitioned_user ADD IF NOT EXISTS
PARTITION (country = 'US', state = 'XY') LOCATION '/hdfs/external/file/path1'
PARTITION (country = 'CA', state = 'YZ') LOCATION '/hdfs/external/file/path2'
PARTITION (country = 'UK', state = 'ZX') LOCATION '/hdfs/external/file/path3';
Changing Partitions
We can change a partition location with commands like the one below. This command does not move the data from the old location, nor does it delete the old data; however, the table will no longer reference the files at the old location.
ALTER TABLE partitioned_user PARTITION (country='US', state='CA')
SET LOCATION '/hdfs/partition/newpath';
Drop Partitions
We can drop partitions of a table with DROP IF EXISTS PARTITION clause as shown below.
ALTER TABLE partitioned_user DROP IF EXISTS PARTITION(country='US', state='CA');
Archive Partition
The ARCHIVE PARTITION clause captures the partition files into a Hadoop archive (HAR) file. This only
reduces the number of files in the filesystem, reducing the load on the NameNode, but doesn’t provide any
space savings.
ALTER TABLE log_messages ARCHIVE PARTITION(country='US',state='XZ');
Bucketing in Hive
Unlike partition columns (which are not included in the table column definitions), bucketed columns are included in the table definition, as shown in the sketch below for the state and city columns.
The CLUSTERED BY (...) clause defines the column(s) on which the table is bucketed.
The INTO ... BUCKETS clause defines how many buckets will be created.
SORTED BY (...) is an optional clause; when present, it sorts the data within each bucket on the given column.
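The CREATE statement referred to above is not shown in this copy; a minimal sketch of a bucketed table on the state and city columns (the table name, column list and bucket count are assumptions) might look like:
CREATE TABLE bucketed_user(
firstname STRING,
country STRING,
state STRING,
city STRING )
CLUSTERED BY (state, city) SORTED BY (city) INTO 32 BUCKETS
STORED AS TEXTFILE;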
hive> SELECT firstname, country, state, city FROM temp_user LIMIT 129 ;
OK
first_name country state city
Rebbecca AU TA Leith
Stevie AU QL Proston
Mariko AU WA Hamel
Gerardo AU NS Talmalmo
Mayra AU NS Lane Cove
Idella AU WA Cartmeticup
Sherill AU WA Nyamup
Ena AU NS Bendick Murrell
We can also perform random sampling of bucketed tables with the TABLESAMPLE clause; sampling is discussed further in the Block Sampling section later.
The following two queries are identical. The second version uses a table alias e, which is not very useful
in this query, but becomes necessary in queries with JOINs
hive> SELECT name, salary FROM employees;
hive> SELECT e.name, e.salary FROM employees e;
When you select columns that are one of the collection types, Hive uses JSON (JavaScript Object Notation)
syntax for the output. First, let’s select the subordinates, an ARRAY, where a comma-separated list
surrounded with […] is used. Note that STRING elements of the collection are quoted, while the primitive
STRING name column is not:
hive> SELECT name, subordinates FROM employees;
John Doe ["Mary Smith","Todd Jones"]
Mary Smith ["Bill King"]
Todd Jones []
Bill King []
The deductions is a MAP, where the JSON representation for maps is used, namely a comma-separated list
of key:value pairs, surrounded with {...}:
hive> SELECT name, deductions FROM employees;
John Doe {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Himanshu Sekhar Paul Apache HIVE |52
Finally, the address is a STRUCT, which is also written using the JSON map format:
hive> SELECT name, address FROM employees;
John Doe {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones {"street":"200 Chicago Ave.","city":"Oak
Park","state":"IL","zip":60700}
Bill King {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}
ARRAY indexing is 0-based, as in Java. Here is a query that selects the first element of the subordinates
array:
hive> SELECT name, subordinates[0] FROM employees;
John Doe Mary Smith
Mary Smith Bill King
Todd Jones NULL
Bill King NULL
Note that referencing a nonexistent element returns NULL. Also, the extracted STRING values are no
longer quoted! To reference a MAP element, you also use ARRAY[...] syntax, but with key values instead of
integer indices:
hive> SELECT name, deductions["State Taxes"] FROM employees;
John Doe 0.05
Mary Smith 0.05
Todd Jones 0.03
Bill King 0.03
Finally, to reference an element in a STRUCT, you use “dot” notation, similar to the table_alias.column
mentioned above:
hive> SELECT name, address.city FROM employees;
John Doe Chicago
Mary Smith Chicago
Todd Jones Oak Park
Bill King Obscuria
Column Aliases
When a table has long column names, or when a SELECT expression has no name at all, referring to such a column repeatedly (for example in a join) becomes tedious because we have to write the long expression each time. So it is sometimes useful to give these anonymous columns a name, called a column alias.
A computed result set like this can be aliased as e, from which we perform a second query to select the name and the salary_minus_fed_taxes, where the latter is greater than 70,000 (see the sketch below).
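That query does not appear in this copy; a minimal sketch of it, reusing the employees table from the earlier examples, might look like:
hive> SELECT e.name, e.salary_minus_fed_taxes
FROM (
SELECT name, salary,
salary * (1 - deductions["Federal Taxes"]) AS salary_minus_fed_taxes
FROM employees
) e
WHERE round(e.salary_minus_fed_taxes) > 70000;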
Furthermore, Hive will attempt to run other operations in local mode if the hive.exec.mode.local.auto
property is set to true:
set hive.exec.mode.local.auto=true;
Otherwise, Hive uses MapReduce to run all other queries.
LIMIT Clause
The results of a typical query can return a large number of rows. The LIMIT clause puts an upper limit on
the number of rows returned:
hive> SELECT upper(name), salary, deductions["Federal Taxes"],
round(salary * (1 - deductions["Federal Taxes"])) FROM employees
LIMIT 2;
WHERE Clauses
While SELECT clauses select columns, WHERE clauses are filters; they select which records to return.
WHERE clauses use predicate expressions, applying predicate operators. Several predicate expressions
can be joined with AND and OR clauses. When the predicate expressions evaluate to true, the
corresponding rows are retained in the output.
SELECT * FROM employees WHERE country = 'US' AND state = 'CA';
Predicates can also contain expressions that involve some computation:
hive> SELECT name, salary, deductions["Federal Taxes"],
salary * (1 - deductions["Federal Taxes"])
FROM employees
WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;
We cannot reference column aliases in the WHERE clause, so the following rewrite of the above query, although tempting, is not valid in Hive:
hive> SELECT name, salary, deductions["Federal Taxes"],
salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
FROM employees
WHERE round(salary_minus_fed_taxes) > 70000;
GROUP BY Clauses
The GROUP BY statement is often used in conjunction with aggregate functions to group the result set by
one or more columns and then perform an aggregation over each group.
SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
GROUP BY year(ymd);
Note:
When using clauses like GROUP BY or ORDER BY, make sure that every column in the SELECT list is either wrapped in an aggregate function or listed in the GROUP BY clause.
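HAVING Clause
The HAVING clause lets us constrain the groups produced by GROUP BY using an aggregate expression. The HAVING example that the next paragraph refers to appears to be missing from this copy; a minimal sketch, reusing the stocks query above, might look like:
hive> SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
GROUP BY year(ymd)
HAVING avg(price_close) > 50.0;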
Without the HAVING clause, this query would require a nested SELECT statement:
hive> SELECT s2.year, s2.avg FROM
(SELECT year(ymd) AS year, avg(price_close) AS avg FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
GROUP BY year(ymd)) s2
WHERE s2.avg > 50.0;
ORDER BY Clause
The ORDER BY clause performs a total ordering of the query result set. This means that all the data is
passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.
You can specify any columns you wish and specify whether or not the columns are ascending using the
ASC keyword (the default) or descending using the DESC keyword.
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
ORDER BY s.ymd ASC, s.symbol DESC;
Because ORDER BY can result in excessively long run times, Hive will require a LIMIT clause with ORDER
BY if the property hive.mapred.mode is set to strict. By default, it is set to nonstrict
SORT BY Clause
SORT BY orders the data only within each reducer, performing a local ordering, so that each reducer's output will be sorted. Total ordering is traded away for better performance.
You can specify any columns you wish and specify whether or not the columns are ascending using the
ASC keyword (the default) or descending using the DESC keyword.
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
SORT BY s.ymd ASC, s.symbol DESC;
As more than one reducer is invoked, the output will be sorted differently than ORDER BY. While each
reducer’s output files will be sorted, the data will probably overlap with the output of other reducers.
By default, MapReduce computes a hash on the keys output by mappers and tries to evenly distribute the
key-value pairs among the available reducers using the hash values. Unfortunately, this means that when
we use SORT BY, the contents of one reducer’s output will overlap significantly with the output of the
other reducers, as far as sorted order is concerned, even though the data is sorted within each reducer’s
output .
DISTRIBUTE BY Clause
DISTRIBUTE BY controls how map output is divided among reducers. All data that flows through a
MapReduce job is organized into key-value pairs. Hive must use this feature internally when it converts
your queries to MapReduce jobs.
As described above , in SORT BY there may be chance data will probably overlap with the output of other
reducers, we can use DISTRIBUTE BY first to ensure that the same key from output of mapper goes to the
same reducer, then use SORT BY to order the data the way we want.
DISTRIBUTE BY works similar to GROUP BY in the sense that it controls how reducers receive rows for
processing, while SORT BY controls the sorting of data inside the reducer. Note that Hive requires that the
DISTRIBUTE BY clause come before the SORT BY clause.
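The example that the next section refers to is not shown in this copy; a minimal sketch using the stocks table might look like:
hive> SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC;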
CLUSTER BY Clause
In the previous example, the s.symbol column was used in the DISTRIBUTE BY clause, and the s.symbol
and the s.ymd columns in the SORT BY clause. Suppose that the same columns are used in both clauses and
all columns are sorted in ascending order (the default). In this case, the CLUSTER BY clause is a shorthand way of expressing the same query.
For example, let’s modify the previous query to drop sorting by s.ymd and use CLUSTER BY on s.symbol:
hive> SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
CLUSTER BY s.symbol;
Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY clauses is a way to exploit the parallelism
of SORT BY, yet achieve a total ordering across the output files.
Casting
Here we discuss the cast() function that allows you to explicitly convert a value of one type to another.
Recall our employees table uses a FLOAT for the salary column. Now, imagine for a moment that STRING
was used for that column instead. How could we work with the values as FLOATS?
The following example casts the values to FLOAT before performing a comparison:
SELECT name, salary
FROM employees
WHERE cast (salary AS FLOAT) < 100000.0;
The syntax of the cast function is cast(value AS TYPE). What would happen in the example if a salary value
was not a valid string for a floating-point number? In this case, Hive returns NULL
Block Sampling
Hive offers another syntax for sampling a percentage of blocks of an input path as an alternative to
sampling based on rows:
hive> SELECT * FROM numbersflat TABLESAMPLE(0.1 PERCENT) s;
The smallest unit of sampling is a single HDFS block. Hence, for tables smaller than the typical block size of 128 MB, all rows will be returned.
Joins in Hive
JOIN is a clause that is used for combining specific fields from two tables by using values common to each one. It is used to combine records from two or more tables in the database, and it is more or less similar to SQL JOIN.
Syntax
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference [join_condition]
| table_reference LEFT SEMI JOIN table_reference [join_condition]
| table_reference CROSS JOIN table_reference [join_condition]
We will use the following tables in the join examples in this chapter. Consider the following table named CUSTOMERS.
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
1. INNER JOIN
In an inner JOIN, records are discarded unless join criteria finds matching records in every table being joined.
The following query executes JOIN on the CUSTOMER and ORDER tables, and retrieves the records:
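The query itself is not reproduced here; a minimal sketch that would produce output of the shape shown below (the ORDERS table and its customer_id and amount column names are assumptions) might look like:
hive> SELECT c.id, c.name, c.age, o.amount
FROM customers c JOIN orders o
ON (c.id = o.customer_id);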
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
The ON clause specifies the conditions for joining records between the two tables.
We can use a WHERE clause to reduce the number of rows eligible for the join.
Standard SQL allows a non-equi-join on the join keys. But this is not valid in Hive, primarily because it is
difficult to implement these kinds of joins in MapReduce.
Hive does not currently support using OR between predicates in ON clauses.
We can place multiple conditions in the join condition using the AND operator. Consider the following example, in which the stocks table is joined with dividends:
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL'
2. LEFT OUTER JOIN
The following query demonstrates a LEFT OUTER JOIN between the CUSTOMER and ORDER tables:
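A minimal sketch of the missing query (again assuming an ORDERS table with customer_id, amount and order_date columns):
hive> SELECT c.id, c.name, o.amount, o.order_date
FROM customers c LEFT OUTER JOIN orders o
ON (c.id = o.customer_id);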
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
4. LEFT SEMI-JOIN
A left semi-join returns records from the lefthand table if records are found in the righthand table that satisfy
the ON predicates. It’s a special, optimized case of the more general inner join. Most SQL dialects support an
IN ... EXISTS construct to do the same thing. For instance, an IN-subquery version of the query below attempts to return stock records only on the days of dividend payments, but it doesn't work in Hive.
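A minimal sketch of both forms, using the stocks and dividends tables; the multi-column IN subquery is shown only to illustrate what does not work, while the LEFT SEMI JOIN is the way to express it in Hive:
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
WHERE s.ymd, s.symbol IN (SELECT d.ymd, d.symbol FROM dividends d); -- not valid HiveQL
hive> SELECT s.ymd, s.symbol, s.price_close
FROM stocks s LEFT SEMI JOIN dividends d
ON s.ymd = d.ymd AND s.symbol = d.symbol;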
The reason semi-joins are more efficient than the more general inner join is as follows.
For a given record in the lefthand table, Hive can stop looking for matching records in the righthand table
as soon as any match is found. At that point, the selected columns from the lefthand table record can be
projected.
Right semi-joins are not supported in Hive.
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
5. CARTESIAN PRODUCT JOIN
A Cartesian product joins every row of the left table with every row of the right table, which creates a lot of data. Unlike other join types, Cartesian products are not
executed in parallel, and they are not optimized in any way using MapReduce.
In Hive, this query computes the full Cartesian product before applying the WHERE clause. It could take a
very long time to finish. When the property hive.mapred.mode is set to strict, Hive prevents users from
inadvertently issuing a Cartesian product query.
Cartesian product queries can be useful. For example, suppose there is a table of user preferences, a table
of news articles, and an algorithm that predicts which articles a user would like to read. A Cartesian
product is required to generate the set of all users and all pages.
Join Optimization
Before understanding to how to optimize join process, we need to understand how Hive join process is
carried out internally.
HIVE Join
Each time we run a join query, Hive internally generates a MapReduce job, so by default one join produces one MapReduce job. If more than two tables are involved in a join statement, more than one MapReduce job may be generated. For now, let's consider a join statement in which only two tables are involved, so Hive will generate one MapReduce job.
Like any other MapReduce job, it starts with the mapper phase. In the mapper phase, individual mappers read data from the tables (the table data is physically stored in data files, which are logically split into input splits, and each mapper reads one input split) and emit <key, value> pairs. Here the key is the join key (the column in the join condition) and the value is the entire tuple.
The mapper output then goes through the shuffle and sort phase, in which all tuples with the same join key go to the same reducer.
The reducer phase is the aggregation phase where the actual join happens. The reducer takes the sorted results as input and joins records with the same join keys.
This process is also called a Shuffle Join or Common Join. It is also called a reduce-side join because the actual join is done in the reducer phase.
Need of Optimization
1. One way to speed up a join with a small table is to give Hive an explicit map-join hint, such as /*+ MAPJOIN(small_table) */, in the query. This isn't a good user experience, because sometimes the user may give the wrong hint or may not give any hint at all.
2. Another (better, in my opinion) way to turn on map joins is to let Hive do it automatically. Simply set hive.auto.convert.join to true in your config, and Hive will automatically use map joins for any tables smaller than hive.mapjoin.smalltable.filesize (default 25 MB). These two properties can also be set manually from the Hive terminal using the SET command.
set hive.auto.convert.join=true;
<property>
<name>hive.optimize.bucketmapjoin</name>
<value>true</value>
<description>Whether to try bucket mapjoin</description>
</property>
Each dept will be processed separately by a reducer and records will be sorted by id and name fields
within each dept separately.
4. Enable Tez Execution Engine
Instead of running Hive queries on the venerable MapReduce engine, we can improve their performance by 100% to 300% by running them on the Tez execution engine. We can enable the Tez engine with the property below from the hive shell.
hive> set hive.execution.engine=tez;
9. Enable Vectorization
By default, Hive processes rows one by one: each row of data goes through all operators before the next one is processed. This is very inefficient in terms of CPU usage.
To improve the efficiency of CPU instruction and cache usage, Hive (version 0.13.0 and later) uses
vectorization. This is a parallel processing technique, in which an operation is applied to a block of 1024
rows at a time rather than a single row. Each column in the block is represented by a vector of a
primitive data type. The inner loop of execution effectively scans these vectors, avoiding method calls,
deserialization, and unnecessary if-then-else instructions.
Vectorization only works with columnar formats, such as ORC and Parquet.
We can enable vectorized query execution by setting below three properties in either hive shell or hive-
site.xml file.
hive> set hive.vectorized.execution.enabled = true;
hive> set hive.vectorized.execution.reduce.enabled = true;
hive> set hive.vectorized.execution.reduce.groupby.enabled = true;
If possible, Hive will apply operations to vectors. Otherwise, it will execute the query with vectorization
turned off.
10. Controls Parallel Reduce Tasks
We can control the number of parallel reduce tasks that can be run for a given hive query with below
properties.
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>256000000</value>
<description>size per reducer.The default is 256Mb, i.e if the input size is
1G, it will use 4 reducers.</description>
</property>
<property>
<name>hive.exec.reducers.max</name>
<value>1009</value>
<description>
max number of reducers will be used. If the one specified in the
configuration parameter mapred.reduce.tasks is negative, Hive will use this
one as the max number of reducers when automatically determine number of
reducers.
</description>
</property>
We can also set the number of parallel reduce tasks to a fixed value with the property below.
hive> set mapred.reduce.tasks=32;
EXPLAIN
Hive provides an EXPLAIN command that shows the logical and physical execution plan for a query. The syntax
for this statement is as follows:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
AUTHORIZATION is supported from HIVE 0.14.0 via HIVE-5961.
The use of EXTENDED in the EXPLAIN statement produces extra information about the operators in the plan.
This is typically physical information like file names.
A Hive query gets converted into a sequence (it is more a Directed Acyclic Graph) of stages. These stages may
be map/reduce stages or they may even be stages that do metastore or file system operations like move and
rename. The explain output has three parts:
The Abstract Syntax Tree for the query
The dependencies between the different stages of the plan
The description of each of the stages
The description of the stages itself shows a sequence of operators with the metadata associated with the
operators. The metadata may comprise things like filter expressions for the FilterOperator or the select
expressions for the Select Operator or the output file names for the FileSinkOperator.
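As a minimal illustration, reusing the employees table from the earlier examples:
hive> EXPLAIN SELECT sum(salary) FROM employees;
hive> EXPLAIN EXTENDED SELECT sum(salary) FROM employees;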
Indexes are maintained in a separate table in Hive so that it won’t affect the data inside the table, which
contains the data. Another major advantage for indexing in Hive is that indexes can also be partitioned
depending on the size of the data we have.
Types of Indexes in Hive
Compact Indexing
Bitmap Indexing
Bitmap indexing was introduced in Hive 0.8 and is commonly used for columns with few distinct values.
An ALTER INDEX ... REBUILD statement completes the index creation for the table (see the sketch below).
Running an average-age query on the olympic table before indexing returns 26.405433646812956, and the time taken for this operation is 21.08 seconds.
Now, let’s create the index for this table:
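The statement itself is shown only as a screenshot in the original; a minimal sketch of a compact index on the age column (the index name is an assumption) might look like:
hive> CREATE INDEX olympic_index
ON TABLE olympic (age)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
hive> ALTER INDEX olympic_index ON olympic REBUILD;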
Here we have created an index for the ‘olympic’ table on the age column. We can view the indexes created
for the table by using the below command:
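The command is shown only as a screenshot in the original; a sketch of it:
hive> SHOW FORMATTED INDEX ON olympic;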
The output lists the indexes available for the 'olympic' table.
Now, let’s perform the same Average operation on the same table.
We have now got the average age as 26.405433646812956, which is same as the above, but now the
time taken for performing this operation is 17.26 seconds, which is less than the above case.
Now we know that by using indexes we can reduce the time of performing the queries.
We can now see that we have two indexes available for our table.
Average Operation with Two Indexes
Now, let’s perform the same Average operation having the two indexes.
We have successfully deleted one index i.e., olympic_index ,which is a compact index.
We now have only one index available for our table, which is a bitmap index.
BIGINT floor(double a): Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT ceil(double a): Returns the minimum BIGINT value that is equal to or greater than the double.
double rand(), rand(int seed): Returns a random number that changes from row to row.
string concat(string A, string B, ...): Returns the string resulting from concatenating B after A.
string concat_ws(string delimiter, string str1, str2, ...): Concatenates the given strings (and string-typed columns), separated by the delimiter; it works only on strings.
string substr(string A, int start): Returns the substring of A starting from the start position to the end of string A.
string substr(string A, int start, int length): Returns the substring of A starting from the start position with the given length.
string upper(string A): Returns the string resulting from converting all characters of A to upper case.
string find_in_set(string search_string, string source_string_list): Searches for search_string in source_string_list and returns the position of its first occurrence. The source_string_list must be comma delimited.
string lower(string A): Returns the string resulting from converting all characters of A to lower case.
string trim(string A): Returns the string resulting from trimming spaces from both ends of A.
string ltrim(string A): Returns the string resulting from trimming spaces from the beginning (left-hand side) of A.
string regexp_replace(string A, string B, string C): Returns the string resulting from replacing all substrings in A that match the Java regular expression B with C.
<type> cast(<expr> as <type>): Converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
string from_unixtime(int unixtime): Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
string to_date(string timestamp): Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int year(string date): Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int length(string A): Returns the number of characters in the string.
int month(string date): Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int day(string date): Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
string get_json_object(string json_string, string path): Extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid.
string lpad(string str, int len, string pad): Returns str left-padded with pad to a length of len characters.
string rpad(string str, int len, string pad): Returns str right-padded with pad to a length of len characters.
Example
The following queries demonstrate some built-in functions:
round() function
hive> SELECT round(2.6) from temp;
On successful execution of query, you get to see the following response:
3.0
floor() function
hive> SELECT floor(2.6) from temp;
On successful execution of the query, you get to see the following response:
2.0
ceil() function
hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
SPACE :
SPACE function returns the specified number of spaces.
hive> select space(10),name from Tri100;
rahul
Mohit
Rohan
Ajay
srujay
SPLIT :
Syntax: split(string str, string pat)
The split function splits the string around matches of the pattern pat and returns an array of strings.
hive> select split('hadoop:hive',':') from Tri100 where sal=22000;
["hadoop","hive"]
FORMAT_NUMBER :
Syntax: format_number(number X, int D)
Formats the number X to a format like '#,###,###.##', rounded to D decimal places, and returns the result as a string. If D = 0, the result has no decimal point or fractional part.
hive> select name,format_number(Hike,2) from Tri100;
rahul 40,000.00
Mohit 25,000.00
Rohan 40,000.00
Ajay 45,000.00
srujay 30,000.00
INSTR :
Syntax: instr(string str, string substr)
Returns the position of the first occurrence of substr in str. Returns NULL if either of the arguments is NULL and returns 0 if substr could not be found in str. Be aware that this is not zero-based; the first character in str has index 1.
hive> select instr('rahul','ul') from Tri100 where sal=22000;
4
NGRAMS :
Syntax: ngrams(array<array<string>>, int N, int K, int pf)
Returns the top-K N-grams from a set of tokenized sentences, such as those returned by the sentences() UDAF.
hive> select ngrams(sentences(name),1,5)from Tri100 ;
[{"ngram":["Ajay"],"estfrequency":1.0},{"ngram":["Mohit"],"estfrequency":1.0},{"ng
ram":["Rohan"],"estfrequency":1.0},{"ngram":["rahul"],"estfrequency":1.0},{"ngram":["
srujay"],"estfrequency":1.0}]
Parse URL :
Syntax: “parse_url(string urlString, string partToExtract [, string keyToExtract])”
Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF,
PROTOCOL, AUTHORITY, FILE, and USERINFO.
hive> select
parse_URL('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1','HOST')from Tri100 where
sal=22000;
facebook.com
Printf :
Syntax: “printf(String format, Obj… args)”
Returns the input formatted according to printf-style format strings.
hive> select printf("color %s, number1 %d, float %f",'red',89,3.14) from Tri100
where sal=22000;
color red, number1 89, float 3.140000
Regexp_Extract :
Syntax: “regexp_extract(string subject, string pattern, int index)”
Returns the string extracted using the pattern.
hive> select regexp_extract('foothebar','foo(.*?)(bar)',2) from Tri100 where
sal=22000;
bar
hive> select regexp_extract('foothebar','foo(.*?)(bar)',1) from Tri100 where
sal=22000;
the
hive> select regexp_extract('foothebar','foo(.*?)(bar)',0) from Tri100 where
sal=22000;
foothebar
Regexp_Replace :
Syntax: “regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)”
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular
expression syntax defined in PATTERN with instances of REPLACEMENT.
hive> select regexp_replace('foothebar','oo|ba','') from Tri100 where sal=22000;
fther
Sentences :
Syntax: “sentences(string str, string lang, string locale)”
Tokenizes a string of natural language text into words and sentences, where each sentence is broken at the
appropriate sentence boundary and returned as an array of words. The ‘lang’ and ‘locale’ are optional
arguments.
Str_to_map :
Syntax: “str_to_map(text[, delimiter1, delimiter2])”
Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2
splits each K-V pair. Default delimiters are ‘,’ for delimiter1 and ‘=’ for delimiter2.
hive> select str_to_map(concat('Names=',name,'&','Hike=',Hike)) from Tri100;
{"Names=rahul&Hike=40000":null}
{"Names=Mohit&Hike=25000":null}
{"Names=Rohan&Hike=40000":null}
{"Names=Ajay&Hike=45000":null}
{"Names=srujay&Hike=30000":null}
Translate :
Syntax: “translate(string|char|varchar input, string|char|varchar from, string|char|varchar to)”
Translates the input string by replacing the characters present in the from string with the corresponding
characters in the to string. If any of the parameters to this UDF are NULL, the result is NULL as well.
hive> select translate('hello','hello','hi') from Tri100 where sal=22000;
hi
hive> select translate('Make sure u knew that code','e','o') from Tri100 where
sal=22000;
Mako suro u know that codo
Aggregate Functions and Normal Queries:
Let's consider that the Tri100 table has the following data:
hive> select * from Tri100;
OK
1 rahul Hyderabad 30000 40000
2 Mohit Banglore 22000 25000
3 Rohan Banglore 33000 40000
4 Ajay Bangladesh 40000 45000
5 srujay Srilanka 25000 30000
Time taken: 0.184 seconds, Fetched: 5 row(s)
SUM
Returns the sum of the elements in the group or sum of the distinct values of the column in the group.
hive> select sum(sal) from Tri100;
OK
150000
Time taken: 17.909 seconds, Fetched: 1 row(s)
hive> select Sum(sal) from Tri100 where location='Banglore';
OK
55000
Time taken: 18.324 seconds, Fetched: 1 row(s)
Count
count(*) – Returns the total number of retrieved rows, including rows containing NULL values;
count(expr) – Returns the number of rows for which the supplied expression is non-NULL;
count(DISTINCT expr[, expr]) – Returns the number of rows for which the supplied expression(s) are unique
and non- NULL;
hive> select count(*) from Tri100;
OK
5
Time taken: 16.307 seconds, Fetched: 1 row(s)
Average
Returns the average of the elements in the group or the average of the distinct values of the column in the group.
hive> select avg(sal) from Tri100 where location='Banglore';
OK
27500.0
Time taken: 17.276 seconds, Fetched: 1 row(s)
hive> select avg(distinct sal) from Tri100;
OK
30000.0
Time taken: 17.276 seconds, Fetched: 1 row(s)
Minimum
Returns the minimum of the column in the group.
hive> select min(sal) from Tri100;
OK
22000
Time taken: 17.368 seconds, Fetched: 1 row(s)
Maximum
Returns the maximum of the column in the group.
hive> select max(sal) from Tri100;
OK
40000
Time taken: 17.267 seconds, Fetched: 1 row(s)
Variance
Returns the variance of a numeric column in the group.
hive> select variance(sal) from Tri100;
OK
3.96E7
Time taken: 17.223 seconds, Fetched: 1 row(s)
Standard Deviation
Returns the Standard Deviation of a numeric column in the group.
hive> select stddev_pop(sal) from Tri100;
OK
6292.8530890209095
Time taken: 18.63 seconds, Fetched: 1 row(s)
Returns the unbiased sample Standard Deviation of a numeric column in the group.
hive> select stddev_samp(sal) from Tri100;
OK
7035.623639735144
Time taken: 17.299 seconds, Fetched: 1 row(s)
Covariance
Returns the population covariance of a pair of numeric columns in the group.
hive> select covar_pop(sal,Hike) from Tri100;
OK
4.4E7
Time taken: 18.888 seconds, Fetched: 1 row(s)
Returns the sample covariance of a pair of numeric columns in the group.
hive> select covar_samp(sal,Hike) from Tri100;
OK
5.5E7
Time taken: 18.302 seconds, Fetched: 1 row(s)
Correlation
Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group.
hive> select corr(sal,Hike) from Tri100;
OK
0.9514987095307504
Time taken: 17.514 seconds, Fetched: 1 row(s)
Percentile
Returns the exact pth percentile of a column in the group(does not work with floating point types).P must be
between 0 and 1. NOTE: A true percentile “ Percentile(BIGINT col,P)”can only be computed for INTEGER
values. Use PERCENTILE_APPROX if you are input is non-integral.
hive> select percentile(sal,0) from Tri100;
-- Output gives the lowest value of the table, since P = 0 corresponds to the 0th percentile.
OK
22000.0
Time taken: 17.321 seconds, Fetched: 1 row(s)
hive> select percentile(sal,1) from Tri100;
-- Output gives the highest value of the table, since P = 1 corresponds to the 100th percentile.
OK
40000.0
Time taken: 17.966 seconds, Fetched: 1 row(s)
Histogram
Computes a histogram of a numeric column in the group using b non-uniformly spaced bins.The output is an
array of size b of double-valued (x,y) coordinates that represent the bin centers and heights.
“histogram_numeric(col, b)”
Collections
Returns a set of objects with duplicate elements eliminated.
hive> select collect_set(Hike) from Tri100;
OK
[45000,40000,25000,30000]
Time taken: 18.29 seconds, Fetched: 1 row(s)
Returns a list of objects with duplicates (as of Hive 0.13.0).
hive> select collect_list(Hike) from Tri100;
OK
[40000,25000,40000,45000,30000]
Time taken: 17.217 seconds, Fetched: 1 row(s)
NTILE
This function divides an ordered partition into x groups called buckets and assigns a bucket number to each
row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common
summary statistics. (As of Hive 0.11.0.).
from_utc_timestamp :
This function assumes that the timestamp in the first argument is UTC and converts it to the time zone given in the second argument. This function and the to_utc_timestamp function do time zone conversions. In the following example, the first argument is a string.
hive> SELECT from_utc_timestamp('1970-01-01 07:00:00', 'JST');
OK
1970-01-01 16:00:00
Time taken: 0.148 seconds, Fetched: 1 row(s)
to_utc_timestamp :
This function does the reverse conversion: it assumes the timestamp in the first argument is in the time zone given by the second argument and converts it to UTC.
unix_timestamp :
This function converts the date to the specified date format and returns the number of seconds between the
specified date and Unix epoch. If it fails, then it returns 0. The following example returns the value
1237487400
hive> SELECT unix_timestamp ('2009-03-20', 'yyyy-MM-dd');
OK
1237487400
Time taken: 0.156 seconds, Fetched: 1 row(s)
unix_timestamp(string date) :
This function converts a date string in the default format 'yyyy-MM-dd HH:mm:ss' to the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC), using the default time zone.
hive> select UNIX_TIMESTAMP('2000-01-01 00:00:00');
OK
946665000
Time taken: 0.147 seconds, Fetched: 1 row(s)
DATE CONVERSIONS :
Convert MMddyyyy Format to Unixtime
Note: M Should be Capital Every time in MMddyyyy Format
create table sample(rn int, dt string) row format delimited fields terminated by ',';
load data local inpath '/home/user/Desktop/sample.txt' into table sample;
select * from sample;
-- dt values:
02111993
03121994
03131995
04141996
select cast(substring(from_unixtime(unix_timestamp(dt, 'MMddyyyy')),1,10) as date)
from sample;
OK
1993-02-11
1994-03-12
1995-03-13
1996-04-14
Time taken: 0.112 seconds, Fetched: 4 row(s)
1. Regular UDF:
UDFs work on a single row in a table and produce a single row as output. Its one to one relationship
between input and output of a function. e.g Hive built in TRIM() function.
Hive allows us to define our own UDFs as well. Lets take an example of student record.
Problem Statement: Find the maximum marks obtained out of four subject by an student.
There are two different interfaces you can use for writing UDFs for Apache Hive. One is really simple, the
other… not so much.
Simple API - org.apache.hadoop.hive.ql.exec.UDF
Complex API - org.apache.hadoop.hive.ql.udf.generic.GenericUDF
The simple API (org.apache.hadoop.hive.ql.exec.UDF) can be used so long as your function reads
and returns primitive types. By this I mean basic Hadoop & Hive writable types - Text, IntWritable,
LongWritable, DoubleWritable, etc.
However, if you plan on writing a UDF that can manipulate embedded data structures, such as Map, List,
and Set, then you’re stuck using org.apache.hadoop.hive.ql.udf.generic.GenericUDF, which is
a little more involved.
I’m going to walk through an example of building a UDF in each interface. I will provide code and tests for
everything I do.
Simple API:
- Reduced performance due to the use of reflection: each call of the evaluate method is reflective, and all arguments are evaluated and parsed.
- Limited handling of complex types: arrays are handled but suffer from type-erasure limitations.
- Variable numbers of arguments are not supported.
- Very easy to write.
Generic API:
- Optimal performance: no reflective calls, and arguments are parsed lazily.
- All complex parameters are supported (even nested ones like array<array<string>>).
- Variable numbers of arguments are supported.
- Not very difficult to write, but not well documented.
You can pass multiple arguments to the UDF. Whatever arguments you pass, they are not presented to your code as-is. Rather, the initialize() method receives an array of ObjectInspector objects, one ObjectInspector per argument: arguments[0] represents an inspector for the first argument you passed to the UDF, arguments[1] represents the inspector for the second argument, and so on. ObjectInspectors are helpful for looking into the internal structure of an object.
getDisplayString()
The getDisplayString() method is really helpful to the developer, since it can return meaningful troubleshooting information. Instead of returning a generic error message, Hive calls this method whenever there is an error executing the UDF. The UDF developer can compile useful information here that can be instrumental in troubleshooting the runtime error or exception. When a problem is detected while executing the UDF, Hive throws a HiveException but appends the information returned by the getDisplayString() method to the exception. In the example, this method returns the name and type of the column that caused the problem.
initialize()
When a UDF is used in a query, Hive loads the UDF into memory. The initialize() method is called once, when the UDF is first invoked. The purpose of this method is to check the types of the arguments that will be passed to the UDF. For each value passed to the UDF, the evaluate() method will be called; so if the UDF is going to be called for 10 rows, evaluate() will be called 10 times. However, Hive first calls the initialize() method of the generic UDF before any call to evaluate(). The goals of initialize() are to:
validate the input arguments and complain if the input is not as expected
save the ObjectInspectors of the input arguments for later use during evaluate()
provide an ObjectInspector to Hive for the return type
You can do various ways to validate the input, like checking the arguments array for size, category on input
type (remember PrimitiveObjectInspector, MapObjectInspector etc. ?), checking the size of
underlying objects (in case of a Map or Struct etc.). Validation can go up to any extent that you choose,
including traversing the entire object hierarchy and validating every object. When the validation fails, we can
throw a UDFArgumentException or one of its subtypes to indicate error.
The Object Inspector for the return type, should be constructed within the initialize() method and
returned. We can use the factory methods of ObjectInspectorFactory class. For example, if the UDF is
going to return a MAP type, then we can use the getStandardMapObjectInspector() method which
accept information about how the Map will be constructed (e.g. Key type of the Map and the Value type of the
Map).
The saved Object inspectors are instrumental when we try to obtain the input value in the evaluate()
method.
evaluate()
SELECT GenericDouble(bonus) FROM emp;
Suppose the emp table has 10 rows in it. Then the evaluate() method will be called 10 times, once for each column value in those 10 rows. All the values passed to evaluate(), however, are serialized bytes. Hive delays the instantiation of objects until a request for the object is made, hence the name DeferredObject. Based on the type of value passed to the UDF, the DeferredObject could represent a lazily initialized object; in the above example, it could be an instance of the LazyDouble class. When the value is requested, for example via LazyDouble.getWritableObject(), the bytes are deserialized into an object and returned.
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException
{
    if (arguments.length != 2)
    {
        throw new UDFArgumentLengthException(
            "arrayContainsExample only takes 2 arguments: List<T>, T");
    }
    // 1. Check we received the right object types.
    ObjectInspector a = arguments[0];
    ObjectInspector b = arguments[1];
    if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector))
    {
        throw new UDFArgumentException(
            "first argument must be a list / array, second argument must be a string");
    }
    this.listOI = (ListObjectInspector) a;
    this.elementOI = (StringObjectInspector) b;
    // The return type of our function is a boolean, so we provide the
    // corresponding object inspector.
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
}

@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException
{
    // Get the list and string from the deferred objects using the saved object inspectors.
    List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
    String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());
    // (The original copy ends here; a straightforward completion that returns
    // whether the list contains the string:)
    if (list == null || arg == null)
    {
        return null;
    }
    for (String s : list)
    {
        if (arg.equals(s))
        {
            return Boolean.TRUE;
        }
    }
    return Boolean.FALSE;
}
User-Defined Aggregation Functions (UDAFs) are an exceptional way to integrate advanced data-processing
into Hive. Aggregate functions perform a calculation on a set of values and return a single value.
An aggregate function is more difficult to write than a regular UDF. Values are aggregated in chunks
(potentially across many tasks), so the implementation has to be capable of combining partial aggregations
into a final result.
We will start our discussion with source code that finds the largest integer from the input file.
The code is explained in the example below; we need to build a jar file from the source code and then use that jar file while executing the Hive statements shown in the upcoming section.
By using select statement command we can see if the contents of the dataset Numbers_List have been
loaded to the table Num_list or not.
Add the Jar file in hive with complete path (Jar file made from source code need to be added)
Use the select statement to find the largest number from the table Num_List
After successfully following the above steps, we can use a SELECT statement to find the largest number in the table.
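The exact commands appear only as screenshots in the original; a minimal sketch (the jar path, function name, class name and column name are assumptions) might look like:
hive> ADD JAR /home/himanshu/udaf/maxvalue.jar;
hive> CREATE TEMPORARY FUNCTION max_value AS 'com.example.MaxValueUDAF';
hive> SELECT max_value(num) FROM Num_list;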
Thus, we can see that the largest number in the table Num_list is 99.
UDAF:
UDAF is a user-defined aggregate function (UDAF) that accepts a group of values and returns a single
value. Users can implement UDAFs to summarize and condense sets of rows in the same style as the built-
in COUNT, MAX(), SUM(), and AVG() functions.
UDTF:
A UDTF is a User Defined Table-Generating Function that operates on a single row and produces multiple rows (a table) as output.
package com.Myhiveudtf;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// The surrounding class (which extends GenericUDTF) and the initialize() signature are not shown
// in the original copy; they are filled in here so the fragment reads as valid Java.
public class NameParserGenericUDTF extends GenericUDTF {

    // Object inspector for the single string argument, saved for use in process().
    private PrimitiveObjectInspector stringOI = null;

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // Validate that we received exactly one string argument.
        if (args.length != 1
                || args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
                || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory()
                        != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentException("NameParserGenericUDTF() takes a string as a parameter");
        }

        // Input inspector.
        stringOI = (PrimitiveObjectInspector) args[0];

        // Output inspector: each generated row has two string fields, id and phone_number.
        List<String> fieldNames = new ArrayList<String>(2);
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
        fieldNames.add("id");
        fieldNames.add("phone_number");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }
    // The process() and close() methods of this class follow below.
initialize()
Hive calls the initialize() method to notify the UDTF of the argument types to expect. The UDTF must then return an object inspector corresponding to the row objects that the UDTF will generate.
@Override
public void process(Object[] record) throws HiveException
{
final String id = stringOI.getPrimitiveJavaObject(record[0]).toString();
ArrayList<Object[]> results = processInputRecord(id);
Iterator<Object[]> it = results.iterator();
while (it.hasNext())
{
Object[] r = it.next();
forward(r);
}
}
process()
Once the initialize() method has been called, Hive gives rows to the UDTF through the process() method. Within process(), the UDTF can produce and forward rows to other operators by calling the forward() method.
Close()
Finally, Hive calls the close() method when all the rows have been passed to the UDTF. This method allows for any cleanup that is necessary before returning from the User Defined Table-Generating Function. Note that we cannot write any records from this function.
In our example, there is no data that needs to be cleaned up, so we can go ahead and execute the program.
Steps for Executing Hive UDTF:
Step 1: After writing the above code in Eclipse, add the below mentioned jar files in the program and then
export it in the Hadoop environment as a jar file.
Step 2: Create a table named ‘phone’ with a single column named ‘id’.
Step 3: Load the input data set phn_num contents into the table phone.
Step 4: Check if the data contents are loaded or not, using select statement.
Step 7: Use the select statement to populate the above table of strings with its primary id.
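The commands themselves appear only as screenshots in the original; a minimal sketch (the jar path and function name are assumptions, the class name follows the code above) might look like:
hive> ADD JAR /home/himanshu/udtf/nameparser.jar;
hive> CREATE TEMPORARY FUNCTION phone_parser AS 'com.Myhiveudtf.NameParserGenericUDTF';
hive> SELECT phone_parser(id) FROM phone;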
The output shows that the single input column has been expanded so that multiple values are returned against each primary id.