Unit-5
Exploring Hive: Introducing Hive, Hive Service, Built-In Functions in Hive, Hive DDL,
Data Manipulation in Hive, Data Retrieval Queries, Using JOINS in Hive.
Introducing Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially, Hive was developed by Facebook; later, the Apache Software Foundation took it up
and developed it further as an open-source project under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores schemas in a database and processes data into HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
The architecture of Hive consists of various components. These components are described as
follows:
User Interface (UI)—Allows you to submit queries to the Hive system for execution.
Driver—Receives the submitted queries. This driver component creates a session handle
for the submitted query and then sends the query to the compiler to generate an execution
plan.
Compiler—Parses the query, performs semantic analysis on different query blocks and
query expressions, and generates an execution plan.
Metastore—Stores all the information related to the structure of the various tables and
partitions in the data warehouse. It also includes column and column type information and the
serializers and deserializers necessary to read and write data. It also contains information
about the corresponding HDFS files where your data is stored.
Execution Engine—Executes the execution plan created by the compiler. The plan is in the
form of a Directed Acyclic Graph (DAG) to be executed in various stages. This engine
manages the dependencies between the different stages of a plan and is also responsible for
executing these stages on the appropriate system components.
Hive Services
You can list all the Hive services by typing hive --service help. Some Hive services
are as follows:
CLI—The command-line interface is the shell window that we get after the installation of
Hive. It is the inbuilt default service of Hive.
Hive server—Runs Hive as a server exposing a Thrift service, which enables access for a
number of clients written in different languages. Applications that use JDBC or ODBC
connectors need a running Hive server to communicate with Hive.
Hive Web Interface (HWI)—The Hive Web Interface is the GUI of Hive on which we
can execute queries. It is an alternative to the shell.
Metastore—This service runs along with the main Hive service whenever Hive starts; this is
the default behavior. It is also possible to run the metastore as a standalone (remote)
process. For this, you just need to set the “METASTORE_PORT” environment variable so
that the server listens on the specified port.
Hive client—There are many different mechanisms for applications to connect to Hive
when you run it as a server, that is, hiveserver. Following is one of the clients of Hive:
Thrift Client—It is very easy to run Hive commands from a wide variety of
programming languages through the Hive Thrift client. Thrift clients are available
for numerous languages, such as C++, Java, PHP, Python, and Ruby.
Similarly, there are JDBC and ODBC drivers that are compatible with Hive. These drivers
connect Hive with the metastores and help run applications.
Hive Variables:
Hive allows you to set variables that can be referred to in a Hive script. For this purpose, you
need to use the -d or --define option, as shown in the following commands:
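For example, a command of the following form can be used (the variable name tname is illustrative; sampletable and sampledatabase follow the explanation below):

```shell
# Define a variable on the command line and refer to it inside the HiveQL
hive -d tname=sampletable -e 'USE sampledatabase; CREATE TABLE ${tname} (id INT);'
```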
In the preceding commands, a table named sampletable is created in the database
sampledatabase.
By default, the variable substitution option is enabled. However, you can disable this option
by using the following command:
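A sketch of the command, using the standard property that controls substitution:

```sql
set hive.variable.substitute=false;
```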
In general, there are three namespaces for defining variables: hiveconf, system, and env. You
can also define custom variables in a separate namespace by using the define or hivevar options.
Hive Properties:
The hive-site.xml file stores the configuration properties of Hive. These properties can be
overwritten by the developers. To overwrite the properties of the hive-site.xml file, the set
command is used.
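For example, a property from hive-site.xml can be overridden for the current session as follows (the property and value here are illustrative):

```sql
set hive.exec.scratchdir=/tmp/mydir;
```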
Hive Queries:
Hive allows you to simultaneously execute one or more queries. These queries can be stored
in and executed from files. The extension of a query file in Hive is .hql or .q. Let’s take an
example of a Hive query file, named ourquery.hql, stored in the …/home/weusers/
queries/folder.
Now, type the following commands to execute the query stored in the ourquery.hql file:
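A sketch of the command, assuming the file sits in the queries folder mentioned above:

```shell
hive -f /home/weusers/queries/ourquery.hql
```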
You can also execute a Hive query in the background, as shown in the following command:
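A sketch of running the same file in the background (the log file name is illustrative):

```shell
hive -f /home/weusers/queries/ourquery.hql > query.log 2>&1 &
```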
To list the directory content in the Hadoop dfs, type the following command:
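A sketch of such a command (the warehouse path is the usual default and is assumed here):

```shell
hadoop fs -ls /user/hive/warehouse
```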
Hive supports two kinds of data types: primitive type and complex type. Primitive data types
are built-in data types, which also act as basic structures for building more sophisticated data
types. Primitive data types are associated with columns of a table. Figure 12.2 shows a list of
primitive data types available in Hive:
Complex data types are customized data types that can be created by using primitive data
types. The following types of complex data types are available in Hive:
Structs—The elements within this type can be accessed by using the Dot (.) operator.
Maps—The elements in a map are accessed by using the ['element name'] notation.
Arrays—The elements in an array have to be of the same data type. Elements can be
accessed by using the [n] notation where n represents the index of the array.
Built-In Functions in Hive
You must already be aware of the concept of functions and the role they play in the
development of an application. A function is a group of commands used to perform a
particular task in a program and return an outcome. Like every programming language, Hive
also has its set of built-in functions (also known as pre-defined functions). Table 12.3 lists the
built-in functions available in Hive:
To perform various types of computations, Hive provides aggregate functions. Table 12.4
lists the aggregate functions available in Hive:
Hive DDL
Data Definition Language (DDL) is used to describe data and data structures of a database.
Hive has its own DDL, similar to SQL DDL, which is used for managing, creating, altering,
and dropping databases, tables, and other objects in a database. Similar to other SQL
databases, Hive databases also contain namespaces for tables. If the name of the database is
not specified, the table is created in the default database. Some of the main commands used in
DDL are as follows:
Let’s learn more about these commands and the operations that can be performed by using
them.
Creating Databases:
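A database can be created with the CREATE DATABASE statement; the name temp_database matches the examples that follow:

```sql
CREATE DATABASE IF NOT EXISTS temp_database;
```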
You can also add a table in a particular database by using the following command:
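A sketch of such a command (the column list is illustrative):

```sql
CREATE TABLE temp_database.temp_table (id INT);
```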
In the preceding command, the name of the database is added before the name of the table.
Therefore, temp_table gets added in the temp_database. In addition, you can also create a
table in the database by using the following commands:
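A sketch of these commands (the column list is illustrative):

```sql
USE temp_database;
CREATE TABLE temp_table (id INT);
```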
In the preceding commands, the USE statement is used for setting the current database to
execute all the subsequent HiveQL statements. In this way, you do not need to add the name
of the database before the table name. The table temp_table is created in the database
temp_database. Furthermore, you can specify DBPROPERTIES in the form of key-value
pairs in the following manner:
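For example (the property keys and values are illustrative):

```sql
CREATE DATABASE temp_database
WITH DBPROPERTIES ('creator' = 'admin', 'created-on' = '2015-01-01');
```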
Viewing a Database:
You can view all the databases present in a particular path by using the following command:
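The command referred to is:

```sql
SHOW DATABASES;
```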
The preceding command shows all the databases present at a particular location.
Dropping a Database:
Dropping a database means deleting it from its storage location. The database can be deleted
by using the following command:
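A sketch of the command, using the database name from the earlier examples:

```sql
DROP DATABASE temp_database;
```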
You must note that if the database to be deleted contains any tables, you first need to delete
tables from the database by using the DROP command. Deleting each table individually from
a database is a tedious task. Hive allows you to delete tables along with the database by using
the following command:
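The CASCADE keyword drops the tables along with the database:

```sql
DROP DATABASE temp_database CASCADE;
```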
Altering Databases:
Altering a database means making changes in the existing database. You can alter a database
by using the following command:
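A sketch of such a command (the property key and value are illustrative):

```sql
ALTER DATABASE temp_database SET DBPROPERTIES ('edited-by' = 'admin');
```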
Creating Tables:
You can create a table in a database by using the CREATE command, as discussed earlier.
Now, let’s learn how to provide the complete definition of a table in a database by using the
following commands:
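A sketch of the commands described below (the comment strings and property values are illustrative):

```sql
USE temp_database;
CREATE TABLE employee (
  ename STRING COMMENT 'employee name',
  salary FLOAT,
  designation STRING
)
COMMENT 'Details of employees'
TBLPROPERTIES ('created_by' = 'admin');
```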
In the preceding commands, temp_database is first set and then the table is created with the
name employee. In the employee table, the columns (ename, salary, and designation) are
specified with their respective data types. TBLPROPERTIES is a set of key-value
properties. Comments are also added in the table to provide more details.
External Table:
External tables are created over data files that reside outside Hive's warehouse directory,
for example elsewhere in HDFS. You can specify the location of a data file by using the
LOCATION keyword while creating the external table. Execute the following commands for creating an
external table:
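A sketch of such commands (the table name, columns, and location path are illustrative):

```sql
CREATE EXTERNAL TABLE ext_employee (
  ename STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employee';
```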
The concept of external table proves to be very useful when HDFS has huge data files,
making copies of which is not feasible due to space constraints.
You can also create a table having the same schema as another table by using the following
command:
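Following the description below, the new table goes before LIKE and the source table after it:

```sql
CREATE TABLE details.employee LIKE details.student;
```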
In the preceding command, the schema of the student table present in the details database gets
copied in the employee table present in the same database. The name of the table to which the
schema gets copied is specified before the LIKE keyword, and the name of the table from
which the schema gets copied is specified after the LIKE keyword.
You can know the structure of the existing table by using the following command:
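The command referred to is:

```sql
DESCRIBE details.student;
```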
The DESCRIBE command shows the structure of the student table present in the details
database.
You can also find the details about any column present in a table by using the following
command:
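As described below, the column name is joined to the table name with a dot:

```sql
DESCRIBE details.student.name;
```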
In the preceding command, the name of the column is specified with the table name using a
dot between them. This will return the data type of the name column of the student table.
Partitioning Tables:
Partitioning of a table means dividing a table into different sections (also called partitions) in
order to view the required data. It helps you to create and view only a particular set of data of
a table. You can create partitions in a table by using the PARTITIONED BY keyword.
Consider the following command:
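A sketch of the command, using the column names given below (data types are assumed):

```sql
CREATE TABLE employee (
  ename STRING,
  sal FLOAT
)
PARTITIONED BY (designation STRING);
```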
Here, we are creating a table named employee with the ename and sal columns. The table is
partitioned on the designation column.
Dropping Tables:
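A table can be removed with the DROP TABLE statement, for example:

```sql
DROP TABLE IF EXISTS employee;
```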
Altering Tables:
Altering a table means modifying or changing an existing table. By altering a table, you can
modify the metadata associated with the table. The table can be modified by using the
ALTER TABLE statement. The altering of a table allows you to:
Rename tables
Modify columns
Rename Tables:
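A table can be renamed with the ALTER TABLE ... RENAME TO statement (the new name here is illustrative):

```sql
ALTER TABLE employee RENAME TO employee_details;
```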
Modify Columns:
You can modify the name of a column in Hive by using the following command:
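A sketch of such a command (the old and new column names are illustrative):

```sql
ALTER TABLE employee CHANGE ename emp_name STRING;
```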
Add Columns:
In Hive, you can also add an extra column to a given table by using the ADD COLUMNS
statement:
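Following the description below, the command adds a sales_month column to the employees table:

```sql
ALTER TABLE employees ADD COLUMNS (sales_month STRING);
```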
The preceding command adds an extra column with the name sales_month in the employees
table. By default, the new column is added at the last position before the partitioned columns.
Replace Columns:
You can replace all the existing columns with new columns in a table by using the following
command:
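A sketch of such a command (the new column names are illustrative):

```sql
ALTER TABLE employees REPLACE COLUMNS (emp_id INT, emp_name STRING);
```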
The REPLACE COLUMNS statement is used to delete all the existing columns and add new
columns in their place.
Data Manipulation in Hive
While loading data into tables, Hive does not perform any type of transformations. The data
load operations in Hive are, at present, pure copy/move operations, which move data files
from one location to another. You can upload data into Hive tables from the local file system
as well as from HDFS. The syntax of loading data from files into tables is as follows:
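The syntax is:

```sql
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename [PARTITION (partcol1 = val1, partcol2 = val2, ...)];
```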
When the LOCAL keyword is specified in the LOAD DATA command, Hive searches for
the local directory. If the LOCAL keyword is not used, Hive checks the directory on HDFS.
On the other hand, when the OVERWRITE keyword is specified, it deletes all the files under
Hive’s warehouse directory for the given table. After that, the latest files get uploaded. If you
do not specify the OVERWRITE keyword, the latest files are added in the already existing
folder.
You can also insert data into tables through queries by using the INSERT command. The
syntax of using the INSERT command is as follows:
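The syntax of the two forms described below is:

```sql
INSERT OVERWRITE TABLE tablename
  [PARTITION (partcol1 = val1, ...) [IF NOT EXISTS]]
  select_statement FROM from_statement;

INSERT INTO TABLE tablename
  [PARTITION (partcol1 = val1, ...)]
  select_statement FROM from_statement;
```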
In the preceding syntax, the INSERT OVERWRITE statement overwrites the current data in
the table or partition. The IF NOT EXISTS statement is given for a partition. On the other
hand, the INSERT INTO statement either appends the table or creates a partition without
modifying the existing data. The insert operation can be performed on a table or a partition.
You can also specify multiple insert clauses in the same query.
Consider two tables, T1 and T2. We want to copy the sal column from T2 to T1 by using the
INSERT command. It can be done as follows:
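The command can be sketched as:

```sql
INSERT OVERWRITE TABLE T1 SELECT sal FROM T2;
```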
Here, we use the keyword OVERWRITE. It means that any previous data in table T1 will be
deleted, and new data from table T2 will be inserted. If we want to retain the previous data in
table T1, we can use the following command:
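The non-destructive form uses INSERT INTO:

```sql
INSERT INTO TABLE T1 SELECT sal FROM T2;
```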
Static partition insertion refers to the task of inserting data into a table by specifying a
partition column value. Consider the following example:
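A sketch of such a command (the partition column name month and the selected columns are assumptions):

```sql
INSERT OVERWRITE TABLE T1 PARTITION (month = 'JANUARY')
SELECT ename, sal FROM T2;
```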
In the preceding command, we are inserting data into the JANUARY partition of the T1
table.
Dynamic Partition Insertion:
In dynamic partition insertion, you need to specify a list of partition column names in the
PARTITION() clause along with the optional column values. A dynamic partition column
always has a corresponding input column in the SELECT statement. If the SELECT
statement has multiple column names, the dynamic partition columns must be specified at the
end of the columns and in the same order in which they appear in the PARTITION() clause.
By default, the feature of dynamic partition is disabled.
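A sketch of dynamic partition insertion over the ds and hr columns discussed below (the data columns key and value are assumptions; the two set commands enable the feature):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE T1 PARTITION (ds, hr)
SELECT key, value, ds, hr FROM T2;
```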
In the preceding command, dynamic partition insertion takes place by including all ds and hr
columns. In the SELECT statement, these columns are listed at the end, in the same order in
which they appear in the PARTITION clause.
Sometimes, you might want to save the result of a SELECT query in flat files so that you
do not have to execute the queries again and again. Consider the following example:
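A sketch of such a command (the local path is illustrative; myfile and employee follow the text below):

```sql
INSERT OVERWRITE LOCAL DIRECTORY '/home/weusers/myfile'
SELECT * FROM employee;
```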
The preceding command inserts the SELECT query result of the employee table in the myfile
file of the local directory.
In Hive, you can create and load data into a table using a single query as follows:
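The create-table-as-select (CTAS) query described below can be sketched as:

```sql
CREATE TABLE T1 AS
SELECT name, sal, month FROM T2;
```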
Here, a new table called T1 would be created, and the schema for the table would be three
columns named name, sal, and month, with the same data types as mentioned in the T2 table.
Update in Hive:
The update operation is available in Hive from version 0.14 onward. The update operation can
only be performed on tables that support ACID properties. The syntax for performing the
update operation in Hive is as follows:
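The syntax is:

```sql
UPDATE tablename SET column = value [, column = value ...]
[WHERE expression];
```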
In the preceding syntax, the name of the table will be followed by the UPDATE statement.
Only those rows that match with the WHERE clause will be updated. You must note that the
partitioning and bucketing columns cannot be updated.
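A sketch of the command discussed below (the table name employee and the new salary value are assumptions; empid 10001 follows the text):

```sql
UPDATE employee SET salary = 50000 WHERE empid = 10001;
```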
In the preceding command, the salary of the employee whose empid is 10001 is updated.
Delete in Hive:
The delete operation is available in Hive from version 0.14 onward. The syntax for
performing the delete operation is as follows:
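The syntax is:

```sql
DELETE FROM tablename [WHERE expression];
```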
In the preceding syntax, only those rows that match the WHERE clause will be deleted. The
table name from which the records are deleted is specified after the DELETE FROM
statement. To delete the records of an employee whose empid is 10001, type the following
command:
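A sketch of the command (the table name employee is an assumption):

```sql
DELETE FROM employee WHERE empid = 10001;
```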
When the preceding command is executed, the records of the employee whose employee id is
10001 are deleted.
Data Retrieval Queries
Hive allows you to perform data retrieval queries by using the SELECT command along with
various types of operators and clauses.
The SELECT statement is the most common operation in SQL. You can filter the required
columns, rows, or both. The syntax for using the SELECT command is as follows:
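The general syntax is:

```sql
SELECT [DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[LIMIT number];
```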
The following query retrieves all the columns and rows from table, mydemotable:
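The query is:

```sql
SELECT * FROM mydemotable;
```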
The WHERE clause is used to search records of a table on the basis of a given condition.
This clause returns a boolean result. For example, a query can return only those sales records
that have an amount greater than 15000 from the US region. Hive also supports a number of
operators (such as > and <) in the WHERE clause. The following query shows an example of
using the WHERE clause:
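A sketch of such a query (the table and column names are assumptions based on the sales example above):

```sql
SELECT * FROM sales WHERE amount > 15000 AND region = 'US';
```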
For example, if you wish to calculate the average marks obtained by the students from all
semesters, you may use the GROUP BY clause as shown in the following example:
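A sketch of the query (the table and column names are assumptions):

```sql
SELECT semester, AVG(marks) FROM students GROUP BY semester;
```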
The following example helps you to count the number of distinct users by gender:
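A sketch of the query (the table and column names are assumptions):

```sql
SELECT gender, COUNT(DISTINCT userid) FROM users GROUP BY gender;
```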
You can also use aggregate functions at the same time, but no two aggregations can have
different DISTINCT columns.
It should be noted that when the GROUP BY clause is used with the SELECT query, the
SELECT statement can include only the columns that appear in the GROUP BY clause and aggregate functions.
The following queries create a table and use the GROUP BY clause on it:
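A sketch of the queries, using the column names clmn1 and clmn2 that appear in the HAVING example below (data types are assumptions):

```sql
CREATE TABLE mydemotable (clmn1 INT, clmn2 INT);
SELECT clmn1, SUM(clmn2) FROM mydemotable GROUP BY clmn1;
```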
The HAVING clause is used to specify a condition on the use of the GROUP BY clause. The
use of the HAVING clause is added in the version 0.7.0 of Hive. The following query shows
an example of using the HAVING clause:
The preceding command displays the clmn1 column of the mydemotable table grouped by
clmn1 where the sum of clmn2 is greater than 15.
The LIMIT clause is used to restrict the number of rows to be returned. Without an ORDER
BY clause, the returned rows are arbitrary. The following query returns 8 arbitrary rows from
mydemotable:
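The query is:

```sql
SELECT * FROM mydemotable LIMIT 8;
```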
Using JOINS in Hive
Hive supports joining of one or more tables to aggregate information. The various joins
supported by Hive are:
Inner joins
Outer joins
Inner Joins:
In case of inner joins, only the records satisfying the given condition get selected. All the
other records get discarded. Figure 12.11 illustrates the concept of inner joins:
Let’s take an example to describe the concept of inner joins. Consider two tables, order and
customer. Table 12.6 lists the data of the order table:
The above table contains the order id (order_id) and the corresponding customer id (cust_id).
Table 12.7 lists the data of the customer table:
The above table contains the customer id (cust_id) and the corresponding customer name
(cust_name).
To know the names of the customers who have placed orders, we need to take the inner join
of the two tables, as follows:
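A sketch of the inner join (order is a reserved word in HiveQL, so it is written in backticks; the column names follow the tables described above):

```sql
SELECT c.cust_id, c.cust_name, o.order_id
FROM customer c JOIN `order` o ON (c.cust_id = o.cust_id);
```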
Outer Joins:
Sometimes, you need to retrieve all the records from one table and only some records from
the other table. In such cases, you have to use the outer join. Figure 12.12 illustrates the
concept of outer joins:
Right Outer Joins:
In this type of join, all the records from the table on the right side of the join are retained.
Figure 12.13 illustrates the concept of right outer join:
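A sketch of the right outer join that produces the result discussed below:

```sql
SELECT o.order_id, c.cust_id, c.cust_name
FROM `order` o RIGHT OUTER JOIN customer c ON (o.cust_id = c.cust_id);
```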
You can see that the order_id field for the customer ‘Monty’ is marked as NULL. It is
because there is no related record matching the given customer id in the order table.
Left Outer Joins:
In this type of join, all the records from the table on the left side of the join are retained.
Figure 12.14 illustrates the concept of left outer joins:
The query involving left outer joins can be written as follows:
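A sketch of the query:

```sql
SELECT o.order_id, o.cust_id, c.cust_name
FROM `order` o LEFT OUTER JOIN customer c ON (o.cust_id = c.cust_id);
```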
Here, all the entries from the order table are present in the output. A field is kept blank as
there is no corresponding value for the key in the customer table. The preceding query gives
the following output:
Full Outer Joins:
In this case, all the fields from both tables are included. For the entries that do not have any
match, a NULL value would be displayed.
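A sketch of the full outer join:

```sql
SELECT o.order_id, c.cust_id, c.cust_name
FROM `order` o FULL OUTER JOIN customer c ON (o.cust_id = c.cust_id);
```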
Cartesian Product Joins:
In cartesian product joins, all the records of one table are combined with another table in all
possible combinations. This type of join does not involve any key column to join the tables.
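A sketch of such a join; note that no ON condition is given:

```sql
SELECT * FROM `order` CROSS JOIN customer;
```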
Map-Side Joins:
In a map-side join operation in Hive, the job is assigned to a Map Reduce task that consists of
two stages: Map and Reduce. At the Map stage, the data is read from the join tables, and the
‘join key’ and ‘join value’ pairs are written to an intermediate file. This intermediate file is then
sorted and merged in the shuffle stage. At the Reduce stage, the sorted result is taken as input,
and the joining task is completed.
A map-side join is similar to the normal join, but here, all the tasks are performed by the
mapper alone. The map-side join is preferred when one of the tables is small. Suppose you have two tables,
out of which one is a small table. Now, when a map reduce task is submitted, a Map Reduce
local task will be created to read the data of the small table from HDFS and store it into an in-
memory hash table. After reading the data, the Map Reduce local task serializes the in-
memory hash table into a hash table file.
In the next stage, the main join Map Reduce task runs and moves the data from the hash table
file to the Hadoop distributed cache, which supplies these files to each mapper’s local disk. It
means that all the mappers can load this hash table file back into the memory and continue
with the join work.
Joining Tables:
You can combine the data of two or more tables in Hive by using HiveQL queries. For this,
we need to create tables and load them into Hive from HDFS.
Let us create two tables, Post_data_uk and Post_data_us, in HDFS through HQL.
1. To create the table Post_data_uk, type the following command on the terminal:
2. Type the following command on the terminal to create the table Post_data_us:
3. To load the two tables, Post_data_uk and Post_data_us, type the following commands:
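A sketch of steps 1–3 (the column list and the HDFS paths of the post_uk and post_us files are assumptions based on the surrounding text):

```sql
CREATE TABLE Post_data_uk (first_name STRING, last_name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

CREATE TABLE Post_data_us (first_name STRING, last_name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/hive/post_uk' INTO TABLE Post_data_uk;
LOAD DATA INPATH '/user/hive/post_us' INTO TABLE Post_data_us;
```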
4. To perform join operations on tables, we have specific HQL queries. To join the
Post_data_uk table with the Post_data_us table on the basis of the first_name column, type
the following command:
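A sketch of the join command:

```sql
SELECT uk.*, us.*
FROM Post_data_uk uk JOIN Post_data_us us
ON (uk.first_name = us.first_name);
```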
5. Type the following command to select five records after performing the left outer join
operation on the two tables, namely, Post_data_uk and Post_data_us:
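A sketch of the command:

```sql
SELECT uk.*, us.*
FROM Post_data_uk uk LEFT OUTER JOIN Post_data_us us
ON (uk.first_name = us.first_name)
LIMIT 5;
```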
6. To perform the right outer join on the tables, type the following command on the terminal:
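A sketch of the command:

```sql
SELECT uk.*, us.*
FROM Post_data_uk uk RIGHT OUTER JOIN Post_data_us us
ON (uk.first_name = us.first_name);
```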
7. To perform a full outer join operation, type the following command on the terminal:
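A sketch of the command:

```sql
SELECT uk.*, us.*
FROM Post_data_uk uk FULL OUTER JOIN Post_data_us us
ON (uk.first_name = us.first_name);
```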
You can also create partitions on the records stored in HDFS tables through HQL queries. For
this, let’s create a table named Post_data with partitions.
8. Type the following command to create the Post_data table with partitions:
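A sketch of the command (the data columns are assumptions; country is the partition column named below):

```sql
CREATE TABLE Post_data (first_name STRING, last_name STRING, city STRING)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```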
The preceding command creates a table that would be partitioned on the basis of a string,
country. To load data from the post_uk file in the first partition of the Post_data table, type:
To load data from the post_us file in the second partition of the Post_data table, type:
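Sketches of the two load commands (the HDFS paths and partition values are assumptions):

```sql
LOAD DATA INPATH '/user/hive/post_uk' INTO TABLE Post_data PARTITION (country = 'UK');
LOAD DATA INPATH '/user/hive/post_us' INTO TABLE Post_data PARTITION (country = 'US');
```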
The post_data file now contains the data given in both the post_uk and post_us datasets.
Bucketing:
Bucketing is another method for dividing data, but it differs from partitioning. In
partitioning, a separate partition is created for each unique value of the partition column.
Sometimes, however, creating partitions is a problem: the column may have so many unique
values that a large number of small partitions would be needed. With bucketing, you can
instead limit the data to a fixed number of buckets of your choice and distribute the data
across those buckets.
In Hive, partitions are basically directories, whereas a bucket is a file. In Hive, bucketing is
not enabled automatically; the bucketing property needs to be set before using it. The
property hive.enforce.bucketing should be set to “true” (set hive.enforce.bucketing=true;)
to use bucketing. Let’s perform the following steps to insert data into a bucketed table:
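A sketch of the steps (the table name Post_bucket, the bucketing column, and the bucket count are assumptions; Post_data is the partitioned table created earlier):

```sql
set hive.enforce.bucketing=true;

CREATE TABLE Post_bucket (first_name STRING, last_name STRING, city STRING)
CLUSTERED BY (first_name) INTO 4 BUCKETS;

INSERT OVERWRITE TABLE Post_bucket SELECT first_name, last_name, city FROM Post_data;
```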