UNIT-VI
HIVE
Moving up the diagram, you find the Hive Driver, which compiles, optimizes, and executes the
HiveQL. The Hive Driver may choose to execute HiveQL statements and commands locally or
spawn a MapReduce job, depending on the task. The Hive Driver stores table metadata in the
metastore and its database.
By default, Hive includes the Apache Derby RDBMS configured with the metastore in what’s
called embedded mode. Embedded mode means that the Hive Driver, the metastore, and Apache
Derby are all running in one Java Virtual Machine (JVM).
This configuration is fine for learning purposes, but embedded mode can support only a single
Hive session, so it normally isn’t used in multi-user production environments.
Two other modes exist, local and remote, which can better support multiple Hive sessions in
production environments. Also, you can configure the metastore to use any RDBMS that is
compliant with the Java Database Connectivity (JDBC) Application Programming Interface (API)
suite. (Examples here include MySQL and DB2.)
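A minimal hive-site.xml sketch for pointing the metastore at a MySQL database; the host name metastorehost, database name metastore, and the credentials shown are hypothetical, but the javax.jdo.option property names are the standard ones:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://metastorehost/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivepass</value>
</property>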
The key to application support is the Hive Thrift Server, which enables a rich set of clients to
access the Hive subsystem. The main point is that any JDBC-compliant application can access
Hive via the bundled JDBC driver. The same statement applies to clients compliant with Open
Database Connectivity (ODBC) — for example, unixODBC and the isql utility, which are
typically bundled with Linux, enable access to Hive from remote Linux clients.
Additionally, if you use Microsoft Excel, you’ll be pleased to know that you can access Hive after
you install the Microsoft ODBC driver on your client system. Finally, if you need to access Hive
from programming languages other than Java (PHP or Python, for example), Apache Thrift is the
answer. Apache Thrift clients connect to Hive via the Hive Thrift Server, just as the JDBC and
ODBC clients do.
Hive includes a Command Line Interface (CLI), where you can use a Linux terminal window to
issue queries and administrative commands directly to the Hive Driver.
You can also use a web interface to access your Hive-managed tables and data via your
favorite browser.
There is another web browser technology, known as Hue, that provides a graphical user interface
(GUI) to Apache Hive. Some Hadoop users like to have a GUI at their disposal instead of just a
command line interface (CLI).
Hue is also an open source project and you can find it at http://gethue.com.
If you already have a Hadoop cluster configured and running, you need to set the
hive.metastore.warehouse.dir configuration variable to the HDFS directory where you intend to store
your Hive warehouse.
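A minimal sketch of this setting in $HIVE_HOME/conf/hive-site.xml, assuming the commonly used HDFS path /user/hive/warehouse (adjust the value to your own directory):
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>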
Examining the HIVE Clients (or) Configuration of HIVE Clients:
There are a number of client options for Hive. We consider three that are useful when the time comes to
analyze data using HiveQL. The first client is the Hive command-line interface (CLI), followed by a web
browser using the Hive Web Interface (HWI) Server, and, finally, the open source SQuirreL client using
the JDBC driver. Each of these client options can play a particular role as you work with Hive to analyze
data.
The following steps show you what you need to do before you can start the HWI Server:
Configure the $HIVE_HOME/conf/hive-site.xml file, as listed below, to ensure that Hive can
find and load the HWI's Java server pages.
Configuring the $HIVE_HOME/conf/hive-site.xml file
<property>
<name>hive.hwi.war.file</name>
<value>${HIVE_HOME}/lib/hive_hwi.war</value>
</property>
The HWI Server requires Apache Ant libraries to run, so you need to download Ant from the
Apache site at http://ant.apache.org/bindownload.cgi.
Install Ant, set the $ANT_LIB environment variable, and start the HWI Server by using
commands along the lines of the sketch below.
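A minimal sketch, assuming Ant 1.9.4 unpacked under /opt; the version and paths are hypothetical, while hive --service hwi is the standard way to launch the HWI Server:
$ tar -xzf apache-ant-1.9.4-bin.tar.gz -C /opt
$ export ANT_LIB=/opt/apache-ant-1.9.4/lib
$ $HIVE_HOME/bin/hive --service hwi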
3. SQuirreL as Hive client with the JDBC Driver:
The last Hive client is the open source SQuirreL SQL client. You can download this universal SQL
client from the SourceForge website: http://sourceforge.net. It provides a user interface to Hive and
simplifies the tasks of querying large tables and analysing data with Apache Hive.
The figure below illustrates how the Hive architecture works when using tools such as SQuirreL. In the
figure, you can see that the SQuirreL client uses the JDBC APIs to pass commands to the Hive Driver by
way of the Hive Thrift Server.
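When you configure SQuirreL's connection, the JDBC settings typically look like this sketch; it assumes a HiveServer2 instance listening on its default port 10000 (older HiveServer1 setups instead use the org.apache.hadoop.hive.jdbc.HiveDriver class and jdbc:hive:// URLs):
Driver class: org.apache.hive.jdbc.HiveDriver
JDBC URL:     jdbc:hive2://localhost:10000/default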
Change to the new SQuirreL release directory and start the tool using the following command.
$ cd squirrel-sql-3.5.0-standard
$ ./squirrelsql.sh
Working with HIVE Data Types:
Category        | Data Type | Description
----------------|-----------|-------------------------------------------------------------
NUMERIC TYPES   | TINYINT   | 1-byte signed integer; postfix Y, e.g. 10Y
                | SMALLINT  | 2-byte signed integer; postfix S, e.g. 10S
                | INT       | 4-byte signed integer
                | BIGINT    | 8-byte signed integer; postfix L, e.g. 10L
                | FLOAT     | 4-byte single-precision floating point
                | DOUBLE    | 8-byte double-precision floating point
                | DECIMAL   | Precision of the DECIMAL type is fixed and limited to 38 digits
DATE/TIME TYPES | TIMESTAMP | "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision)
                | DATE      | Describes a date in the form YYYY-MM-DD
MISCELLANEOUS   | BOOLEAN   | Represents true or false
                | BINARY    | Data type for storing an arbitrary number of bytes
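A minimal sketch that exercises several of these types; the table name typedemo and its columns are hypothetical, and the decimal(10,2) precision/scale syntax assumes Hive 0.13 or later:
hive> create table typedemo(id int, qty smallint, price decimal(10,2), indate date, ts timestamp, active boolean) row format delimited fields terminated by ',' stored as textfile;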
The following list describes the file formats you can choose from as of Hive version 0.11.
TEXTFILE: The default file format for Hive records. Alphanumeric characters from the
Unicode standard are used to store your data.
SEQUENCEFILE: The format for binary files composed of key/value pairs. Sequence
files, which are used heavily by Hadoop, are often good choices for Hive table storage,
especially if you want to integrate Hive with other technologies in the Hadoop ecosystem.
RCFILE: RCFILE stands for record columnar file. It stores records in a column-oriented
fashion rather than in a row-oriented fashion like the TEXTFILE format approach.
ORC: ORC stands for optimized row columnar. A format (new as of Hive 0.11) that has
significant optimizations to improve Hive reads and writes and the processing of tables.
For example, ORC files include optimizations for Hive complex types and new types such
as DECIMAL. Also, lightweight indexes are included with ORC files to improve
performance.
INPUTFORMAT, OUTPUTFORMAT: INPUTFORMAT reads data from the Hive
table; OUTPUTFORMAT does the same thing for writing data to the Hive table. To see
the default settings for a table, simply execute a DESCRIBE EXTENDED tablename
HiveQL statement, and you'll see the INPUTFORMAT and OUTPUTFORMAT classes
for your table.
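For example, a minimal sketch of choosing a file format at table-creation time and then inspecting the format classes; the table name orcdemo is hypothetical, and the exact DESCRIBE output varies by Hive version:
hive> create table orcdemo(id int, amount double) stored as orc;
hive> describe extended orcdemo;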
Defining table record formats
The Java technology that Hive uses to process records and map them to column data types in
Hive tables is called SerDe, which is short for Serializer/Deserializer. Figure 7 helps us
understand how Hive keeps file formats separate from record formats.
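A minimal sketch of naming a SerDe explicitly when creating a table; LazySimpleSerDe is Hive's default SerDe for delimited text, and the table name serdedemo is hypothetical:
hive> create table serdedemo(name string, email string) row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' stored as textfile;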
When you drop a table, Hive removes the table's metadata from the Metastore. For a normal
(managed) table, whose data lives in the Hive warehouse, Hive deletes the table data as well; for an
external table, whose data lives outside the warehouse, only the metadata is removed and the data
files are left in place.
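A minimal sketch; the table name olddata is hypothetical, and the IF EXISTS clause avoids an error if the table is absent:
hive> drop table if exists olddata;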
SHOW
It is used to list the databases.
hive> SHOW DATABASES;
default
csedb
Create table Example:
hive> create table customer(cno string, cname string,age int,profession string) row format delimited
fields terminated by ',' lines terminated by '\n' stored as textfile;
OK
Time taken: 0.686 seconds
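Once the table exists, data is typically loaded from a delimited file, as in this sketch; the path /home/user/customers.txt is hypothetical:
hive> load data local inpath '/home/user/customers.txt' into table customer;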
The following query deletes all the columns from the employee table and replaces them
with the empid and ename columns:
hive> ALTER TABLE employee REPLACE COLUMNS (empid Int, ename String);
For example, a table named Tab1 contains employee data such as id, name, dept, and doj (date of joining).
Suppose you need to retrieve the details of all employees who joined in 2013. A query searches the whole
table for the required information. However, if you partition the employee data by the year and store it
in a separate file, it reduces the query processing time. The following example shows the layout before
and after partitioning.
The unpartitioned table keeps all employee records in a single file:
/tab1/employeedata/file1
After partitioning by year, the records of employees who joined in 2013 move into a directory of their
own, for example:
/tab1/employeedata/2013/file3
3, kaleel, Admin, 2013
Renaming a Partition
Dropping a Partition
Hedged sketches of both operations are shown below.
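Assuming a hypothetical employeedata table partitioned by year of joining (yoj), the commands might look like this:
hive> ALTER TABLE employeedata PARTITION (yoj=2012) RENAME TO PARTITION (yoj=2014);
hive> ALTER TABLE employeedata DROP PARTITION (yoj=2013);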
Example:
hive>set hive.exec.dynamic.partition.mode=nonstrict;
hive> create table txnbycatg(txnno int,txndate string,cid int,amount double,product string,city
string,state string,spendby string)partitioned by(category string)clustered by (state)into 4 buckets row
format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
OK
Time taken: 0.056 seconds
hive> from txnrecords txn insert overwrite table txnbycatg partition(category) select
txn.txnno,txn.txndate,txn.cid,txn.amount,txn.product,txn.city,txn.state,txn.spendby,txn.category;
hive> select * from txnbycatg;
OK
1 12-04-2019 101 600.0 car nrt ap debit kids
2 24-08-2019 102 400.0 actionfigure gnt ap debit kids
3 12-04-2019 101 500.0 shirt nrt ap debit kids
4 24-07-2019 102 700.0 shoe gnt ap debit men
7 12-04-2018 104 600.0 Tshirt nrt ap debit men
8 24-08-2017 105 400.0 Top gnt ap debit women
Querying and Analysing Data:
Aggregations:
hive> select count(*) from customer;
OK
8
hive> select count(distinct category) from txnrecords;
Stage-Stage-1: HDFS Read: 10172 HDFS Write: 4972 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3
Time taken: 2.38 seconds, Fetched: 1 row(s)
Group BY
Example:
+------+-------------+--------+-------------------+-------+
| ID   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 45000  | Proofreader       | PR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;
+-------+----------+
| Dept  | Count(*) |
+-------+----------+
| Admin | 1        |
| PR    | 2        |
| TP    | 3        |
+-------+----------+
Example2:
hive> select category,sum(amount) from txnrecords group by category;
OK
kids 2300.0
men 1300.0
women 400.0
Time taken: 1.23 seconds, Fetched: 3 row(s)
Joins:
JOIN is a clause that is used for combining specific fields from two tables by using values common
to each one. It is used to combine and retrieve records from two or more tables in the database. A
plain JOIN in HiveQL behaves like an inner JOIN in SQL: only rows that match in both tables are
returned. A JOIN condition is usually built using the primary keys and foreign keys of the tables.
Examples:
hive> create table mailids(name string,email string) row format delimited fields terminated by ',' lines
terminated by '\n' stored as textfile;
OK
Time taken: 0.047 seconds
hive> load data local inpath '/home/user/mails.txt' into table mailids;
OK
Time taken: 0.543 seconds
hive> select * from mailids;
OK
ria [email protected]
safoora [email protected]
Ayaan [email protected]
amaan [email protected]
aaliyah [email protected]
xxx [email protected]
xyz [email protected]
rrr [email protected]
Time taken: 0.037 seconds, Fetched: 8 row(s)
hive> load data local inpath '/home/user/empl.txt' into table emp;
OK
Time taken: 0.145 seconds
hive> select * from emp;
OK
ria 22 30000 [email protected]
safoora 23 22000 [email protected]
Ayaan 24 34000 [email protected]
amaan 22 21000 [email protected]
aaliyah 22 18000 [email protected]
Time taken: 0.044 seconds, Fetched: 5 row(s)
hive> select e.name, e.age, e.sal,s.email from emp e join mailids s on e.name=s.name;
OK
ria 22 30000 [email protected]
safoora 23 22000 [email protected]
Ayaan 24 34000 [email protected]
amaan 22 21000 [email protected]
aaliyah 22 18000 [email protected]
Time taken: 15.667 seconds, Fetched: 5 row(s)
Left Outer Join:
The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches
in the right table.
Example:
hive> select e.name,e.age,e.sal,s.email from emp e left outer join mailids s on e.name=s.name;
OK
ria 22 30000 [email protected]
safoora 23 22000 [email protected]
Ayaan 24 34000 [email protected]
amaan 22 21000 [email protected]
aaliyah 22 18000 [email protected]
Time taken: 5.552 seconds, Fetched: 5 row(s)
Right Outer Join:
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no
matches in the left table.
Example:
hive> select e.name,e.age,e.sal,s.email from emp e right outer join mailids s on e.name=s.name;
OK
ria 22 30000 [email protected]
safoora 23 22000 [email protected]
Ayaan 24 34000 [email protected]
amaan 22 21000 [email protected]
aaliyah 22 18000 [email protected]
NULL NULL NULL [email protected]
NULL NULL NULL [email protected]
NULL NULL NULL [email protected]
Time taken: 5.552 seconds, Fetched: 8 row(s)
FULL OUTER JOIN:
The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables that
fulfil the JOIN condition.
Example:
hive> select e.name,e.age,e.sal,s.email from emp e full join mailids s on e.name=s.name;
OK
Ayaan 24 34000 [email protected]
aaliyah 22 18000 [email protected]
amaan 22 21000 [email protected]
ria 22 30000 [email protected]
NULL NULL NULL [email protected]
safoora 23 22000 [email protected]
NULL NULL NULL [email protected]
NULL NULL NULL [email protected]
Time taken: 1.218 seconds, Fetched: 8 row(s)
Improving your Hive queries with indexes:
Creating an index is common practice with relational databases when you want to speed up access to a
column or set of columns in your database. Without an index, the database system has to read all rows in
the table to find the data you have selected. Indexes become even more essential when tables grow
extremely large. Hive supports index creation on tables.
Example:
hive> create index index_sals on table employee (salary) as 'COMPACT' with deferred rebuild;
OK
Time taken: 0.501 seconds
hive> alter index index_sals on employee rebuild;
OK
Time taken: 4.197 seconds
hive> show indexes on employee;
OK
index_sal employee salary csedb__employee_index_sal__ compact
index_sals employee salary csedb__employee_index_sals__ compact
Time taken: 0.073 seconds, Fetched: 2 row(s)
hive> describe csedb__employee_index_sals__;
OK
salary int
_bucketname string
_offsets array<bigint>
Time taken: 0.058 seconds, Fetched: 3 row(s)
hive> select salary,count(1) from employee where salary=25000 group by salary;
OK
25000 7
Time taken: 1.535 seconds, Fetched: 1 row(s)
hive> select salary,SIZE(`_offsets`) from csedb__employee_index_sals__ where salary=25000;
OK
25000 7
Time taken: 0.054 seconds, Fetched: 1 row(s)
hive> drop index index_sals on employee;
OK
Time taken: 1.611 seconds
Hive indexes:
Hive indexes are implemented as tables. This is why we need to first create the index table and
then build it to populate the table. Therefore, we can use indexes in at least two ways:
✓ Count on the system to automatically use indexes that you create.
✓ Rewrite some queries to use the index table directly, as in the SIZE(`_offsets`) query shown above.
Features of Hive:
iv. Language
Queries data using a SQL-like language called HiveQL (HQL).
v. Declarative language
HiveQL is a declarative language like SQL.
vi. Table structure
Table structures are similar to tables in a relational database.
vii. Multi-user
Multiple users can simultaneously query the data using HiveQL.
viii. Data Analysis
However, to perform more detailed data analysis, Hive allows writing custom MapReduce
programs.
ix. ETL support
Also, it is possible to extract/transform/load (ETL) data easily.
x. Data Formats
Moreover, Hive imposes structure on a variety of data formats.
xi. Storage
Hive allows access to files stored in HDFS, as well as to similar data storage systems such as Apache
HBase.
xii. Format conversion
Moreover, it allows converting data between a variety of formats from within Hive, and this is very
simple to do.
Limitations of Hive
i. OLTP Processing issues
Hive is not designed for Online Transaction Processing (OLTP), although we can use it for Online
Analytical Processing (OLAP).
ii. No Updates
It does not support updates and deletes; however, it does support overwriting or appending data
(a sketch follows this list).
iii. Subqueries
Basically, in Hive, subqueries are not supported.
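A minimal sketch of overwriting a table's contents in place of an update; the staging table customer_staging is hypothetical:
hive> insert overwrite table customer select * from customer_staging;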