
UNIT-VI

HIVE

 A SQL-like data warehouse infrastructure.


 Hive is a data warehousing package built on top of Hadoop.
 Hive was created to make it possible for analysts with strong SQL skills to run queries on the huge
volumes of data that Facebook stored in HDFS.
 Today, Hive is a successful Apache project used by many organizations as a general-purpose,
scalable data processing platform.
 Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for
execution on a Hadoop cluster. Hive organizes data into tables, which provide a means for
attaching structure to data stored in HDFS. Metadata— such as table schemas—is stored in a
database called the metastore.
 You interact with Hive by issuing queries in a SQL-like language called HiveQL.
 Hive also makes possible the concept known as enterprise data warehouse (EDW) augmentation,
a leading use case for Apache Hadoop, where data warehouses are set up as RDBMSs built
specifically for data analysis and reporting.
 Closely associated with RDBMS/EDW technology is extract, transform, and load (ETL)
technology.
 ETL: In many use cases, data cannot be immediately loaded into the relational database — it
must first be extracted from its native source, transformed into an appropriate format, and then
loaded into the RDBMS or EDW. For example, a company or an organization might extract
unstructured text data from an Internet forum, transform the data into a structured format that’s
both valuable and useful, and then load the structured data into its EDW.
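As a hedged sketch of that ETL pattern expressed in HiveQL (the table names, HDFS path, and the regexp_replace cleanup step are illustrative assumptions, not from the source):
-- Extract: expose raw forum text already copied into HDFS
create external table raw_forum_posts(post_text string)
stored as textfile
location '/data/raw/forum_posts';
-- Transform and load: strip markup and store the result in a structured table
create table forum_posts_clean stored as textfile
as select lower(regexp_replace(post_text, '<[^>]*>', '')) as post_text
from raw_forum_posts;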
Seeing How the Hive is Put Together (or) Architecture of HIVE:
The below figure shows the architecture of HIVE and explains its various components.
 As you examine the elements shown in Figure 1, you can see that Hive sits on top of the Hadoop
Distributed File System (HDFS) and MapReduce systems. In the case of MapReduce, Figure
below shows both the Hadoop 1 and Hadoop 2 components. With Hadoop 1, Hive queries are
converted to MapReduce code and executed using the MapReduce framework.
 With Hadoop 2, Hive queries can still be converted to MapReduce code and executed, now with
MapReduce v2 (MRv2) and the YARN infrastructure.
 There is a new framework under development called Apache Tez, which is designed to improve
Hive performance for batch-style queries and support smaller interactive (also known as real-time)
queries.


 Moving up the diagram, you find the Hive Driver, which compiles, optimizes, and executes the
HiveQL. The Hive Driver may choose to execute HiveQL statements and commands locally or
spawn a MapReduce job, depending on the task. The Hive Driver stores table metadata in the
metastore and its database.
 By default, Hive includes the Apache Derby RDBMS configured with the metastore in what’s
called embedded mode. Embedded mode means that the Hive Driver, the metastore, and Apache
Derby are all running in one Java Virtual Machine (JVM).
 This configuration is fine for learning purposes, but embedded mode can support only a single
Hive session, so it normally isn’t used in multi-user production environments.
 Two other modes exist — local and remote which can better support multiple Hive sessions in
production environments. Also, you can configure any RDBMS that’s compliant with the Java
Database Connectivity (JDBC) Application Programming Interface (API) suite. (Examples here
include MySQL and DB2.)
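As a rough sketch, a metastore backed by MySQL might be configured in hive-site.xml with properties like these (the host name, database name, and credentials are assumptions; the MySQL JDBC driver jar must also be on Hive's classpath):
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://metastorehost:3306/hivemetastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivepassword</value>
</property>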
 The key to application support is the Hive Thrift Server, which enables a rich set of clients to
access the Hive subsystem. The main point is that any JDBC-compliant application can access
Hive via the bundled JDBC driver. The same statement applies to clients compliant with Open
Database Connectivity (ODBC) — for example, unixODBC and the isql utility, which are
typically bundled with Linux, enable access to Hive from remote Linux clients.
 Additionally, if you use Microsoft Excel, you’ll be pleased to know that you can access Hive after
you install the Microsoft ODBC driver on your client system. Finally, if you need to access Hive
from programming languages other than Java (PHP or Python, for example), Apache Thrift is the
answer. Apache Thrift clients connect to Hive via the Hive Thrift Server, just as the JDBC and
ODBC clients do.
 Hive includes a Command Line Interface (CLI), where you can use a Linux terminal window to
issue queries and administrative commands directly to the Hive Driver.
 You can also use the Hive Web Interface (HWI) so that you can access your Hive-managed tables and data via your
favorite browser.
 There is another web browser technology known as Hue that provides a graphical user interface
(GUI) to Apache Hive. Some Hadoop users like to have a GUI at their disposal instead of just a
command line interface (CLI).
 Hue is also an open source project and you can find it at http://gethue.com.


Figure 1: The Apache Hive architecture.


Getting Started with HIVE (or) Installation of HIVE
1. Download Hive version 0.11.0 from http://hive.apache.org/releases.html.
2. Download Hadoop version 1.2.1 from http://hadoop.apache.org/releases.html.
3. Using the commands in below, place the releases in separate directories, and then uncompress
and untar them.
$ mkdir hadoop;
$ cp hadoop-1.2.1.tar.gz hadoop;
$ cd hadoop
$ gunzip hadoop-1.2.1.tar.gz
$ tar xvf *.tar
$ mkdir hive; cp hive-0.11.0.tar.gz hive;
$ cd hive
$ gunzip hive-0.11.0.tar.gz


$ tar xvf *.tar


4. Using the commands below set up your Apache Hive environment variables, including
HADOOP_HOME, JAVA_HOME, HIVE_HOME and PATH, in your shell profile script. Set
up these variables in .bashrc file.
export HADOOP_HOME=/home/user/Hive/hadoop/hadoop-1.2.1
export JAVA_HOME=/opt/jdk
export HIVE_HOME=/home/user/Hive/hive-0.11.0
export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$JAVA_HOME/bin:$PATH
5. Create the Hive configuration file. Because we are running Hive in stand-alone mode on a virtual
machine rather than on a real-life Apache Hadoop cluster, configure the system to use local storage
rather than HDFS by setting the hive.metastore.warehouse.dir parameter.
Setting Up the hive-site.xml File
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/home/biadmin/Hive/warehouse</value>
<description>location of default database for the warehouse </description>
</property>
</configuration>

If you already have a Hadoop cluster configured and running, you need to set the
hive.metastore.warehouse.dir configuration variable to the HDFS directory where you intend to store
your Hive warehouse.
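A minimal sketch of that setting for a cluster, assuming the commonly used default HDFS warehouse path (your directory may differ):
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>HDFS location of the default database for the warehouse</description>
</property>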
Examining the HIVE Clients (or) Configuration of HIVE Clients:
There are a number of client options for Hive. We consider three that are useful when the time comes to
analyze data using HiveQL. The first client is the Hive command-line interface (CLI), followed by a web
browser using the Hive Web Interface (HWI) Server, and, finally, the open source SQuirreL client using
the JDBC driver. Each of these client options can play a particular role as you work with Hive to analyze
data.

1. The HIVE CLI Client:


Fig: The Hive CLI mode


 The above figure shows the Hive components that are required when running the CLI on a Hadoop
cluster. In this chapter, you run Hive in local mode, which uses local storage, rather than
HDFS, for your data.
 The Hive CLI accepts HiveQL statements entered by the user at the hive prompt in a Linux terminal
and passes them to the Hive Driver for execution. The Hive Driver compiles, optimizes, and executes the
HiveQL queries using the metastore information and its database.
Using the Hive CLI to Create a Table
 The first command (see Step A) starts the Hive CLI using the $HIVE_HOME environment
variable. The –service cli command-line option directs the Hive system to start the command-line
interface. Next, in Step B, you tell the Hive CLI to print your current working database so that you
know where you are in the namespace. In Step C you use HiveQL’s data definition language
(DDL) to create your first database.
More specifically, you’re using DDL to tell the system to create a database called ourfirstdatabase
and then to make this database the default for subsequent HiveQL DDL commands using the USE
command in Step D. In Step E, you create your first table by giving the name our_first_table. The
last command, in Step F, carries out a directory listing of your chosen Hive warehouse directory
so that you can see that our_first_table has in fact been stored on disk.
 You set the hive.metastore.warehouse.dir variable to point to the local directory
/home/biadmin/Hive/warehouse in your Linux virtual machine rather than use the HDFS as you
would on a proper Hadoop cluster.
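The listing itself is not reproduced in the source; the following is a reconstructed sketch of such a session (the column names in our_first_table are illustrative assumptions):
(A) $ $HIVE_HOME/bin/hive --service cli
(B) hive> set hive.cli.print.current.db=true;
(C) hive (default)> create database ourfirstdatabase;
(D) hive (default)> use ourfirstdatabase;
(E) hive (ourfirstdatabase)> create table our_first_table(FirstName string, LastName string, EmployeeId int);
(F) hive (ourfirstdatabase)> quit;
    $ ls /home/biadmin/Hive/warehouse/ourfirstdatabase.db
    our_first_table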
 Using the HWI Server instead of the CLI can also be more secure. Careful consideration must be
made when using the CLI in production environments because the machine running the CLI must
have access to the entire Hadoop cluster. Therefore, system administrators typically put in place
tools like the secure shell (ssh) in order to provide controlled and secure access to the machine
running the CLI as well as to provide network encryption. However, when the HWI Server is
employed, a user can only access Hive data allowed by the HWI Server via his or her web browser.


2. The web browser as Hive client:


Using the Hive CLI requires only one command to start the Hive shell, but when you want to
access Hive using a web browser, you first need to start the HWI Server and then point your browser
to the port on which the server is listening.

Fig: Hive Web Interface Client Configuration

The following steps show you what you need to do before you can start the HWI Server:
 Using the property listed below, configure the $HIVE_HOME/conf/hive-site.xml file to ensure
that Hive can find and load the HWI's Java server pages.
Configuring the $HIVE_HOME/conf/hive-site.xml file
<property>
<name>hive.hwi.war.file</name>
<value>${HIVE_HOME}/lib/hive_hwi.war</value>
</property>
 The HWI Server requires Apache Ant libraries to run, so you need to download Ant from the
Apache site at http://ant.apache.org/bindownload.cgi.
Install Ant using the following commands:
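The commands themselves are not shown in the source; a plausible sequence, assuming the Ant version and directory names, would be:
$ mkdir ant; cp apache-ant-1.9.4-bin.tar.gz ant; cd ant
$ gunzip apache-ant-1.9.4-bin.tar.gz
$ tar xvf apache-ant-1.9.4-bin.tar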
 Set the $ANT_LIB environment variable and start the HWI Server by using the following
commands:
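For example (the Ant path is an assumption carried over from the previous step; $ANT_LIB must point at Ant's lib directory):
$ export ANT_LIB=/home/user/ant/apache-ant-1.9.4/lib
$ $HIVE_HOME/bin/hive --service hwi
You can then point your browser to http://localhost:9999/hwi, the HWI Server's default port and path.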
3. SQuirreL as Hive client with the JDBC Driver:
The last Hive client is the open source SQuirreL SQL client. You can download this universal SQL
client from the SourceForge website: http://sourceforge.net. It provides a user interface to Hive and
simplifies the tasks of querying large tables and analysing data with Apache Hive.


Figure below illustrates how the Hive architecture would work when using tools such as SQuirreL. In the
figure, you can see that the SQuirreL client uses the JDBC APIs to pass commands to the Hive Driver by
way of the Hive Thrift Server.

Fig: Using SquirrelSQL Client with Apache Hive.

Figure: Using the SQuirreL SQL client to run HiveQL commands.


 Start the Hive Thrift Server using the following commands.
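The commands are not reproduced in the source; with this Hive release, the Thrift service (HiveServer1) is typically started as follows and listens on port 10000 by default:
$ $HIVE_HOME/bin/hive --service hiveserver &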
 Download the latest SQuirreL distribution from the SourceForge site into a directory of your
choice.
 Uncompress the SQuirreL package using the gunzip command and expand the archive using the
tar command.
$ gunzip squirrel-sql-3.5.0-standard.tar.gz
$ tar xvf squirrel-sql-3.5.0-standard.tar


 Change to the new SQuirreL release directory and start the tool using the following command.
$ cd squirrel-sql-3.5.0-standard
$ ./squirrelsql.sh
Working with HIVE Data Types:
Numeric Types
TINYINT: 1-byte signed integer. Postfix: Y. Eg: 10Y
SMALLINT: 2-byte signed integer. Postfix: S. Eg: 10S
INT: 4-byte signed integer
BIGINT: 8-byte signed integer. Postfix: L. Eg: 10L
FLOAT: 4-byte single-precision floating point
DOUBLE: 8-byte double-precision floating point
DECIMAL: Precision of the DECIMAL type is fixed and limited to 38 digits

Date/Time Types
TIMESTAMP: "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal places of precision)
DATE: Describes a date in the form YYYY-MM-DD

Miscellaneous Types
BOOLEAN: Represents true or false
BINARY: Data type for storing an arbitrary number of bytes

String Types
STRING: Character string data type
VARCHAR: Created with a length specifier (1 to 65535)

Complex Types
ARRAY<data-type>: Represents a collection of elements accessed with the same name
MAP<primitive-type, data-type>: A collection of key-value pairs where the key is of a primitive type and the value can be of any type
STRUCT<col-name : data-type, ...>: A nested complex data structure
UNIONTYPE<data-type, data-type, ...>: A complex data type that can hold one of its possible data types at a time
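As a hedged illustration of how the complex types are declared and queried (the table name, columns, and delimiter characters below are assumptions, not from the source):
hive> create table emp_complex(name string, skills array<string>, marks map<string,int>,
      address struct<city:string, pin:int>)
      row format delimited fields terminated by ','
      collection items terminated by '#'
      map keys terminated by ':'
      lines terminated by '\n' stored as textfile;
hive> select name, skills[0], marks['hive'], address.city from emp_complex;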

File formats of Hive

The following list describes the file formats you can choose from as of Hive version 0.11.

 TEXTFILE: The default file format for Hive records. Alphanumeric characters from the
Unicode standard are used to store your data.


 SEQUENCEFILE: The format for binary files composed of key/value pairs. Sequence
files, which are used heavily by Hadoop, are often good choices for Hive table storage,
especially if you want to integrate Hive with other technologies in the Hadoop ecosystem.
 RCFILE: RCFILE stands for record columnar file. It stores records in a column-oriented
fashion rather than a row-oriented fashion, as the TEXTFILE format does.
 ORC: ORC stands for optimized row columnar. A format (new as of Hive 0.11) that has
significant optimizations to improve Hive reads and writes and the processing of tables.
For example, ORC files include optimizations for Hive complex types and new types such
as DECIMAL. Also, lightweight indexes are included with ORC files to improve performance.
 INPUTFORMAT, OUTPUTFORMAT: INPUTFORMAT names the Java class that reads data from the Hive
table, and OUTPUTFORMAT names the class that writes data to it. To see the default settings
for a table, simply execute a DESCRIBE EXTENDED tablename
HiveQL statement and you'll see the INPUTFORMAT and OUTPUTFORMAT classes
for your table.
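For instance, a hedged sketch of picking a file format at table-creation time and then checking the format classes Hive recorded (the table and column names are assumptions):
hive> create table txn_orc(txnno int, amount double) stored as orc;
hive> describe extended txn_orc;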
Defining table record formats
The Java technology that Hive uses to process records and map them to column data types in
Hive tables is called a SerDe, short for Serializer/Deserializer. The figure below helps us
understand how Hive keeps file formats separate from record formats.

Figure: How Hive Reads and Writes Records


When Hive is reading data from HDFS (or the local file system), a Java Deserializer formats
the data into a record that maps to table column data types. It is used at the time of a HiveQL
SELECT statement. When Hive is writing data, a Java Serializer accepts the record Hive uses and
translates it such that the OUTPUTFORMAT class can write it to HDFS (or the local file system).

It is used at the time of a HiveQL CREATE TABLE AS SELECT (CTAS) statement. So the INPUTFORMAT,
OUTPUTFORMAT, and SerDe objects allow Hive to separate the table record format from
the table file format.
Hive bundles a number of SerDes for us. We can also develop our own SerDe if we have
a more unusual data type that we want to manage with a Hive table. Some of the bundled SerDes
are described below.
LazySimpleSerDe: The default SerDe, used with the TEXTFILE format.
ColumnarSerDe: Used with the RCFILE format.
RegexSerDe: RegexSerDe can form a powerful approach for building structured data in Hive tables
from unstructured blogs, semi-structured log files, e-mails, tweets, and other data from social
media. Regular expressions allow us to extract meaningful information.
HBaseSerDe: Included with Hive to enable it to integrate with HBase.
JSONSerDe: A third-party SerDe for reading and writing JSON data records with Hive.
AvroSerDe: Included with Hive so that you can read and write Avro data in Hive tables.
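As a hedged sketch of attaching a SerDe at table-creation time, here RegexSerDe parsing a made-up two-field log line (the jar path, regular expression, and column names are assumptions and depend on your Hive distribution):
hive> add jar /home/user/Hive/hive-0.11.0/lib/hive-contrib-0.11.0.jar;
hive> create table weblogs(host string, request string)
      row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      with serdeproperties ("input.regex" = "([^ ]*) (.*)")
      stored as textfile;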
CREATING AND MANAGING DATABASES AND TABLES:
 Creating, Dropping, and Altering Databases in HIVE

The commands below illustrate creating, altering, and dropping databases.


1. user@user:~$ hive
2. hive> create database csedb;
OK
Time taken: 1.14 seconds
3. hive> show databases;
OK
csedb
default
Time taken: 0.339 seconds, Fetched: 2 row(s)
4. hive> use csedb;
OK
Time taken: 0.013 seconds
5. hive> alter database csedb set dbproperties('creator'='SDR','created_for'='Learning Hive
DDL');
OK


Time taken: 0.115 seconds


6. hive> drop database csedb cascade;
OK
Time taken: 0.688 seconds
hive> show databases;
OK
default
Time taken: 0.014 seconds, Fetched: 1 row(s)

 In Line 1 we start the Hive CLI.

 In Lines 2 and 3 we create a database named csedb and list the databases to confirm it exists.
 In Line 4 we make csedb the current database with the USE command.
 In Line 5 we alter the database to include two new metadata items: creator and
created_for. As you can imagine, including custom metadata with your database can be quite
useful for documentation purposes and coordination within your working group. You can view this
metadata with the DESCRIBE DATABASE EXTENDED csedb command.
 In Line 6 you're dropping the entire database — removing it from the server, in other words —
with the DROP command and CASCADE keyword. The CASCADE keyword drops any tables in the
database before dropping the database itself; without it, you couldn't drop a database that still
contains tables. You can use the DROP TABLE command to delete individual tables.

When you drop a managed (internal) table from the Hive metastore, Hive removes both the table data
and its metadata. When you drop an external table, Hive removes only the metadata; the underlying
data files are left in place.
SHOW
It is used to list the available databases.
hive> SHOW DATABASES;
default
csedb
Create table Example:
hive> create table customer(cno string, cname string,age int,profession string) row format delimited
fields terminated by ',' lines terminated by '\n' stored as textfile;
OK
Time taken: 0.686 seconds


Load data Example:


hive> load data local inpath '/home/user/cust.txt' into table customer;
OK
Time taken: 1.43 seconds
hive> select * from customer;
OK
101 Ria 22 DR
102 Safoora 21 TL
103 Aaliyah 22 HR
104 Ayaan 24 TL
105 Amaan 23 DR
106 xYZ 22 TL
Time taken: 0.4 seconds, Fetched: 6 row(s)
Create table and load data:
hive> create table if not exists txnrecords(txnno int,txndate string,cid int,amount double,category
string,product string,city string,state string,spendby string) row format delimited fields terminated by
','lines terminated by '\n' stored as textfile;
OK
Time taken: 0.104 seconds
If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already exists.
After the table has been created successfully, load data into it:
hive> load data local inpath '/home/user/txnrecords.txt' into table txnrecords;
OK
Time taken: 0.249 seconds
hive> describe txnrecords;
OK
txnno int
txndate string
cid int
amount double
category string
product string
city string
state string
spendby string


Time taken: 0.062 seconds, Fetched: 9 row(s)


hive> select * from txnrecords;
OK
1 12-04-2019 101 600.0 kids car nrt ap debit
2 24-08-2019 102 400.0 kids actionfigure gnt ap debit
3 12-04-2019 101 500.0 kids shirt nrt ap debit
….
Create table external
hive> create external table exmp_customer(cno string, cname string,age int,profession string) row
format delimited fields terminated by ',' lines terminated by '\n' stored as textfile location
'hdfs://localhost:9000/user/hive/warehouse';
OK
Time taken: 0.084 seconds
Insert Example:
1.hive> insert into table exmp_customer select cno,cname,age,profession from customer;
OK
Time taken: 1.659 seconds
hive> select * from exmp_customer;
OK
101 Ria 22 DR
102 Safoora 21 TL
103 Aaliyah 22 HR
Time taken: 0.073 seconds, Fetched: 6 row(s)
2.hive> insert overwrite table exmp_customer select cno,cname,age,profession from customer;
OK
Time taken: 1.5 seconds
hive> select * from exmp_customer;
OK
101 Ria 22 DR
102 Safoora 21 TL
103 Aaliyah 22 HR
104 Ayaan 24 TL
Time taken: 0.043 seconds, Fetched: 4 row(s)
Creating Table as select
hive> create table newtxnrecords as select * from txnrecords;


hive> select * from newtxnrecords;


OK
1 12-04-2019 101 600.0 kids car nrt ap debit
2 24-08-2019 102 400.0 kids actionfigure gnt ap debit
3 12-04-2019 101 500.0 kids shirt nrt ap debit
4 24-07-2019 102 700.0 men shoe gnt ap debit
5 24-08-2019 102 400.0 kids spiderman hyd telangana debit
Alter Table Statement
It is used to alter a table in Hive.
Syntax
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Rename To… Statement
The following query renames the table from emp to employee.
hive> ALTER TABLE emp RENAME TO employee;
Eg:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
The following query adds a column named age to the employee table.

hive> ALTER TABLE employee ADD COLUMNS (age int);

The following query deletes all the existing columns from the employee table and replaces them
with empid and ename columns:

hive> ALTER TABLE employee REPLACE COLUMNS (empid INT, ename STRING);

PARTITIONING AND BUCKETING:


Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values
of partitioned columns such as date, city, and department. Using partition, it is easy to query a portion of
the data.
Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used
for more efficient querying. Bucketing works based on the value of a hash function applied to a column of
the table.


For example, a table named Tab1 contains employee data such as id, name, dept, and doj (date of joining).
Suppose you need to retrieve the details of all employees who joined in 2012. A query searches the whole
table for the required information. However, if you partition the employee data with the year and store it
in a separate file, it reduces the query processing time. The following example shows how to partition a
file and its data:
The following file contains employeedata table.
/tab1/employeedata/file1

id, name, dept, doj


1, smith, TP, 2012
2, john, HR, 2012
3, tiger, Admin, 2013

The above data is partitioned into two files using year.


/tab1/employeedata/2012/file2

1, gopal, TP, 2012


2, kiran, HR, 2012

/tab1/employeedata/2013/file3

3, kaleel,Admin, 2013
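In Hive itself, the per-year directories shown above are what you get from a table partitioned by year; a hedged sketch (the input file paths and column types are assumptions):

hive> create table employee(id int, name string, dept string)
      partitioned by (year string)
      row format delimited fields terminated by ',' stored as textfile;
hive> load data local inpath '/home/user/emp2012.txt' into table employee partition (year='2012');
hive> load data local inpath '/home/user/emp2013.txt' into table employee partition (year='2013');
hive> select * from employee where year='2012';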

We can add partitions to a table by altering the table.

hive> ALTER TABLE employee


> ADD PARTITION (year='2013')
> location '/2013/part2013';

Renaming Partition

hive> ALTER TABLE employee PARTITION (year='2012')
> RENAME TO PARTITION (year='2013');

Dropping Partition:

hive> ALTER TABLE employee DROP IF EXISTS
> PARTITION (year='2013');

Example:

hive>set hive.exec.dynamic.partition.mode=nonstrict;
hive> create table txnbycatg(txnno int,txndate string,cid int,amount double,product string,city
string,state string,spendby string)partitioned by(category string)clustered by (state)into 4 buckets row
format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;


OK
Time taken: 0.056 seconds
hive> from txnrecords txn insert overwrite table txnbycatg partition(category) select
txn.txnno,txn.txndate,txn.cid,txn.amount,txn.product,txn.city,txn.state,txn.spendby,txn.category;
hive> select * from txnbycatg;
OK
1 12-04-2019 101 600.0 car nrt ap debit kids
2 24-08-2019 102 400.0 actionfigure gnt ap debit kids
3 12-04-2019 101 500.0 shirt nrt ap debit kids
4 24-07-2019 102 700.0 shoe gnt ap debit men
7 12-04-2018 104 600.0 Tshirt nrt ap debit men
8 24-08-2017 105 400.0 Top gnt ap debit women
Querying and Analysing Data:
Aggregations:
hive> select count(*) from customer;
OK
8
hive> select count(distinct category) from txnrecords;
Stage-Stage-1: HDFS Read: 10172 HDFS Write: 4972 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3
Time taken: 2.38 seconds, Fetched: 1 row(s)
Group BY
Example:
+------+-------------+--------+-------------------+-------+
| ID   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 45000  | Proofreader       | PR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+
hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;
+-------+----------+
| Dept  | Count(*) |
+-------+----------+
| Admin | 1        |
| PR    | 2        |
| TP    | 3        |
+-------+----------+
Example2:
hive> select category,sum(amount) from txnrecords group by category;
OK
kids 2300.0
men 1300.0
women 400.0
Time taken: 1.23 seconds, Fetched: 3 row(s)
Joins:
JOIN is a clause that is used for combining specific fields from two tables by using values common
to each one; it combines and retrieves records from two or more tables in the database. A plain JOIN
in HiveQL behaves like an inner join in SQL. A JOIN condition is typically expressed using the primary
and foreign keys of the tables.
Examples:
hive> create table mailids(name string,email string) row format delimited fields terminated by ',' lines
terminated by '\n' stored as textfile;
OK
Time taken: 0.047 seconds
hive> load data local inpath '/home/user/mails.txt' into table mailids;
OK
Time taken: 0.543 seconds
hive> select * from mailids;
OK
ria [email protected]
safoora [email protected]
Ayaan [email protected]
amaan [email protected]
aaliyah [email protected]
xxx [email protected]
xyz [email protected]
rrr [email protected]
Time taken: 0.037 seconds, Fetched: 8 row(s)
hive> load data local inpath '/home/user/empl.txt' into table emp;


OK
Time taken: 0.145 seconds
hive> select * from emp;
OK
ria 22 30000 [email protected]
safoora 23 22000 [email protected]
Ayaan 24 34000 [email protected]
amaan 22 21000 [email protected]
aaliyah 22 18000 [email protected]
Time taken: 0.044 seconds, Fetched: 5 row(s)
hive> select e.name, e.age, e.sal,s.email from emp e join mailids s on e.name=s.name;
OK
ria 22 30000 [email protected]
safoora 23 22000 [email protected]
Ayaan 24 34000 [email protected]
amaan 22 21000 [email protected]
aaliyah 22 18000 [email protected]
Time taken: 15.667 seconds, Fetched: 5 row(s)
Left Outer Join:
The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches
in the right table.
Example:
hive> select e.name,e.age,e.sal,s.email from emp e left outer join mailids s on e.name=s.name;
OK
ria 22 30000 [email protected]
safoora 23 22000 [email protected]
Ayaan 24 34000 [email protected]
amaan 22 21000 [email protected]
aaliyah 22 18000 [email protected]
Time taken: 5.552 seconds, Fetched: 5 row(s)
Right outer Join:
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no
matches in the left table.
Example:
hive> select e.name,e.age,e.sal,s.email from emp e right outer join mailids s on e.name=s.name;


OK
ria 22 30000 [email protected]
safoora 23 22000 [email protected]
Ayaan 24 34000 [email protected]
amaan 22 21000 [email protected]
aaliyah 22 18000 [email protected]
null null null [email protected]
null null null [email protected]
Time taken: 5.552 seconds, Fetched: 5 row(s)
FULL OUTER JOIN:
The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables that
fulfil the JOIN condition.
Example:
hive> select e.name,e.age,e.sal,s.email from emp e full join mailids s on e.name=s.name;
OK
Ayaan 24 34000 [email protected]
aaliyah 22 18000 [email protected]
amaan 22 21000 [email protected]
ria 22 30000 [email protected]
NULL NULL NULL [email protected]
safoora 23 22000 [email protected]
NULL NULL NULL [email protected]
NULL NULL NULL [email protected]
Time taken: 1.218 seconds, Fetched: 8 row(s)
Improving your Hive queries with indexes:
Creating an index is common practice with relational databases when we want to speed access to a
column or set of columns in your database. Without an index, the database system has to read all rows in
the table to find the data we have selected. Indexes become even more essential when the tables grow
extremely large. Hive supports index creation on tables.
Example:
hive> create index index_sals on table employee (salary) as 'COMPACT' with deferred rebuild;
OK
Time taken: 0.501 seconds
hive> alter index index_sals on employee rebuild;


OK
Time taken: 4.197 seconds
hive> show indexes on employee;
OK
index_sal employee salary csedb__employee_index_sal__ compact
index_sals employee salary csedb__employee_index_sals__ compact
Time taken: 0.073 seconds, Fetched: 2 row(s)
hive> describe csedb__employee_index_sals__;
OK
salary int
_bucketname string
_offsets array<bigint>
Time taken: 0.058 seconds, Fetched: 3 row(s)
hive> select salary,count(1) from employee where salary=25000 group by salary;
OK
25000 7
Time taken: 1.535 seconds, Fetched: 1 row(s)
hive> select salary,SIZE(`_offsets`) from csedb__employee_index_sals__ where salary=25000;
OK
25000 7
Time taken: 0.054 seconds, Fetched: 1 row(s)
hive> drop index index_sals on employee;
OK
Time taken: 1.611 seconds
Hive indexes:
Hive indexes are implemented as tables. This is why we need to first create the index table and
then build it to populate the table. Therefore, we can use indexes in at least two ways:
✓ Count on the system to automatically use indexes that you create.

✓ Rewrite some queries to leverage the new index table


Windowing in HiveQL
The concept of windowing, introduced in the SQL:2003 standard, allows the SQL programmer to
create a frame from the data against which aggregate and other window functions can operate. HiveQL
now supports windowing per the SQL standard.
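A hedged example over the txnrecords table created earlier, computing a per-category running total and rank (the window definitions are illustrative; the syntax assumes Hive 0.11 or later):
hive> select txnno, category, amount,
      sum(amount) over (partition by category order by txndate) as running_total,
      rank() over (partition by category order by amount desc) as amount_rank
      from txnrecords;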
Creating Views


You can create a view at the time of executing a SELECT statement.


hive> create view view_name as SELECT ...;
Examples:
hive> create view txnr as select * from txnrecords where category='kids';
OK
Time taken: 0.081 seconds
hive> select * from txnr;
OK
1 12-04-2019 101 600.0 kids car nrt ap debit
2 24-08-2019 102 400.0 kids actionfigure gnt ap debit
3 12-04-2019 101 500.0 kids shirt nrt ap debit
5 24-08-2019 102 400.0 kids spiderman hyd telangana debit
6 24-08-2019 103 400.0 kids spiderman medak telangana debit
Time taken: 0.052 seconds, Fetched: 5 row(s)
hive> create view txne as select * from txnrecords where category='men';
OK
Time taken: 0.057 seconds
hive> select * from txne;
OK
4 24-07-2019 102 700.0 men shoe gnt ap debit
7 12-04-2018 104 600.0 men Tshirt nrt ap debit
Time taken: 0.052 seconds, Fetched: 2 row(s)
hive> create view emp_100 as select * from emp where sal>3000;
Drop View
hive> drop view ViewName;
hive> drop view txne;
hive> drop view emp_100;

Other key HiveQL features:


i. Framework
Apache Hive is built on top of Hadoop and the Hadoop Distributed File System (HDFS).
ii. Large datasets
It helps to query large datasets residing in distributed storage.
iii. Warehouse
Also, we can say Hive is a distributed data warehouse.
iv. Language
Queries data using a SQL-like language called HiveQL (HQL).
v. Declarative language
HiveQL is a declarative language like SQL.
vi. Table structure
Table structures are similar to tables in a relational database.
vii. Multi-user
Multiple users can simultaneously query the data using HiveQL.
viii. Data analysis
To perform more detailed data analysis, Hive allows writing custom MapReduce programs.
ix. ETL support
It is possible to perform extract/transform/load (ETL) operations easily.
x. Data formats
Hive imposes structure on a variety of data formats.
xi. Storage
Hive allows access to files stored in HDFS as well as in other data storage systems such as Apache
HBase.
xii. Format conversion
Converting data from one file format to another from within Hive is simple and possible.
Limitations of Hive
i. OLTP processing issues
Hive is not designed for online transaction processing (OLTP). However, it can be used for online
analytical processing (OLAP).
ii. No updates
It does not support row-level updates and deletes; however, it does support overwriting or appending data.
iii. Subqueries
Subqueries are only partially supported; in older Hive versions they are allowed only in the FROM clause.

