HIVE
What is Hive?
• Hive is a data warehouse system used to analyze structured
data. It is built on top of Hadoop.
• It was developed by Facebook.
• Hive provides the functionality of reading, writing, and managing
large datasets residing in distributed storage.
• It runs SQL-like queries, called HQL (Hive Query Language), which are
internally converted to MapReduce jobs.
• Using Hive, we can skip writing complex MapReduce programs.
• Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).
Data Definition Language (DDL) Commands:
• CREATE TABLE: Creates a new table in Hive.
• DROP TABLE: Deletes a table from Hive.
• ALTER TABLE: Modifies the structure of an existing table.
• CREATE DATABASE: Creates a new database in Hive.
• DROP DATABASE: Deletes a database from Hive.
• DESCRIBE: Shows the structure of a table.
• SHOW TABLES: Lists the tables in the current database.
• SHOW DATABASES: Lists the databases in Hive.
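A minimal sketch of these DDL commands in action (the database and table names are just examples):
hive> create database sample_db;
hive> show databases;
hive> create table sample_db.emp (Id int, Name string, Salary float);
hive> show tables in sample_db;
hive> describe sample_db.emp;
hive> alter table sample_db.emp add columns (Dept string);
hive> drop table sample_db.emp;
hive> drop database sample_db;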
Data Manipulation Language (DML) Commands:
• INSERT INTO: Inserts data into a table.
• SELECT: Retrieves data from one or more tables.
• UPDATE: Modifies existing data in a table (limited support in Hive,
often achieved using INSERT INTO or INSERT OVERWRITE).
• DELETE: Deletes data from a table.
• LOAD DATA INPATH: Loads data into a table from a specified path in HDFS.
• EXPORT TABLE: Exports data from a table to HDFS.
• IMPORT TABLE: Imports data from HDFS into a table.
• CREATE INDEX: Creates an index on a table.
• DROP INDEX: Deletes an index from a table.
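A corresponding DML sketch (paths and table names are illustrative; INSERT ... VALUES needs Hive 0.14 or later):
hive> load data inpath '/user/data/emp.csv' into table sample_db.emp;
hive> insert into table sample_db.emp values (1, 'John', 45000.0);
hive> select * from sample_db.emp where Salary >= 40000;
hive> insert overwrite table sample_db.emp select * from sample_db.emp where Salary >= 40000;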
Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL)
that are implicitly transformed to
MapReduce.
• It is capable of analyzing large
datasets stored in HDFS.
• It allows different storage types, such
as plain text and CSV files.
• It uses indexing to accelerate queries.
• It can operate on compressed data
stored in the Hadoop ecosystem.
• It supports user-defined functions
(UDFs), through which users can plug in
their own functionality, as sketched below.
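For instance, a UDF packaged in a jar can be registered and called from HQL roughly as follows (the jar path and class name here are hypothetical):
hive> ADD JAR /home/user/hive-udfs.jar;
hive> CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpperUDF';
hive> SELECT my_upper(Name) FROM employee;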
Differences between Hive and Pig
Hive | Pig
Hive is commonly used by data analysts. | Pig is commonly used by programmers.
FLOAT vs. DOUBLE precision (sample values):
float_column     | double_column
-----------------+--------------
1234.56774902344 | 1234.5678
12345.6787109375 | 12345.6789
123456.7890625   | 123456.789
1234567.890625   | 1234567.89
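A comparison like the one above can be produced by casting the same value to both types (the table and column names are assumed):
hive> SELECT CAST(val AS FLOAT) AS float_column, CAST(val AS DOUBLE) AS double_column FROM numbers;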
• In Hive, we cannot directly drop a database that still contains tables. In such a case, we can
drop the database either by dropping its tables first or by using the CASCADE keyword with the
DROP DATABASE command, which automatically drops the tables in the database first, as shown below.
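For example, assuming a database named demo that still contains tables:
hive> drop database demo;           -- fails because demo is not empty
hive> drop database demo cascade;   -- drops the tables first, then the database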
Hive - Create Table
hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited fields terminated by ',' ;
• hive> describe demo.employee
If the table might already exist, use IF NOT EXISTS:
hive> create table if not exists demo.employee (Id int, Name string , Salary float)
row format delimited fields terminated by ',' ;
Creating a new table by using the schema of an existing table:
hive> create table if not exists demo.copy_employee
like demo.employee;
External table
To store the file in the created directory, use the following command at
the terminal:
hdfs dfs -put hive/emp_details /HiveDirectory
hive> create external table emplist (Id int, Name string , Salary float)
row format delimited fields terminated by ',' location '/HiveDirectory';
Hive - Load Data
• Once the internal table has been created, the next step is to load the
data into it.
• Load the data from the file into the table by using the following
command:
• LOAD DATA LOCAL INPATH '/home/Desktop/hive/emp_details' INTO
TABLE demo.employee;
Hive - Drop Table
• hive> drop table new_employee;
Hive - Alter Table
• We can perform modifications on an existing table, such as changing the
table name, column names, comments, and table properties.
• Rename a Table
• Alter table old_table_name rename to new_table_name;
• Adding column
• We can add one or more columns to an existing table by using the
following syntax:
• Alter table table_name add columns(column_name datatype);
• Change Column
• We can rename a column or change its type:
• Alter table table_name change old_column_name new_column_name datatype;
• Alter table employee_data change name first_name string;
Delete or Replace Column
• alter table employee_data replace columns(id string, first_name string, age int);
Built-in Operators
1. Relational Operators
2. Arithmetic Operators
3. Logical Operators
4. Complex Operators
Relational Operators
• These operators are used to compare two operands.
• hive> SELECT * FROM employee WHERE Id=1205;
• hive> SELECT * FROM employee WHERE Salary>=40000;
Arithmetic Operators
• These operators support various common arithmetic operations on
the operands.
• hive> SELECT 20+30 ADD FROM temp;
Logical Operators
• These operators form logical expressions. All of them return either TRUE
or FALSE.
• hive> SELECT * FROM employee WHERE Salary>40000 AND Dept='TP';
Complex Operators
These operators provide an expression to access the elements of Complex Types.
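A minimal sketch, assuming a table with array, map, and struct columns (A[n], M[key], and S.x are the complex operators):
hive> create table emp_complex (name string, phones array<string>, skills map<string,int>, addr struct<city:string, pin:int>);
hive> SELECT phones[0], skills['java'], addr.city FROM emp_complex;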
• Insert into table student partition (city) select id, name, age,
institute, city from student where city='Delhi';
First, select the database in which we want to
create a table.
• hive> use show;
• Enable the dynamic partition by using the following commands: -
• hive> set hive.exec.dynamic.partition=true;
• hive> set hive.exec.dynamic.partition.mode=nonstrict;
• Create a table to store the data.
• hive> create table stud_demo(id int, name string, age int, institute
string, course string) row format delimited fields terminated by ',';
Load the data into the table:
• hive> load data local inpath '/home/cloudera/Desktop/hive/student_
details' into table stud_demo;
• Create a partition table by using the following command: -
• hive> create table student_part (id int, name string, age int, institute
string) partitioned by (course string) row format delimited
fields terminated by ',';
Insert the data of the table into the partition table:
hive> insert into student_part partition(course) select id, name, age, institute, course from stud_demo;
hive> select * from student_part;
• hive> select * from student_part where course="java";
• hive> select * from student_part where course="hadoop";
Bucketing in Hive
• The bucketing in Hive is a data organizing technique.
• It is similar to partitioning in Hive, with the added functionality that it
divides large datasets into more manageable parts known as buckets.
• So, we can use bucketing in Hive when partitioning becomes difficult to implement.
• We can also divide partitions further into buckets, as sketched below.
Working of Bucketing in Hive
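A minimal sketch of bucketing, reusing the stud_demo data loaded above (the bucket count of 4 is arbitrary; hive.enforce.bucketing is needed on older Hive versions):
hive> set hive.enforce.bucketing=true;
hive> create table student_bucket (id int, name string, age int, institute string, course string)
      clustered by (id) into 4 buckets
      row format delimited fields terminated by ',';
hive> insert overwrite table student_bucket select * from stud_demo;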
Dropping a View
hive> DROP VIEW emp_30000;
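For context, a view with this name is typically defined along the following lines (the exact definition here is an assumption):
hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE Salary > 30000;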
Creating an Index
• An index is nothing but a pointer to a particular column of a table.
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
• If the column is modified, the changes are stored using an index
value.
Dropping an Index
hive> DROP INDEX index_salary ON employee;
Illustrate the PIG string functions and date and time
functions with student database
• student_id, name, dob, address
• 1, A Gopi, 1995-07-15, 123 Hyderabad
• 2, B Ravi, 1998-03-20, 456 Bangalore
• 3, C Krishna, 1996-11-10, 789 Chennai
-- Load student data
• student_data = LOAD 'student_info' USING PigStorage(',') AS
(student_id:int, name:chararray, dob:chararray, address:chararray);
• -- Convert names to uppercase
student_names_upper = FOREACH student_data GENERATE
UPPER(name) AS name_upper;
DUMP student_names_upper;
• DUMP student_initials;
• DUMP student_age;    -- see the sketch at the end for how these relations can be defined
• -- Output:
• (A Gopi, 1995-07-15, 1995), (B Ravi, 1998-03-20, 1998), (C Krishna,
1996-11-10, 1996)
-- Extract the day of the month from date of birth
• student_date = FOREACH student_data GENERATE name, dob,
GetDay(ToDate(dob, 'yyyy-MM-dd')) AS dob_day;
• DUMP student_date;
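The student_initials and student_age relations dumped above are not defined in this excerpt; they could be produced along these lines (the exact function choices are assumptions):
-- First letter of each name as an initial
student_initials = FOREACH student_data GENERATE SUBSTRING(name, 0, 1) AS initial;
-- Birth year extracted from dob, matching the (name, dob, year) output shown above
student_age = FOREACH student_data GENERATE name, dob, GetYear(ToDate(dob, 'yyyy-MM-dd')) AS birth_year;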