HIVE

BDa Sem

Uploaded by gaddamlokesh20

HIVE

Is Hadoop a Solution?
Hive Applications
What is Hive?
• Hive is a data warehouse system used to analyze structured
data. It is built on top of Hadoop.
• It was developed by Facebook.
• Hive provides the functionality of reading, writing, and managing
large datasets residing in distributed storage.
• It runs SQL-like queries called HQL (Hive Query Language), which are
internally converted to MapReduce jobs.
• Using Hive, we can skip writing complex MapReduce programs.
• Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).
Data Definition Language (DDL) Commands:
• CREATE TABLE: Creates a new table in Hive.
• DROP TABLE: Deletes a table from Hive.
• ALTER TABLE: Modifies the structure of an existing table.
• CREATE DATABASE: Creates a new database in Hive.
• DROP DATABASE: Deletes a database from Hive.
• DESCRIBE: Shows the structure of a table.
• SHOW TABLES: Lists the tables in the current database.
• SHOW DATABASES: Lists the databases in Hive.
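A minimal session sketch of the DDL commands listed above (the database and table names are illustrative, not from the slides):

```sql
-- Create a database and a table, inspect them, then clean up.
CREATE DATABASE IF NOT EXISTS college;
USE college;

CREATE TABLE IF NOT EXISTS student (
  id   INT,
  name STRING
);

SHOW DATABASES;        -- lists all databases, including college
SHOW TABLES;           -- lists tables in the current database
DESCRIBE student;      -- shows the column names and types

ALTER TABLE student ADD COLUMNS (age INT);

DROP TABLE student;
DROP DATABASE college;
```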
Data Manipulation Language (DML) Commands:
• INSERT INTO: Inserts data into a table.
• SELECT: Retrieves data from one or more tables.
• UPDATE: Modifies existing data in a table (limited support in Hive,
often achieved using INSERT INTO or INSERT OVERWRITE).
• DELETE: Deletes data from a table.
• LOAD DATA INPATH: Loads data into a table from a specified path in HDFS.
• EXPORT TABLE: Exports data from a table to HDFS.
• IMPORT TABLE: Imports data from HDFS into a table.
• CREATE INDEX: Creates an index on a table.
• DROP INDEX: Deletes an index from a table.
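A short sketch of the core DML commands in use (the path, values, and column names are illustrative):

```sql
-- Load a file already sitting in HDFS into the employee table.
LOAD DATA INPATH '/data/employee.csv' INTO TABLE employee;

-- Append a single row.
INSERT INTO employee VALUES (101, 'Ravi', 45000.0);

-- Query the data.
SELECT * FROM employee WHERE Salary > 40000;

-- Hive's common substitute for UPDATE: rewrite the table contents.
INSERT OVERWRITE TABLE employee
SELECT Id, Name, Salary * 1.1 FROM employee;
```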
Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL)
that are implicitly transformed to
MapReduce.
• It is capable of analyzing large
datasets stored in HDFS.
• It allows different storage types such
as plain text, csv.
• It uses indexing to accelerate queries.
• It can operate on compressed data
stored in the Hadoop ecosystem.
• It supports user-defined functions
(UDFs), through which users can plug in
their own functionality.
Differences between Hive and Pig
Hive Pig
Hive is commonly used by Data Analysts. Pig is commonly used by programmers.

It follows SQL-like queries. It follows the data-flow language.

It can handle structured data. It can handle semi-structured data.

It works on server-side of HDFS cluster. It works on client-side of HDFS cluster.

Hive is slower than Pig. Pig is comparatively faster than Hive.


Hive Architecture
• Hive Client:
Hive allows writing applications in various languages, including Java,
Python, and C++.
• Thrift Server - It is a cross-language service provider platform.
• JDBC Driver - It is used to establish a connection between hive and
Java applications. The JDBC Driver is present in the class
org.apache.hadoop.hive.jdbc.HiveDriver.
• ODBC Driver - It allows the applications that support the ODBC
protocol to connect to Hive.
• Hive CLI - The Hive CLI (Command Line Interface) is a shell where we
can execute Hive queries and commands.
• Hive Web User Interface - The Hive Web UI is just an alternative of
Hive CLI.
• It provides a web-based GUI for executing Hive queries and
commands.
• Hive MetaStore - It is a central repository that stores all the structure
information of various tables and partitions in the warehouse.
• Hive Server - It is referred to as Apache Thrift Server. It accepts the
request from different clients and provides it to Hive Driver.
• Hive Driver - It receives queries from different sources like web UI,
CLI, Thrift, and JDBC/ODBC driver. It transfers the queries to the
compiler.
HIVE Data Types
Integer Types

Type Size Range


TINYINT   1-byte signed integer   -128 to 127
SMALLINT  2-byte signed integer   -32,768 to 32,767
INT       4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT    8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
Decimal Type
Type     Size     Description

FLOAT    4-byte   Single-precision floating point number
DOUBLE   8-byte   Double-precision floating point number

Sample values:

float_column     | double_column
-----------------+---------------
1234.56774902344 | 1234.5678
12345.6787109375 | 12345.6789
123456.7890625   | 123456.789
1234567.890625   | 1234567.89
Date/Time Types

• It supports traditional UNIX timestamp with optional nanosecond


precision.
• As string, it follows java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal place precision)
• The range of the Date type lies between 0000-01-01 and 9999-12-31.
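A small sketch using these types (the table and values are illustrative):

```sql
-- TIMESTAMP holds date and time; DATE holds only the day.
CREATE TABLE event_log (
  event_name STRING,
  event_ts   TIMESTAMP,
  event_day  DATE
);

INSERT INTO event_log VALUES
  ('login', '2023-01-15 10:30:00.123456789', '2023-01-15');

-- Casting a TIMESTAMP to DATE drops the time-of-day part.
SELECT CAST(event_ts AS DATE) FROM event_log;
```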
String Types
• STRING
• The string is a sequence of characters. Its values can be enclosed
within single quotes (') or double quotes (").
• Varchar
• The varchar is a variable-length type whose declared length lies
between 1 and 65535; the length specifies the maximum number of
characters allowed in the character string.
• CHAR
• The char is a fixed-length type whose maximum length is fixed at 255.
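A sketch contrasting the three string types (the table and values are illustrative):

```sql
-- VARCHAR(n) silently truncates values longer than n on insert;
-- CHAR(n) pads shorter values with trailing spaces up to n.
CREATE TABLE name_demo (
  full_name  STRING,
  short_name VARCHAR(10),
  code       CHAR(3)
);

INSERT INTO name_demo VALUES ('James Roy', 'James Roy', 'JR');

SELECT full_name, short_name, code FROM name_demo;
```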
Complex Type
Type    Description                                      Example

Struct  Similar to a C struct or an object; fields are   struct('James','Roy')
        accessed using the "dot" notation.
Map     Contains key-value tuples; the fields are        map('first','James','last','Roy')
        accessed using array notation.
Array   A collection of values of the same type,         array('James','Roy')
        indexable using zero-based integers.
Hive - Create Database
• In Hive, a database is considered a catalog of tables.
• A unique name is assigned to each table.
• Hive provides a default database named default.
• To start hive
• Type command hive in the terminal
• hive>
• To know the list of databases
• hive> show databases;
To create new db
• hive> create database demo;

hive> show databases;


Creating a database with a name that already exists raises an
error; add IF NOT EXISTS to avoid it:
• hive> create database if not exists demo;
Drop database
• hive> show databases;

hive> drop database demo;


Dropping a database that doesn't exist generates an error;
add IF EXISTS to suppress it:
hive> drop database if exists demo;

•In Hive, a database that contains tables cannot be dropped directly. In such a case, we
can drop the database either by dropping its tables first or by using the Cascade keyword with the command.

hive> drop database if exists demo cascade;

This command automatically drops the tables present in the database first.
Hive - Create Table

hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited fields terminated by ',' ;
• hive> describe demo.employee
If the table already exists, CREATE TABLE raises an error;
use IF NOT EXISTS to avoid it:
hive> create table if not exists demo.employee (Id int, Name string , Sal
ary float) row format delimited fields terminated by ',' ;
creating a new table by using the schema of an existing table.
hive> create table if not exists demo.copy_employee
like demo.employee;
External table
To store the file on the created directory use the following command at
the terminal
hdfs dfs -put hive/emp_details /HiveDirectory

hive> create external table emplist (Id int, Name string , Salary float)
row format delimited fields terminated by ',' location '/HiveDirectory';
Hive - Load Data
• Once the internal table has been created, the next step is to load the
data into it.
• load the data of the file into the database by using the following
command:
• LOAD DATA LOCAL INPATH '/home/Desktop/hive/emp_details' INTO
TABLE demo.employee;
Hive - Drop Table
• hive> drop table new_employee;
• Hive - Alter Table
• we can perform modifications in the existing table like changing the
table name, column name, comments, and table properties
• Rename a Table
• Alter table old_table_name rename to new_table_name;
• Adding column
• we can add one or more columns in an existing table by using the
following signature
• Alter table table_name add columns(column_name datatype);
• Change Column
• we can rename a column, change its type
• Alter table table_name change old_column_name new_column_nam
e datatype;
• Alter table employee_data change name first_name string;
Delete or Replace Column
• alter table employee_data replace columns( id string, first_name strin
g, age int);
Built-in Operators
1. Relational Operators
2. Arithmetic Operators
3. Logical Operators
4. Complex Operators
Relational Operators
• These operators are used to compare two operands.
• hive> SELECT * FROM employee WHERE Id=1205;
• hive> SELECT * FROM employee WHERE Salary>=40000;
• Arithmetic Operators
• These operators support various common arithmetic operations on
the operands.
• hive> SELECT 20+30 ADD FROM temp;
Logical Operators
• The operators are logical expressions. All of them return either TRUE
or FALSE.
• hive> SELECT * FROM employee WHERE Salary>40000 AND Dept='TP';
Complex Operators
These operators provide an expression to access the elements of Complex Types.

Operator  Operand                          Description

A[n]      A is an Array and n is an int    Returns the nth element in the array A.
                                           The first element has index 0.
M[key]    M is a Map<K, V> and key         Returns the value corresponding to the
          has type K                       key in the map.
S.x       S is a struct                    Returns the x field of S.
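A sketch combining the three complex operators in one query (the table and column names are illustrative):

```sql
-- Assume: contacts(phones ARRAY<STRING>,
--                  attrs MAP<STRING,STRING>,
--                  addr STRUCT<city:STRING, zip:STRING>)
SELECT
  phones[0],        -- A[n]: first element of the array
  attrs['email'],   -- M[key]: value stored under the key 'email'
  addr.city         -- S.x: the city field of the struct
FROM contacts;
```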
Example
SELECT emp_id, emp_name,
CASE
WHEN salary > 50000 THEN 'High'
WHEN salary > 40000 THEN 'Medium'
ELSE 'Low'
END AS salary_category
FROM employee;
Arrays
CREATE TABLE array_example (
id INT,
names ARRAY<STRING>
);

INSERT INTO array_example VALUES


(1, array('John', 'Alice', 'Bob')),
(2, array('Sarah', 'Michael'));

SELECT id, names[0] AS first_name FROM array_example;


structure
CREATE TABLE struct_example (
id INT,
employee STRUCT<name: STRING, age: INT, department: STRING>
);

INSERT INTO struct_example VALUES


(1, STRUCT('John', 30, 'HR')),
(2, STRUCT('Alice', 25, 'IT'));

SELECT id, employee.name AS name, employee.age AS age FROM


struct_example;
Map
CREATE TABLE map_example (
id INT,
scores MAP<STRING, INT>
);
INSERT INTO map_example VALUES
(1, map('math', 90, 'science', 85)),
(2, map('math', 95, 'science', 88));

SELECT id, scores['math'] AS math_score, scores['science'] AS


science_score FROM map_example;
Builtin functions
• round() function
• hive> SELECT round(2.6) from temp;
• floor() function
• hive> SELECT floor(2.6) from temp;
• ceil() function
• hive> SELECT ceil(2.6) from temp;
• SELECT SUM(salary) AS total_salary FROM employee;
• SELECT AVG(salary) AS avg_salary FROM employee;
• SELECT COUNT(*) AS total_employees FROM employee;
• SELECT MIN(salary) AS min_salary FROM employee;
• SELECT MAX(salary) AS max_salary FROM employee;
Partitioning in Hive
• The partitioning in Hive means dividing the table into some parts
based on the values of a particular column like date, course, city or
country.
• The advantage of partitioning is that since the data is stored in slices,
the query response time becomes faster.
• we have a data of 10 million students studying in an institute.
• Now, we have to fetch the students of a particular course.
• Traditional approach, we have to go through the entire data. This
leads to performance degradation.
• Partitioning in Hive and divide the data among the different datasets
based on particular columns.
The partitioning in Hive can be executed in two
ways
• Static partitioning
• Dynamic partitioning

To see the list of table directories, type the command in the terminal:

hadoop fs -ls /user/hive/warehouse/
Static Partitioning
• In static or manual partitioning, it is required to pass the values of
partitioned columns manually while loading the data into the table.
• Hence, the data file doesn't contain the partitioned columns.
• Example of Static Partitioning
• First, select the database in which we want to create a table.
• hive> use test;
• Create the table and provide the partitioned columns by using the
following command: -
hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited fields terminated by ',';
• hive> describe student;

hive> load data local inpath '/home/cloudera/Desktop/hive/student_details1' into table


student partition(course= "java");
• hive> load data local inpath '/home/Desktop/cloudera/student_details2'
into table student partition(course= "hadoop");
hive> select * from student;
hive> select * from student where course="java";
hive> select * from student where course= "hadoop";
• Insert into table student partition (city='Chennai') select id, name,
age, institute from student where city='Delhi';
To know partitioning on a table
hive> show partitions table_name;
Dynamic Partitioning
• In dynamic partitioning, the values of partitioned columns exist within
the table.
• So, it is not required to pass the values of partitioned columns
manually.

• Insert into table student partition (city) select id, name, age,
institute, city from student where city='Delhi';
First, select the database in which we want to
create a table.
• hive> use show;
• Enable the dynamic partition by using the following commands: -
• hive> set hive.exec.dynamic.partition=true;
• hive> set hive.exec.dynamic.partition.mode=nonstrict;
• Create a table to store the data.
• hive> create table stud_demo(id int, name string, age int, institute
string, course string) row format delimited fields terminated by ',';
load the data into the table.
• hive> load data local inpath '/home/cloudera/Desktop/hive/student_
details' into table stud_demo;
• Create a partition table by using the following command: -
• hive> create table student_part (id int, name string, age int, institute
string) partitioned by (course string) row format delimited
fields terminated by ',';
insert the data of table into the partition table.
hive> insert into student_part partition(course) select id, name, age, in
stitute, course from stud_demo;
hive> select * from student_part;
• hive> select * from student_part where course="java";
hive> select * from student_part where course="hadoop";
Bucketing in Hive
• The bucketing in Hive is a data organizing technique.
• It is similar to partitioning in Hive with an added functionality that it
divides large datasets into more manageable parts known as buckets.
• So, we can use bucketing in Hive when the implementation of
partitioning becomes difficult.
• However, we can also divide partitions further in buckets.
Working of Bucketing in Hive

•The concept of bucketing is based on the hashing technique.

•Here, the modulus of the hash of the column value and the number of required buckets is
calculated (say, hash(x) % 3).
•Based on the resulting value, the row is stored in the corresponding bucket.
Example of Bucketing in Hive

• hive> use showbucket;


• hive> create table emp_demo (Id int, Name string , Salary float)
row format delimited fields terminated by ',' ;
load the data into the table.
• hive> load data local inpath '/home/hive/emp_details' into table emp_
demo;
• Enable the bucketing by using the following command: -
• hive> set hive.enforce.bucketing = true;
• Create a bucketing table by using the following command: -
• hive> create table emp_bucket(Id int, Name string , Salary float)
clustered by (Id) into 3 buckets row format delimited fields terminated
by ',' ;
hive> insert overwrite table emp_bucket select *
from emp_demo;

• we can see that the data is divided into three buckets.


retrieve the data of bucket 0.
retrieve the data of bucket 1.
According to the hash function:
8 % 3 = 2
5 % 3 = 2
2 % 3 = 2
So, these rows are stored in bucket 2.
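To read a single bucket directly, Hive offers the TABLESAMPLE clause; a sketch against the emp_bucket table above (note that bucket numbering in TABLESAMPLE is 1-based, so 0-based bucket 2 is BUCKET 3):

```sql
-- Retrieve only the rows that hashed into the third bucket
-- (the bucket holding Ids 2, 5, 8 in the example above).
SELECT * FROM emp_bucket
TABLESAMPLE(BUCKET 3 OUT OF 3 ON Id);
```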
Views:
• The usage of a view in Hive is the same as that of a view in SQL.
• It is a standard RDBMS concept.
• We can run SELECT queries on a view just as on a table (Hive views
are logical and read-only, not materialized).
• Creating a View
• hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE salary>30000;

Dropping a View
hive> DROP VIEW emp_30000;
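Once created, a view is queried like any table; a sketch using the emp_30000 view above (the name column is assumed from the employee table):

```sql
-- The view's defining query runs each time the view is referenced.
SELECT name, salary
FROM emp_30000
WHERE salary < 50000;
```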
Creating an Index
• An Index is nothing but a pointer on a particular column of a table.
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
• If the column is modified, the changes are stored using an index
value.
• Dropping an Index
hive> DROP INDEX index_salary ON employee;
Illustrate the PIG string functions and date and time
functions with student database
• student_id, name, dob, address
• 1, A Gopi, 1995-07-15, 123 Hyderabad
• 2, B Ravi, 1998-03-20, 456 Banglore
• 3, C Krishna, 1996-11-10, 789 Chennai
-- Load student data
• student_data = LOAD 'student_info' USING PigStorage(',') AS
(student_id:int, name:chararray, dob:chararray, address:chararray);
• -- Convert names to uppercase
student_names_upper = FOREACH student_data GENERATE
UPPER(name) AS name_upper;

DUMP student_names_upper;

-- Output: (A GOPI), (B RAVI), (C KRISHNA)


-- Extract initials from names
• student_initials = FOREACH student_data GENERATE name,
SUBSTRING(name, 0, 1) AS initial;

• DUMP student_initials;

• -- Output: (A Gopi,A), (B Ravi,B), (C Krishna,C)


-- Concatenate name and address

• student_name_address = FOREACH student_data GENERATE name,


address, CONCAT(name, ', ', address) AS name_address;
• DUMP student_name_address;

• -- Output: (A Gopi, 123 Hyderabad, A Gopi, 123 Hyderabad),
(B Ravi, 456 Banglore, B Ravi, 456 Banglore),
(C Krishna, 789 Chennai, C Krishna, 789 Chennai)
-- Calculate age from date of birth

• student_age = FOREACH student_data GENERATE name, dob,
(int)((ToUnixTime(CurrentTime()) - ToUnixTime(ToDate(dob,
'yyyy-MM-dd'))) / (365 * 24 * 60 * 60)) AS age;

• DUMP student_age;

• -- Output: (A Gopi, 1995-07-15, 28), (B Ravi, 1998-03-20, 24), (C


Krishna, 1996-11-10, 25)
-- Extract year from date of birth

• student_dob_year = FOREACH student_data GENERATE name, dob,


GetYear(ToDate(dob, 'yyyy-MM-dd')) AS dob_year;
• DUMP student_dob_year;

• -- Output:
• (A Gopi, 1995-07-15, 1995), (B Ravi, 1998-03-20, 1998), (C Krishna,
1996-11-10, 1996)
-- Extract date from date of birth
• student_date = FOREACH student_data GENERATE name, dob, GetDay(ToDate(dob, 'yyyy-MM-
dd')) AS dob_day;

• DUMP student_date;

• -- Extract month from date of birth


• student_month = FOREACH student_data GENERATE name, dob, GetMonth(ToDate(dob, 'yyyy-
MM-dd')) AS dob_month;

• -- Display the contents of the student_month relation

• DUMP student_month;
