Hive Part 2

HIVE
• Hive is a data warehouse system used to analyse structured data.
• It is built on top of Hadoop.
• It was developed by Facebook.
• It provides functionality for reading, writing, and managing large datasets
residing in distributed storage.
• It runs SQL-like queries written in HQL (Hive Query Language), which are
internally converted to MapReduce jobs.
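As a rough illustration of the last point, the sketch below shows an HQL aggregate query and the EXPLAIN statement, which prints the execution plan Hive generates for it. The table page_views and its columns are hypothetical and exist only for this example; the plan output depends on the Hive version and execution engine.

```sql
-- Hypothetical table, used only for this illustration.
CREATE TABLE IF NOT EXISTS page_views (user_id INT, url STRING);

-- An ordinary SQL-like aggregate query written in HQL.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;

-- EXPLAIN prints the stages Hive plans to run for the query
-- instead of executing it.
EXPLAIN
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```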
HIVE
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analyzing and summarizing
large amounts of data
• Access to files on various data stores such as HDFS and HBase
• With Hive, users can skip writing complex MapReduce programs.
• Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).
HIVE
• Hive is not a relational database.
• It is not designed for OnLine Transaction Processing (OLTP).
• It is not a language for real-time queries and row-level updates.
Features of HIVE
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides a SQL-like query language called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
HIVE Architecture
Hive – Client
• Hive allows writing applications in various languages, including Java,
Python, and C++. It supports different types of clients:
• Thrift Server - A cross-language service provider platform that
serves requests from all programming languages that
support Thrift.
• JDBC Driver - Used to establish a connection between Hive and
Java applications. The driver class is
org.apache.hadoop.hive.jdbc.HiveDriver for the original HiveServer and
org.apache.hive.jdbc.HiveDriver for HiveServer2.
• ODBC Driver - Allows applications that support the ODBC
protocol to connect to Hive.
Hive – User Interface
• Hive is data warehouse infrastructure software that enables
interaction between the user and HDFS.
• The user interfaces that Hive supports are
• Hive Web UI: It provides a web-based GUI for
executing Hive queries and commands.
• Hive command line: It is a shell where we can execute Hive queries
and commands.
• Hive server: It is referred to as Apache Thrift Server. It
accepts requests from different clients and passes them to the
Hive Driver.
Hive - Driver
Hive - MetaStore
• Hive MetaStore - A central repository that stores all the
structure information of the various tables and partitions in the
warehouse.
• It also includes metadata about each column and its type, the serializers and deserializers
used to read and write data, and the corresponding HDFS files
where the data is stored.
Apache Hive Installation
• Verify the Java installation → $ java -version
• Verify the Hadoop installation → $ hadoop version
• Download the Apache Hive tar file.
• http://mirrors.estointernet.in/apache/hive/hive-1.2.2/
• Extract the downloaded tar file.
• tar -xvf apache-hive-1.2.2-bin.tar.gz
• Open the bashrc file. → $ nano ~/.bashrc
• Provide the following HIVE_HOME path.
• export HIVE_HOME=/home/user/local/apache-hive-1.2.2-bin
• export PATH=$PATH:$HIVE_HOME/bin
• Update the environment variable. → $ source ~/.bashrc
• Start Hive → $ hive
Data Hierarchy in Hive
• Hive is organised hierarchically into:
• Databases: namespaces that separate tables and other objects
• Tables: homogeneous units of data with the same schema
• Analogous to tables in an RDBMS
• Partitions: determine how the data is stored
• Allow efficient access to subsets of the data
• Buckets/clusters
• For sub-sampling within a partition
• Join optimization
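The hierarchy above can be sketched in HiveQL roughly as follows. The table demo.employee_by_dept, its columns, the partition value 'sales', and the bucket count of 4 are all hypothetical choices for illustration, and the sketch assumes a database named demo already exists.

```sql
-- A table inside the demo database, partitioned by department
-- and bucketed on id within each partition.
CREATE TABLE IF NOT EXISTS demo.employee_by_dept (
  id     INT,
  name   STRING,
  salary FLOAT
)
PARTITIONED BY (dept STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Filtering on the partition column lets Hive read only that partition.
SELECT * FROM demo.employee_by_dept WHERE dept = 'sales';

-- Buckets allow sampling: read roughly one quarter of the rows.
SELECT * FROM demo.employee_by_dept TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```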
Hive
• Data Types
• DDL Commands
• DML Operations
• Data Retrieval Queries
Hive Data Types
• Basic datatypes
• Numbers
• Date / Time
• Strings
• Complex data types
HIVE DATA TYPES
Integer Types
Type      Size                    Range
TINYINT   1-byte signed integer   -128 to 127
SMALLINT  2-byte signed integer   -32,768 to 32,767
INT       4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT    8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Types
Type      Size     Description
FLOAT     4-byte   Single precision floating point number
DOUBLE    8-byte   Double precision floating point number

HIVE DATA TYPES …
Date/Time Types
• TIMESTAMP
• Supports UNIX timestamps with optional nanosecond precision.
• "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision)
• DATE
• A DATE value specifies a particular year, month and
day, in the form YYYY-MM-DD.
• It does not include the time of day. The range of the DATE type is
0000-01-01 to 9999-12-31.
HIVE DATA TYPES…
String Types
• VARCHAR
• A variable-length type declared with a length between
1 and 65535, which sets the maximum number of
characters allowed in the character string.
• CHAR
• A fixed-length type whose maximum length is
255.
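The primitive types described so far might be combined in a table definition as in the following minimal sketch. The table demo.type_samples and its column names are hypothetical, a database named demo is assumed to exist, and the DATE, VARCHAR, and CHAR types assume Hive 0.13 or later.

```sql
-- Hypothetical table illustrating the primitive types described above.
CREATE TABLE IF NOT EXISTS demo.type_samples (
  tiny_col   TINYINT,
  small_col  SMALLINT,
  int_col    INT,
  big_col    BIGINT,
  float_col  FLOAT,
  double_col DOUBLE,
  ts_col     TIMESTAMP,
  date_col   DATE,
  name_col   VARCHAR(50),   -- up to 50 characters
  code_col   CHAR(10)       -- fixed length of 10 characters
);

-- String literals in the documented formats can be cast to the date/time types.
SELECT CAST('2021-09-27' AS DATE),
       CAST('2021-09-27 10:15:30.123456789' AS TIMESTAMP);
```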
HIVE DATA TYPES…
Complex Type
Type    Description                                                                         Example
Struct  Similar to a C struct or an object; fields are accessed using the "dot" notation.   struct('James','Roy')
Map     Contains key-value tuples; fields are accessed using array notation.                map('first','James','last','Roy')
Array   A collection of values of the same type, indexable using zero-based integers.       array('James','Roy')
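A minimal sketch of how the three complex types might be declared and accessed follows. The table demo.person_profile, its columns, the map key 'email', and the chosen delimiters are hypothetical and only illustrate the access notation.

```sql
-- Hypothetical table with one column of each complex type.
CREATE TABLE IF NOT EXISTS demo.person_profile (
  id        INT,
  full_name STRUCT<first:STRING, last:STRING>,
  contact   MAP<STRING, STRING>,
  nicknames ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';

-- Struct fields use dot notation, map values are looked up by key,
-- and array elements are addressed by a zero-based index.
SELECT full_name.first,
       contact['email'],
       nicknames[0]
FROM demo.person_profile;
```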
Hive DDL

Hive Data Definition Language (DDL)


DDL
• Used to describe data and data structures of a database
• Like SQL DDL, Hive DDL is used for managing, creating,
altering and dropping databases, tables and other objects in
a database.

DDL Commands in Hive
• Create
• Alter
• Drop
• Show
• Truncate
• Describe

Hive - Create Database
hive> show databases;

hive> create database demo;


Hive - Drop Database
hive> show databases;

hive> drop database demo;

hive> show databases;
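The create and drop commands above fail if the database already exists or is missing, respectively. A slightly more defensive variant, sketched here on the same demo database, uses the optional clauses that Hive provides for these statements:

```sql
-- IF NOT EXISTS avoids an error when the database is already present.
CREATE DATABASE IF NOT EXISTS demo;

-- List only the databases whose names match a pattern.
SHOW DATABASES LIKE 'de*';

-- IF EXISTS avoids an error when the database is missing; CASCADE also drops
-- any tables it contains (by default a non-empty database cannot be dropped).
DROP DATABASE IF EXISTS demo CASCADE;
```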


Hive - Alter Database
• Adds database properties or modifies existing ones
ALTER Database Command 1
Syntax:
DATABASE and SCHEMA are interchangeable; either keyword can be used.
SCHEMA in ALTER was added in Hive 0.14.0.

ALTER (DATABASE|SCHEMA) <database_name> SET DBPROPERTIES ('<property_name>'='<property_value>', ...);

Step 1: Create a database with the name student

hive> CREATE DATABASE student;
Step 2: Use ALTER to add properties to the database

hive> ALTER DATABASE student SET DBPROPERTIES ('batch' = 'IIITK-Batch2021', 'Date' = '2021-09-27');

Step 3: Describe the database to see the effect

hive> DESCRIBE DATABASE EXTENDED student;

Step 4: Let's change an existing property to see the effect. In our example, we
are changing the batch from 'IIITK-Batch2021' to 'IIITK-Batch2021-Set1'

hive> ALTER DATABASE student SET DBPROPERTIES ('batch' = 'IIITK-Batch2021-Set1', 'Date' = '2024-09-27');
Hive - Alter Database
ALTER Database Command 2
• The command below changes the database
directory on HDFS.

• LOCATION with ALTER is only available in Hive 2.2.1, 2.4.0, and later.
Note that changing the database location
does not move existing data to the newly specified location.

• It only changes the default parent directory, so tables added afterwards
are created under the new HDFS location.
Syntax:

ALTER (DATABASE|SCHEMA) <database_name> SET LOCATION 'Path_on_HDFS';


Step 1: Describe the database student to see its parent directory.
By default, Hive stores its data at /user/hive/warehouse on HDFS.

hive> DESCRIBE DATABASE EXTENDED student;

Step 2: Use ALTER to change the parent-directory location
(NOTE: /hive_db is an existing directory on my HDFS).

hive> ALTER DATABASE student SET LOCATION 'hdfs://localhost:9000/hive_db';

Step 3: Describe the database student again to check whether the location has changed.

hive> DESCRIBE DATABASE EXTENDED student;


Hive - Alter Database
ALTER Database Command 2
Commands with Output

We have successfully changed the location of the student database.
Any tables subsequently created in this database will be stored under /hive_db.
Hive - Alter Database
ALTER Database Command 3

• The command below sets or changes the owner of the database, which can be a user or a
role.
• SET OWNER transfers ownership from the current owner to a new user or a
new role.
• By default, the user who creates the database is set as the owner of that
database.

Syntax:

ALTER DATABASE <database_name> SET OWNER [USER|ROLE] user_or_role;
Step 1: Change the user name associated with the student database.

First check the current owner:

hive> DESCRIBE DATABASE EXTENDED student;

Then change the database owner from dikshant to Ram:

hive> ALTER DATABASE student SET OWNER USER Ram;
Step 2: Now transfer ownership of the database to the role admin.

hive> ALTER DATABASE student SET OWNER ROLE admin;


Hive DDL – Table (Now)
• Create
• Alter
• Drop
• Show
• Truncate
• Describe
Hive DDL – Table - Create
Hive provides two types of table:
Internal table
External table

Step 1: Create a database first so that we can create tables inside it.
hive> CREATE DATABASE database_name;
hive> SHOW DATABASES;

Step 2: To work inside this database, switch to it with USE.

hive> USE database_name;

Step 3: Now, start creating tables in database_name

Hive DDL – Table - Create
Internal Table
• Internal tables are also called managed tables.
• The lifecycle of their data is controlled by Hive.
• By default, these tables are stored in a subdirectory under the directory
defined by hive.metastore.warehouse.dir (by default, /user/hive/warehouse).
• Internal tables are not flexible enough to share with other tools like Pig.
• If we drop an internal table, Hive deletes both the table schema and the
data.
Hive DDL – Table - Create
Create an internal table
hive> create table demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';

Metadata of the created table

hive> describe demo.employee;
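To check where Hive keeps a managed table and to confirm its type, DESCRIBE FORMATTED can be used in place of plain DESCRIBE. This is a small sketch against the demo.employee table created above; the exact layout of the output varies between Hive versions.

```sql
-- Prints the columns plus detailed table information, including the
-- table type (MANAGED_TABLE for an internal table) and its HDFS location.
DESCRIBE FORMATTED demo.employee;
```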
Hive DDL – Table - Create
Creating a table using the Existing Schema
• Hive allows creating a new table by using the schema of an
existing table.

hive> create table if not exists demo.copy_employee like demo.employee;

Here, the new table has the same schema as the existing table.
Only the table definition is copied, not the data.
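To see the difference between copying a schema and copying data, the following sketch contrasts the LIKE form with CREATE TABLE ... AS SELECT (CTAS). The table name demo.employee_backup is hypothetical.

```sql
-- LIKE copies only the table definition, so a newly created copy has no rows.
SELECT COUNT(*) FROM demo.copy_employee;

-- CTAS creates a table from a query result, so the rows are copied as well.
CREATE TABLE demo.employee_backup AS
SELECT * FROM demo.employee;
```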
Hive DDL – Table - Create
External Table
• An external table lets us create a table over data that is stored outside the Hive warehouse directory.

• Two keywords
external keyword - used to specify that the table is external
location keyword - used to specify the location of the table's data

• As the table is external, the data is not present in the Hive directory.

• Therefore, if we drop the table, the metadata of the table is
deleted, but the data still exists.

• In contrast, if we drop an internal table, Hive deletes both the
table schema and the data.
Hive DDL – Table - Create
External Table
To create an external table, follow the steps below:

Step 1: Create a directory on HDFS by using the following command:

$ hdfs dfs -mkdir /HiveDirectory

Step 2: Now, store the file in the created directory.

$ hdfs dfs -put hive/emp_details /HiveDirectory

Step 3: Create an external table using the following command:

hive> create external table emplist (Id int, Name string, Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';
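The drop behaviour of an external table can be checked from the Hive CLI with a sketch like the one below, which reuses the emplist table and the /HiveDirectory path from the steps above. The dfs command is the Hive CLI's built-in way of running HDFS file system commands.

```sql
-- Dropping an external table removes only its metadata from the metastore.
DROP TABLE emplist;

-- The files are still in HDFS; the Hive CLI can list them with its dfs command.
dfs -ls /HiveDirectory;

-- Re-creating the table over the same location makes the data queryable again.
CREATE EXTERNAL TABLE emplist (Id int, Name string, Salary float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/HiveDirectory';
```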
Hive - Load data into Table
• Once the internal table has been created, the next step is to load data into it.
• In Hive, we can easily load data from a file into a table.

Loading data from the local file system

hive> load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;

hive> select * from demo.employee;

Hive - Load data into Table
• If we want to add more data to the current table, execute the same query
again with the new file name.
hive> load data local inpath '/home/codegyani/hive/emp_details1' into table demo.employee;

hive> select * from demo.employee;
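The loads above append rows from files on the local file system. As a hedged sketch, two common variants of the same statement are shown below; the HDFS path /tmp/emp_details_hdfs is hypothetical.

```sql
-- Without LOCAL, the path refers to HDFS and the file is moved
-- into the table's directory rather than copied.
LOAD DATA INPATH '/tmp/emp_details_hdfs' INTO TABLE demo.employee;

-- OVERWRITE replaces the table's existing contents instead of appending to them.
LOAD DATA LOCAL INPATH '/home/codegyani/hive/emp_details'
OVERWRITE INTO TABLE demo.employee;
```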

Hive – Load data into Table
Load unmatched data
• If the data in one or more columns does not match the data type of the corresponding table
column, Hive does not throw an exception.
• Instead, it stores NULL in place of each unmatched value.
• Let's add one more file to the current table. This file contains unmatched data.

• Its third column contains string data, whereas the table expects float data, so the
values are unmatched.

Now, load the data into the table.

hive> load data local inpath '/home/codegyani/hive/emp_details2' into table demo.employee;

Hive – Load data into Table
Load unmatched data

hive> select * from demo.employee;
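To locate the rows affected by unmatched data, a query along these lines can help. It is a small sketch against the demo.employee table used above; the literal 'abc' is just an example of a non-numeric string.

```sql
-- Rows whose third field could not be read as a float show NULL in Salary.
SELECT * FROM demo.employee WHERE Salary IS NULL;

-- The same rule applies to explicit casts: a non-numeric string
-- yields NULL rather than an error.
SELECT CAST('abc' AS FLOAT);
```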

Hive - Alter Table
• In Hive, we can perform modifications in the
existing table like changing the table name, column
name, comments, and table properties.
• It provides SQL like commands to alter the table.

❑ Rename a Table
❑ Adding column
❑ Change Column
❑ Delete or Replace Column

Hive - Alter Table
❑ Rename a Table
change the name of an existing table

Syntax: ALTER TABLE old_table_name RENAME to new_table_name;

▪ existing tables present in the current database

hive> ALTER TABLE emp RENAME to employee_data;

Hive - Alter Table
❑ Adding column
Adds one or more columns to an existing table

Syntax: ALTER TABLE table_name ADD COLUMNS (column_name datatype);

Schema of the table and the data in its existing columns

• Add a new column to the table with the following command

hive> ALTER TABLE employee_data ADD COLUMNS (age int);

Updated schema of the table

Updated data of the table

• Since no data has been added to the new
column, Hive treats its values as NULL.
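The remaining operations from the Alter Table list, changing a column and replacing columns, might look like the sketch below. It assumes that employee_data has the columns id, name, salary, and age; the new column name emp_name is hypothetical.

```sql
-- CHANGE renames a column and/or changes its type:
-- old name, new name, new type.
ALTER TABLE employee_data CHANGE name emp_name string;

-- REPLACE COLUMNS swaps in a whole new column list; columns left out (here, age)
-- are dropped from the schema, while the underlying data files are not rewritten.
ALTER TABLE employee_data REPLACE COLUMNS (id int, emp_name string, salary float);
```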

A high-level comparison of SQL and HiveQL
