Chapter 5 Hive
■ What is Hadoop?
– Apache Hadoop is an open source software framework used to develop data processing applications.
– Applications built using Hadoop run on large data sets distributed across clusters of commodity computers.
■ Commodity computers are cheap and widely available; they are mainly useful for achieving greater computing power at low cost.
■ Core of Hadoop
– HDFS (Hadoop Distributed File System): the storage part, which spreads data across the cluster's nodes.
– MapReduce: the processing part, which runs computation on the nodes that hold the data.
■ Apache Hive is an open source data warehouse system built on top of Hadoop, used for querying and analyzing large datasets stored in Hadoop files.
■ Initially, you had to write complex Map-Reduce jobs; now, with the help of Hive, you merely submit SQL-like queries.
■ Hive is mainly targeted towards users who are comfortable with SQL.
■ Hive abstracts away the complexity of Hadoop. The main thing to notice is that there is no need to learn Java for Hive.
■ Hive generally runs on your workstation and converts your SQL query into a series of jobs for execution on a Hadoop cluster.
■ Apache Hive organizes data into tables. This provides a means for attaching the structure to data stored in HDFS.
Why Apache Hive?
■ Facebook faced a lot of challenges before implementing Apache Hive, such as the rapidly growing size of the data it had to store and analyze.
■ The traditional RDBMS could not handle the pressure. As a result, Facebook was looking out for better options.
■ Facebook then moved to Hadoop, but the difficulty of MapReduce programming and the mandatory Java knowledge made it an impractical solution for analysts comfortable with SQL.
■ Hence, Apache Hive allowed them to overcome the challenges they were facing.
■ JDBC (Java Database Connectivity) / ODBC (Open Database Connectivity) drivers are available, so standard client applications can connect to Hive.
■ Apache Hive saves developers from writing complex Hadoop MapReduce jobs for ad-hoc requirements.
■ Hive is very fast, scalable, and highly extensible. Since Apache Hive is so much like SQL, it is very easy for SQL developers to learn and use.
■ Hive reduces the complexity of MapReduce by providing an interface where the user can submit SQL queries. So, now
business analysts can play with Big Data using Apache Hive and generate insights.
■ It also provides file access to various data stores such as HDFS and HBase.
■ The most important feature of Apache Hive is that we don't have to learn Java in order to use it.
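For instance, an ad-hoc aggregation that would otherwise need a hand-written MapReduce job is a few lines of HiveQL (the table and column names here are illustrative):

SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department;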
Hive Architecture
■ Metastore:
■ It stores metadata for each of the tables, such as their schema and location.
■ Hive also includes partition metadata. This helps the driver track the progress of the various data sets distributed over the cluster. The metadata is stored in a traditional RDBMS. Hive metadata helps the driver keep track of the data and is highly crucial, so a backup server regularly replicates it for retrieval in case of data loss.
■ Driver:
■ The driver starts the execution of the statement by creating sessions. It monitors the life cycle and progress of
the execution.
■ The driver stores the necessary metadata generated during the execution of a HiveQL statement. It also acts as a collection point for the data or query results obtained after the Reduce operation.
■ Compiler:
■ It performs the compilation of the HiveQL query, converting the query into an execution plan. The plan contains the tasks and the steps MapReduce needs to perform to produce the output the query asks for.
■ The compiler in Hive first converts the query to an Abstract Syntax Tree (AST), checks it for compatibility and compile-time errors, and then converts the AST into a directed acyclic graph (DAG) of stages.
■ Optimizer
■ It aggregates the transformations together, such as converting a pipeline of joins to a single join, for better performance.
■ The optimizer can also split the tasks, such as applying a transformation on data before a reduce operation, to provide
better performance.
■ Executor
■ Once compilation and optimization are complete, the executor executes the tasks, taking care of pipelining them.
■ CLI, UI, and Thrift Server – The CLI (command-line interface) provides a user interface for an external user to interact with Hive. The Thrift server allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.
Hive Limitations
■ Not designed for online transaction processing (OLTP); it is best suited for batch analytics.
■ Query latency is high, so Hive is not meant for interactive, real-time use.
■ Row-level updates and deletes are limited.
Hive – Data Types
• Primitive types, e.g. INT, FLOAT, STRING, BOOLEAN, DATE
• Complex types:
• arrays: ARRAY<data_type>
• maps: MAP<key_type, value_type>
• structs: STRUCT<name : type, ...>
• unions: UNIONTYPE<type, ...>
Hive – Data Types – Example
CREATE TABLE employees (
  name STRING,                       -- e.g. "John"
  salary FLOAT,
  subordinates ARRAY<STRING>,        -- e.g. ["Michael", "Rumi"]
  deductions MAP<STRING, FLOAT>,     -- e.g. {"Insurance": 500.00, "Charity": 600.00}
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
  auth UNIONTYPE<INT, INT, STRING>   -- fbid, gid, or email
);
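Complex-type fields are queried with indexing, key lookup, and dot notation; a short sketch against this table:

SELECT name,
  subordinates[0],          -- array element
  deductions['Insurance'],  -- map lookup
  address.city              -- struct field
FROM employees;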
Hive – Metastore
• Metadata includes table schemas, locations, and partition information
• The default warehouse directory is /apps/hive/warehouse on HDFS
• By default, the database named "default" is selected as the current db for the current session
• SHOW TABLES lists the tables in the currently selected database, which is the "default" database
Hive – Metastore – Hands-on
• Login to Hue
• USE abhinav9884;
• SELECT * FROM x;
• DESCRIBE x;
• DESCRIBE FORMATTED x;
Hive – Tables
● Managed tables: Hive manages both the metadata and the data; dropping the table deletes the data from the warehouse directory.
● External tables: Hive manages only the metadata; dropping the table does not delete the data at the 'location'.
Hive – Managed Tables
CREATE TABLE nyse (
  exchange1 STRING,
  symbol1 STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
DESCRIBE nyse;
DESCRIBE FORMATTED nyse;
Hive – Loading Data – From Local Directory
● hadoop fs -copyToLocal /data/NYSE_daily
● Launch Hive
● use yourdatabase;
CREATE TABLE nyse_hdfs (
  exchange1 STRING,
  symbol1 STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
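A typical statement to load the local copy into this table (the relative path is an assumption):

LOAD DATA LOCAL INPATH 'NYSE_daily' INTO TABLE nyse_hdfs;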
Hive – Loading Data – From HDFS
● A table can also be created over data that already exists at an external location, e.g.:
location 's3n://paid/default-datasets/miniwikistats/';
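Loading from HDFS takes the same shape without the LOCAL keyword (the path is a placeholder). Note that a LOAD from HDFS moves the file into the table's directory rather than copying it:

LOAD DATA INPATH '/data/NYSE_daily' INTO TABLE nyse_hdfs;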
Hive – Select Statements
● Select all columns: SELECT * FROM nyse;
● Aggregate queries group rows per key and end in a clause like ... GROUP BY symbol1;
● SET hive.map.aggr=true; enables map-side (partial) aggregation for better aggregate performance.
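A complete aggregate query of this shape (the aggregated columns are illustrative):

SELECT symbol1, MAX(price_high) AS high, MIN(price_low) AS low
FROM nyse
GROUP BY symbol1;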
Hive – Saving Data
● In HDFS
INSERT OVERWRITE DIRECTORY 'onlycmc' SELECT * FROM nyse WHERE symbol1 = 'CMC';
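To save to the local file system instead, Hive accepts the LOCAL keyword (the directory is a placeholder):

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/onlycmc' SELECT * FROM nyse WHERE symbol1 = 'CMC';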
Hive – Tables – DDL -- ALTER
● Rename a table
ALTER TABLE x RENAME TO x1;
● Add columns
ALTER TABLE x1 ADD COLUMNS (b FLOAT, c INT);
Hive – Partitions
● Example data (name, department, year):
Jon, HR, 2012
Monica, Finance, 2015
Steve, Engineering, 2012
Michael, Marketing, 2015
Hive – Partitions – Hands-on
CREATE TABLE employees (
  name STRING,
  department STRING,
  somedate DATE
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Hive – Partitions – Hands-on
● Load dataset 2012.csv
● Load dataset 2015.csv (statements below)
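Typical statements for these two steps, one per partition (file paths are placeholders):

LOAD DATA LOCAL INPATH '2012.csv' INTO TABLE employees PARTITION (year='2012');
LOAD DATA LOCAL INPATH '2015.csv' INTO TABLE employees PARTITION (year='2015');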
Hive – Views
• A view divides a long and complicated query into smaller and more manageable pieces.
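A sketch against the nyse table from earlier (the view name is illustrative):

CREATE VIEW cmc_quotes AS
SELECT ymd, price_close FROM nyse WHERE symbol1 = 'CMC';
SELECT * FROM cmc_quotes;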
Hive – JSON SerDe – Hands-on
• Download the JSON SerDe binaries
• ADD JAR
hdfs:///data/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar;
• Create the table with
LOCATION '/user/abhinav9884/senti/upload/data/tweets_raw';
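A sketch of such a table using the openx JsonSerDe class shipped in that jar (the column list is an assumption):

CREATE EXTERNAL TABLE tweets_raw (
  id BIGINT,          -- hypothetical columns matching the JSON keys
  created_at STRING,
  text STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/abhinav9884/senti/upload/data/tweets_raw';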
Hive – Sorting & Distributing – Order By & Sort By
● ORDER BY x: guarantees a total ordering of the output, but pushes all rows through a single reducer.
● SORT BY x: sorts rows within each reducer, so the overall output is only partially ordered when several reducers run.
Hive – Sorting & Distributing – Distribute By
DISTRIBUTE By x
DISTRIBUTE By x
Hive – Sorting & Distributing – Cluster By
● CLUSTER BY x: shorthand for DISTRIBUTE BY x combined with SORT BY x.
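Side by side on the nyse table (a sketch; semantics as described above):

SELECT * FROM nyse ORDER BY symbol1;       -- total order, single reducer
SELECT * FROM nyse SORT BY symbol1;        -- sorted within each reducer only
SELECT * FROM nyse DISTRIBUTE BY symbol1;  -- same symbol1 routed to same reducer
SELECT * FROM nyse CLUSTER BY symbol1;     -- DISTRIBUTE BY + SORT BY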
ip STRING COMMENT 'IP Address of the User')
STORED AS SEQUENCEFILE;
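A complete statement of this shape (the table name is an assumption):

CREATE TABLE user_logs (
  ip STRING COMMENT 'IP Address of the User'
)
STORED AS SEQUENCEFILE;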
Hive – ORC Files
● The ORC (Optimized Row Columnar) format stores Hive data highly efficiently and improves performance for:
○ Reading
○ Writing
○ Processing
CREATE TABLE orc_table (
  first_name STRING,
  last_name STRING
) STORED AS ORC;
SELECT * FROM orc_table;
Hive – Managed & External Tables
• By default, table data lives under /apps/hive/warehouse
• We can override that location by mentioning 'location' in the CREATE TABLE clause
• LOAD moves the data if it is on HDFS, for both external and managed tables
• Dropping an external table does not delete the data at the 'location'
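For contrast with the managed nyse table, an external variant (the path is a placeholder):

CREATE EXTERNAL TABLE nyse_ext (
  symbol1 STRING,
  ymd STRING,
  price_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/abhinav9884/nyse_ext';
-- DROP TABLE nyse_ext; removes only the metadata, not the files at 'location'.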
Hive – Connecting to Tableau
• Tableau allows for instantaneous insight by transforming data into visually appealing, interactive visualizations called dashboards
Hive – Connecting to Tableau - Steps
• Download and install the Hortonworks ODBC driver for Apache Hive for your OS
https://hortonworks.com/downloads/
Hive – Connecting to Tableau – Hands-on
https://data-flair.training/
Thank you