
HIVE

Instructor: Oussama Derbel


Introduction

■ What is Hadoop?

– Apache Hadoop is an open-source software framework used to develop data-processing applications that execute in a distributed computing environment.

– Applications built using Hadoop run on large data sets distributed across clusters of commodity computers.

■ Commodity computers are cheap and widely available; they are mainly useful for achieving greater computational power at low cost.


Introduction

■ Core of Hadoop

– HDFS (Hadoop Distributed File System): the storage part

– MapReduce: the processing part


Introduction

■ Apache Hadoop consists of two sub-projects

1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs can process enormous volumes of data in parallel on large clusters of computation nodes.

2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications.


Note
MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them
on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
What is Hive?

■ Apache Hive is an open-source data warehouse system built on top of Hadoop, used for querying and analyzing large datasets stored in Hadoop files.

■ Earlier, users had to write complex MapReduce jobs; with Hive, they merely submit SQL-like queries.

■ Hive is mainly targeted towards users who are comfortable with SQL.

■ Hive uses a language called HiveQL (HQL), which is similar to SQL.

■ Hive automatically translates HiveQL queries into MapReduce jobs.


What is Hive?

■ Hive abstracts the complexity of Hadoop; notably, there is no need to learn Java to use Hive.

■ Hive generally runs on your workstation and converts your SQL query into a series of jobs for execution on a Hadoop cluster.

■ Apache Hive organizes data into tables, which provides a means of attaching structure to data stored in HDFS.
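
■ As a minimal sketch of that workflow (the table and query here are illustrative, not from the slides), the statements below are plain HiveQL, and Hive compiles the SELECT into MapReduce jobs behind the scenes:

CREATE TABLE page_hits (url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Compiled by Hive into one or more MapReduce jobs
SELECT url, SUM(hits) AS total_hits
FROM page_hits
GROUP BY url;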
Why Apache Hive?

■ Facebook faced a lot of challenges before the implementation of Apache Hive: the size of the data being generated exploded, making it very difficult to handle.

■ The traditional RDBMS could not handle the pressure, so Facebook looked for better options.

■ To overcome this problem, Facebook initially tried using MapReduce.

■ But MapReduce was difficult to program, while most users' knowledge was SQL, which made it an impractical solution.

■ Hence, Apache Hive allowed them to overcome the challenges they were facing.
Why Apache Hive?

■ With Apache Hive, we are now able to perform the following:

– Schema flexibility and evolution

– Tables can be partitioned and bucketed

– Apache Hive tables are defined directly in HDFS

– JDBC (Java Database Connectivity) / ODBC (Open Database Connectivity) drivers are available
Why Apache Hive?

■ Apache Hive saves developers from writing complex Hadoop MapReduce jobs for ad-hoc requirements.

■ Hence, Hive provides summarization, analysis, and querying of data.

■ Hive is scalable and highly extensible. Since HiveQL is similar to SQL, it is very easy for SQL developers to learn and write Hive queries.

■ Hive reduces the complexity of MapReduce by providing an interface where the user can submit SQL queries. Business analysts can now work with Big Data using Apache Hive and generate insights.

■ It also provides file access on various data stores such as HDFS and HBase.

■ The most important feature of Apache Hive is that we don't have to learn Java to use it.
Hive Architecture

■ Metastore:

■ It stores the metadata for each table, such as its schema and location.

■ Hive also stores partition metadata, which helps the driver track the various data sets distributed over the cluster. The metadata is kept in a traditional RDBMS. Because this metadata is crucial for keeping track of the data, a backup server regularly replicates it so that it can be recovered in case of data loss.
Hive Architecture

■ Driver:

■ It acts as a controller that receives the HiveQL statements.

■ The driver starts the execution of a statement by creating sessions, and it monitors the life cycle and progress of the execution.

■ The driver stores the metadata generated during the execution of a HiveQL statement, and it acts as a collection point of the data or query result obtained after the Reduce operation.


Hive Architecture

■ Compiler:

■ It compiles the HiveQL query into an execution plan. The plan contains the tasks, as well as the steps the MapReduce jobs need to perform to produce the output the query asks for.

■ The compiler converts the query to an Abstract Syntax Tree (AST), first checking for compatibility and compile-time errors, and then converts the AST to a Directed Acyclic Graph (DAG).
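
■ You can inspect the plan Hive produces with the EXPLAIN statement; for example (using the nyse table created later in these slides):

EXPLAIN SELECT symbol1, AVG(price_open) FROM nyse GROUP BY symbol1;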


Hive Architecture

■ Optimizer

■ It performs various transformations on the execution plan to produce an optimized DAG.

■ It aggregates transformations together, such as converting a pipeline of joins into a single join, for better performance.

■ The optimizer can also split tasks, such as applying a transformation on data before a reduce operation, to provide better performance.
Hive Architecture

■ Executor

■ Once compilation and optimization are complete, the executor executes the tasks and takes care of pipelining them.

■ CLI, UI, and Thrift Server: the CLI (command-line interface) provides a user interface for an external user to interact with Hive. The Thrift server allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.
Hive Architecture

■ Architecture (diagram)

Hive Components (diagram)
Hive Limitations

• Does not provide row-level updates (in earlier versions)

• Not suitable for OLTP (Online Transaction Processing)

• Queries have higher latency

• Start-up overhead for MapReduce jobs

• Best suited when a large dataset is maintained and mined


Hive – Data Types – Numeric

• TINYINT (1-byte signed integer)

• SMALLINT (2-byte signed integer)

• INT (4-byte signed integer)

• BIGINT (8-byte signed integer)

• FLOAT (4-byte single precision floating point number)

• DOUBLE (8-byte double precision floating point number)

• DECIMAL (User defined precisions)
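
• For example, DECIMAL lets you pin down the precision and scale explicitly (a sketch; the table and column names are illustrative, and the precision/scale arguments are available in Hive 0.13 and later):

CREATE TABLE payments (amount DECIMAL(10,2));  -- 10 digits in total, 2 after the decimal point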


Hive – Data Types – Date/Time

• TIMESTAMP (as of Hive 0.8.0)

• DATE (as of Hive 0.12.0) – YYYY-MM-DD


Hive – Data Types – String

• STRING

• VARCHAR (as of Hive 0.12.0)

• CHAR (as of Hive 0.13.0)


Hive – Data Types – Misc

• BOOLEAN

• BINARY (as of Hive 0.8.0)


Hive – Data Types – Complex

• arrays: ARRAY<data_type>

• maps: MAP<primitive_type, data_type>

• structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>

• union: UNIONTYPE<data_type, data_type, ...> (as of Hive 0.7.0)


Hive – Data Types – Example

CREATE TABLE employees(
  name          STRING,              -- e.g. "John"
  salary        FLOAT,
  subordinates  ARRAY<STRING>,       -- e.g. ["Michael", "Rumi"]
  deductions    MAP<STRING, FLOAT>,  -- e.g. {"Insurance": 500.00, "Charity": 600.00}
  address       STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
  auth          UNIONTYPE<INT, INT, STRING>  -- fbid, gid, or email
);
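
Once the table exists, complex-typed columns are read with index, key, and dot notation:

SELECT name,
       subordinates[0],           -- first element of the ARRAY
       deductions['Insurance'],   -- value for one MAP key
       address.city               -- one field of the STRUCT
FROM employees;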
Hive -Metastore

• Stores the metadata of tables into a relational database

• Metadata includes

• Details of databases and tables

• Table definitions: name of table, columns, partitions etc.


Hive -Warehouse

• Hive tables are stored in the Hive warehouse directory, /apps/hive/warehouse on HDFS by default

• Or at the location specified in the table definition


Hive –Getting started – Command Line

• Log in to the CloudxLab Linux console

• Type “hive” to access the Hive shell

• By default, the database named “default” is selected as the current database for the session

• Type “SHOW DATABASES” to see the list of all databases


Hive –Getting started – Command Line

• “SHOW TABLES” will list the tables in the currently selected database, which is the “default” database.

• Create your own database with your login name:

• CREATE DATABASE abhinav9884;

• DESCRIBE DATABASE abhinav9884;

• DROP DATABASE abhinav9884;


Hive –Getting started – Command Line

• CREATE DATABASE abhinav9884;

• USE abhinav9884;

• CREATE TABLE x(a INT);


Hive –Getting started – Hue

• Login to Hue

• Click on “Query Editors” and select “Hive”

• Select your database (abhinav9884) from the list

• SELECT * FROM x;

• DESCRIBE x;

• DESCRIBE FORMATTED x;
Hive –Tables

●Managed tables

●External tables
Hive –Managed Tables

●Also called Internal

●Lifecycle managed by Hive

●Data is stored in the warehouse directory

●Dropping the table deletes data from warehouse


Hive –External Tables

●The lifecycle is not managed by Hive

●Hive assumes that it does not own the data

●Dropping the table does not delete the underlying data

●Metadata will be deleted
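
●The practical difference shows up on DROP; using the nyse and nyse_external tables defined on the next slides:

DROP TABLE nyse;           -- managed: deletes the metadata AND the data in the warehouse
DROP TABLE nyse_external;  -- external: deletes the metadata only; the files stay at LOCATION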


Hive – Managed Tables – Example

CREATE TABLE nyse(
  exchange1        STRING,
  symbol1          STRING,
  ymd              STRING,
  price_open       FLOAT,
  price_high       FLOAT,
  price_low        FLOAT,
  price_close      FLOAT,
  volume           INT,
  price_adj_close  FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

DESCRIBE nyse;
DESCRIBE FORMATTED nyse;
Hive – Loading Data – From Local Directory

● hadoop fs -copyToLocal /data/NYSE_daily

● Launch Hive

● use yourdatabase;

● load data local inpath 'NYSE_daily' overwrite into table nyse;

● Copies the data from the local file system to the warehouse


Hive – Loading Data – From HDFS

CREATE TABLE nyse_hdfs(
  exchange1        STRING,
  symbol1          STRING,
  ymd              STRING,
  price_open       FLOAT,
  price_high       FLOAT,
  price_low        FLOAT,
  price_close      FLOAT,
  volume           INT,
  price_adj_close  FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Hive – Loading Data – From HDFS

●Copy /data/NYSE_daily to your home directory in HDFS

●load data inpath 'hdfs:///user/abhinav9884/NYSE_daily' overwrite into table nyse_hdfs;

●Moves the data from the specified location to the warehouse

●Check whether NYSE_daily is still in your home directory in HDFS


Hive – External Tables

CREATE EXTERNAL TABLE nyse_external(
  exchange1        STRING,
  symbol1          STRING,
  ymd              STRING,
  price_open       FLOAT,
  price_high       FLOAT,
  price_low        FLOAT,
  price_close      FLOAT,
  volume           INT,
  price_adj_close  FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/abhinav9884/NYSE_daily';

describe formatted nyse_external;
Hive – S3-Based External Table

create external table miniwikistats(
  projcode   string,
  pagename   string,
  pageviews  int,
  bytes      int)
partitioned by (dt string)
row format delimited fields terminated by ' '
lines terminated by '\n'
location 's3n://paid/default-datasets/miniwikistats/';
Hive – Select Statements

● Select all columns

SELECT * FROM nyse;

● Select only required columns

SELECT exchange1, symbol1 FROM nyse;


Hive – Aggregations

●Find the average opening price for each stock

SELECT symbol1, AVG(price_open) AS avg_price FROM nyse
GROUP BY symbol1;

●To improve performance, enable top-level aggregation in the map phase:

SET hive.map.aggr=true;
Hive – Saving Data

●In the local filesystem

insert overwrite local directory '/home/abhinav9884/onlycmc'
select * from nyse where symbol1 = 'CMC';

●In HDFS

insert overwrite directory 'onlycmc' select * from nyse where symbol1 = 'CMC';
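
●Since Hive 0.11 the output delimiter can also be specified when exporting; a sketch based on the same query:

insert overwrite local directory '/home/abhinav9884/onlycmc'
row format delimited fields terminated by ','   -- write CSV instead of the default \001 delimiter
select * from nyse where symbol1 = 'CMC';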
Hive – Tables – DDL – ALTER

● Rename a table

ALTER TABLE x RENAME TO x1;

● Change the datatype of a column

ALTER TABLE x1 CHANGE a a FLOAT;

● Add columns to an existing table

ALTER TABLE x1 ADD COLUMNS (b FLOAT, c INT);
Hive – Partitions

# First name, Department, Year of joining
Mark, Engineering, 2012
Jon, HR, 2012
Monica, Finance, 2015
Steve, Engineering, 2012
Michael, Marketing, 2015
Hive – Partitions– Hands-on

●Data is located at /data/bdhs/employees/ on HDFS

●Copy the data to your home directory in HDFS

hadoop fs -cp /data/bdhs/employees .

●Create the table

CREATE TABLE employees(
  name        STRING,
  department  STRING,
  somedate    DATE
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Hive – Partitions– Hands-on

● Load the dataset 2012.csv

load data inpath 'hdfs:///user/sandeepgiri9034/employees/2012.csv' into table employees partition(year='2012');

●Load the dataset 2015.csv

load data inpath 'hdfs:///user/sandeepgiri9034/employees/2015.csv' into table employees partition(year='2015');

● SHOW PARTITIONS employees;

● Check warehouse and metastore


Hive – Partitions– Summary

To avoid a full table scan:

•The data is stored in different files in the warehouse, as defined by the partitions

•Define the partitions using “PARTITIONED BY” in “CREATE TABLE”

•We can also add a partition later (see the sketch after this list)

•Partitioning can happen on multiple columns (year=2012, month=10, day=12)
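
For instance, a later partition can be added like this (a sketch, assuming the employees table above; the year value is illustrative):

ALTER TABLE employees ADD PARTITION (year='2016');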


Hive – Views

●SELECT * FROM employees WHERE department='Engineering';

●Create a view:

CREATE VIEW employees_engineering AS
SELECT * FROM employees WHERE department='Engineering';

●Now query the view:

SELECT * FROM employees_engineering;


Hive – Views- Summary

•Allows a query to be saved and treated like a table

•A logical construct: it does not store data

•Hides the query complexity

•Divides a long, complicated query into smaller, manageable pieces

•Similar to writing a function in a programming language


Hive – Load JSON Data

• Download the JSON SerDe binaries

• ADD JAR hdfs:///data/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar;

• Create the table

CREATE EXTERNAL TABLE tweets_raw (
  ...
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/abhinav9884/senti/upload/data/tweets_raw';
Hive – Sorting & Distributing – Order By

ORDER BY x

●Guarantees global ordering

●Data goes through just one reducer

●This is unacceptable for large datasets, as it overloads that single reducer

●You end up with one sorted file as output


Hive – Sorting & Distributing – Sort By

SORT BY x

●Orders the data at each of N reducers

●The number of reducers is roughly one per 1 GB of input

●You end up with N or more sorted files with overlapping ranges


Hive – Sorting & Distributing – Distribute By

DISTRIBUTE BY x

●Ensures each of N reducers gets non-overlapping ranges of x

●But doesn't sort the output of each reducer

●You end up with N or more unsorted files with non-overlapping ranges


Hive – Sorting & Distributing – Cluster By

CLUSTER BY x

●Gives global ordering

●Is the same as (DISTRIBUTE BY x and SORT BY x)

●CLUSTER BY is basically the more scalable version of ORDER BY
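
As a side-by-side sketch on the nyse table:

SELECT * FROM nyse ORDER BY symbol1;       -- one reducer, one globally sorted file
SELECT * FROM nyse SORT BY symbol1;        -- sorted within each reducer only
SELECT * FROM nyse DISTRIBUTE BY symbol1;  -- same symbol1 always lands on the same reducer
SELECT * FROM nyse CLUSTER BY symbol1;     -- DISTRIBUTE BY symbol1 plus SORT BY symbol1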


Hive – Bucketing

CREATE TABLE page_view(
  viewTime      INT,
  userid        BIGINT,
  page_url      STRING,
  referrer_url  STRING,
  ip            STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
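
Bucketing pays off when sampling; assuming the page_view table above, one bucket can be scanned instead of the whole table (older Hive versions also need hive.enforce.bucketing so that INSERTs actually write the 32 bucket files):

SET hive.enforce.bucketing = true;

-- Read roughly 1/32 of the data by sampling a single bucket
SELECT * FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);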
Hive – ORC files

●Optimized Row Columnar file format

●Provides a highly efficient way to store Hive data

●Improves performance when

○Reading

○Writing

○Processing

●Has a built-in index, min/max values, and other aggregations

●Proven in large-scale deployments

○Facebook uses the ORC file format for a 300+ PB deployment


Hive – ORC Files – Example

CREATE TABLE orc_table(
  first_name STRING,
  last_name  STRING
) STORED AS ORC;

INSERT INTO orc_table VALUES('John', 'Gill');

SELECT * FROM orc_table;

To know more, please visit https://orc.apache.org


Hive – Quick Recap

•Each table has a location

•By default the table is a directory under the location /apps/hive/warehouse

•We can override that location by specifying 'location' in the CREATE TABLE clause

•LOAD DATA copies the data if it is local

•LOAD DATA moves the data if it is on HDFS, for both external and managed tables

•Dropping a managed table deletes the data at the 'location'

•Dropping an external table does not delete the data at the 'location'

•The metadata is stored in a relational database: the Hive metastore


Hive – Connecting to Tableau

•Tableau is a visualization tool

•Tableau allows for instantaneous insight by transforming data into visually appealing, interactive visualizations called dashboards
Hive – Connecting to Tableau - Steps

•Download and install Tableau Desktop from https://www.tableau.com/products/desktop

•Download and install the Hortonworks ODBC driver for Apache Hive for your OS from https://hortonworks.com/downloads/
Hive – Connecting to Tableau – Hands-on

Visualize the top 10 stocks with the highest opening price on Dec 31, 2009
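
The HiveQL behind that dashboard would look something like this (assuming the nyse table loaded earlier, with ymd holding YYYY-MM-DD strings):

SELECT symbol1, price_open
FROM nyse
WHERE ymd = '2009-12-31'
ORDER BY price_open DESC
LIMIT 10;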


Hive – Quick Demo

1. Copy data from /data/ml100k/u.data into our HDFS home

2. Open Hive in Hue and run the following:

CREATE TABLE u_data(userid INT, movieid INT, rating INT, unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/sandeepgiri9034/u.data' OVERWRITE INTO TABLE u_data;

SELECT * FROM u_data LIMIT 5;

SELECT movieid, AVG(rating) AS ar FROM u_data GROUP BY movieid ORDER BY ar DESC;


Hive – Quick Demo

Join with MovieNames

CREATE VIEW top100m AS
SELECT movieid, AVG(rating) AS ar FROM u_data GROUP BY movieid ORDER BY ar DESC;

CREATE TABLE m_data(movieid INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/sandeepgiri9034/u.item' INTO TABLE m_data;

SELECT * FROM m_data LIMIT 100;

SELECT * FROM m_data, top100m WHERE top100m.movieid = m_data.movieid;


Hive – Lab

1. For each movie, how many users rated it?

2. For movies having more than 30 ratings, what is the average rating?
References

https://data-flair.training/
Thank you
