Module-4
Introduction to Hive
WHAT IS HIVE?
• Hive is a Data Warehousing tool that sits on top of Hadoop.
History of Hive
Recent releases of Hive
Hive Features
1. Its query language, HQL, is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs, lists and maps, as illustrated in the sketch after this list.
4. Hive supports SQL filters, group-by and order-by clauses.
5. Custom types and custom functions can be defined.
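A minimal HiveQL sketch of features 3 and 4, assuming a hypothetical employee table (the table and column names below are illustrative, not from the notes):
CREATE TABLE employee (
name STRING,
salary FLOAT,
dept STRING,
skills ARRAY<STRING>,
contact MAP<STRING, STRING>,
address STRUCT<city:STRING, pin:INT>
);
SELECT dept, COUNT(*)
FROM employee
WHERE salary > 50000
GROUP BY dept
ORDER BY dept;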
HIVE ARCHITECTURE
• Hive Architecture is depicted in Figure.
Hive architecture
• The various parts are as follows:
1. Hive Command-Line Interface (Hive CLI): The most commonly used interface to
interact with Hive.
2. Hive Web Interface: It is a simple Graphical User Interface to interact with Hive and to
execute queries.
3. Hive Server: This is an optional server. This can be used to submit Hive Jobs from a
remote client.
4. JDBC/ODBC: Jobs can be submitted from a JDBC client. One can write Java code to
connect to Hive and submit jobs to it.
5. Driver: Hive queries are sent to the driver for compilation, optimization and execution.
6. Metastore: Hive table definitions and mappings to the data are stored in a Metastore. A
Metastore consists of the following:
➢ Metastore service: Offers an interface to Hive.
➢ Database: Stores data definitions, mappings to the data and others.
• The metadata which is stored in the metastore includes IDs of Database, IDs of Tables,
IDs of Indexes, etc., the time of creation of a Table, the Input Format used for a Table, the
Output Format used for a Table, etc.
• The metastore is updated whenever a table is created or deleted from Hive.
• There are three kinds of metastore.
1. Embedded Metastore: This metastore is mainly used for unit tests. Here, only one process
is allowed to connect to the metastore at a time. This is the default metastore for Hive and uses
an embedded Apache Derby database. In this mode, both the database and the metastore service
run embedded in the main Hive Server process. Figure shows an Embedded Metastore.
Embedded Metastore.
2. Local Metastore: Metadata can be stored in any RDBMS component like MySQL. Local
metastore allows multiple connections at a time. In this mode, the Hive metastore service runs
in the main Hive Server process, but the metastore database runs in a separate process, and
can be on a separate host. Figure shows a Local Metastore.
Local Metastore
3. Remote Metastore: In this mode, the Hive driver and the metastore interface run in different
JVMs (which can run on different machines as well) as shown in Figure. This way the database can
be firewalled from the Hive user, and the database credentials are completely isolated from
the users of Hive.
Remote Metastore
HIVE FILE FORMATS
Text File
The default file format is text file. In this format, each record is a line in the file. In a text file,
different control characters are used as delimiters: ^A (octal 001) separates fields, ^B (octal 002)
separates the elements of an array or struct, ^C (octal 003) separates the key from the value in a
map, and \n terminates a record. The term field is used when overriding the default delimiter.
CSV and TSV files are supported as text files, and JSON or XML documents can also be stored
as text files.
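As an illustration, a table stored as a text file with the default field delimiter overridden to a tab (as in a TSV file) might be declared as follows; this is a sketch, and the table name is illustrative:
CREATE TABLE student_text (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;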
Sequential File
Sequential files are flat files that store binary key-value pairs. They include compression support,
which reduces CPU and I/O requirements.
RCFile (Record Columnar File)
RCFile stores data in a column-oriented manner, which makes aggregation operations
inexpensive. For example, consider a table which contains four columns, as shown in the
Table below.
Table: A table with four columns
Instead of only partitioning the table horizontally like a row-oriented DBMS (row-store),
RCFile partitions the table first horizontally and then vertically to serialize the data. Based on
a user-specified value, the table is first partitioned horizontally into multiple row groups.
As depicted in the Table below, the above table is partitioned into two row groups, with three
rows as the size of each row group.
Table: Table with two row groups
Next, within every row group, RCFile partitions the data vertically like a column-store, so the
table will be serialized as shown in the Table below.
Table: Table in RCFile Format
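The file format of a Hive table is chosen through the STORED AS clause. A minimal sketch for the sequential and RCFile formats (table names are illustrative):
CREATE TABLE student_seq (rollno INT, name STRING, gpa FLOAT)
STORED AS SEQUENCEFILE;
CREATE TABLE student_rc (rollno INT, name STRING, gpa FLOAT)
STORED AS RCFILE;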
Database
A database is like a container for data. It has a collection of tables which houses the data.
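The clauses described below are easiest to read against a concrete statement. The following sketch is representative; the comment string is illustrative, while the "Creator"/"JOHN" property matches the description that follows:
CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'Database to hold student details'
WITH DBPROPERTIES ('Creator' = 'JOHN');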
IF NOT EXISTS: It is an optional clause. The create database statement with the "IF NOT
EXISTS" clause creates the database only if it does not already exist. If a database with the
same name already exists, the statement simply succeeds without creating anything and without
raising an error.
COMMENT: This is to provide short description about the database.
WITH DBPROPERTIES: It is an optional clause. It is used to specify properties of the
database in the form of (key, value) pairs. In the above example, "Creator" is the
key and "JOHN" is the value.
We can use "SCHEMA" in place of "DATABASE" in this command.
Note: We have not specified the location where the Hive database will be created. By default,
all Hive databases are created under the default warehouse directory (set by the property
hive.metastore.warehouse.dir) as /user/hive/warehouse/database_name.db. But if we want to
specify our own location, then the optional LOCATION clause can be specified.
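For instance, a one-line sketch with an illustrative HDFS path (the path is hypothetical):
CREATE DATABASE STUDENTS LOCATION '/user/hive/mydatabases/students.db';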
By default, SHOW DATABASES lists all the databases available in the metastore. We can
use "SCHEMAS" in place of "DATABASES" in this command. The command has an
optional "Like" clause. It can be used to filter the database names using regular expressions
such as "*", "?", etc.
SHOW DATABASES LIKE "Stu*"
SHOW DATABASES like "Stud???s"
There is no command to show the current database, but the following setting keeps printing the
current database name as a suffix in the command-line prompt.
set hive.cli.print.current.db=true;
Now assume that the database "STUDENTS" has 10 tables within it. How do we delete the
complete database along with the tables contained therein? Use the command:
DROP DATABASE STUDENTS CASCADE;
By default the mode is RESTRICT which implies that the database will NOT be dropped if it
contains tables.
Note: The complete syntax is as follows:
DROP DATABASE [IF EXISTS] database_name [RESTRICT | CASCADE]
Tables
Hive provides two kinds of table:
1. Internal or Managed Table
2. External Table
Managed Table
1. Hive stores managed tables under the Hive warehouse folder.
2. The complete life cycle of table and data is managed by Hive.
3. When the internal table is dropped, it drops the data as well as the metadata.
When you create a table in Hive, by default it is an internal or managed table. If one needs to
create an external table, one will have to use the keyword "EXTERNAL".
To check whether an existing table is managed or external, use the below syntax:
DESCRIBE FORMATTED tablename;
It displays the complete metadata of a table. You will see a row called Table Type which will
display either MANAGED_TABLE or EXTERNAL_TABLE.
DESCRIBE FORMATTED STUDENT
External or Self-Managed Table
1. When the table is dropped, it retains the data in the underlying location.
2. External keyword is used to create an external table.
3. Location needs to be specified to store the dataset in that particular location; see the sketch below.
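A minimal sketch of an external table definition; the table name EXT_STUDENT is reused from the loading example that follows, while the HDFS location is hypothetical:
CREATE EXTERNAL TABLE EXT_STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/external/student';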
Loading Data into Table from File
Let us understand the difference between INTO TABLE and OVERWRITE TABLE with an
example: Assume the "EXT_STUDENT" table already had 100 records and the "student.tsv"
file has 10 records. After issuing the LOAD DATA statement with the INTO TABLE clause,
the table "EXT_STUDENT" will contain 110 records; however, the same LOAD DATA
statement with the OVERWRITE clause will wipe out all the former content from the table
and then load the 10 records from the data file.
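The two variants, as a sketch (the local path to student.tsv is illustrative):
LOAD DATA LOCAL INPATH '/home/hduser/student.tsv' INTO TABLE EXT_STUDENT; -- appends: 100 + 10 = 110 records
LOAD DATA LOCAL INPATH '/home/hduser/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT; -- replaces: table holds only the 10 new records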
Collection Data Types
Querying Table
Partitions
In Hive, the query reads the entire dataset even though a WHERE clause filter is specified on a
particular column. This becomes a bottleneck in most MapReduce jobs as it involves a huge
amount of I/O. So it is necessary to reduce the I/O required by the MapReduce job to
improve the performance of the query. A very common method to reduce I/O is data
partitioning.
Partitions split the larger dataset into more meaningful chunks.
Partition is of two types:
1. STATIC PARTITION: It is up to the user to specify the partition (the segregation unit)
into which the data from the file is to be loaded.
2. DYNAMIC PARTITION: The user is required to simply state the column based on which
the partitioning will take place. Hive will then create partitions based on the unique values in
that column.
Points to consider as you create partitions:
1. STATIC PARTITIONING implies that the user controls everything from defining the
PARTITION column to loading data into the various partitioned folders.
2. DYNAMIC PARTITIONING means Hive will intelligently get the distinct values of the
partitioned column and segregate data into respective partitions. There is no manual
intervention.
By default, dynamic partitioning is enabled in Hive. Also, by default it is strict, implying that
one is required to do one level of STATIC partitioning before Hive can perform DYNAMIC
partitioning inside this STATIC segregation unit.
In order to go with fully dynamic partitioning, we have to set the below property to nonstrict in
Hive.
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Static Partition
Static partitions comprise columns whose values are known at compile time.
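A hedged sketch of static partitioning, where the user names the partition while loading; the table, the state column and the local file path are illustrative:
CREATE TABLE student_part (rollno INT, name STRING, gpa FLOAT)
PARTITIONED BY (state STRING);
LOAD DATA LOCAL INPATH '/home/hduser/student_ca.tsv'
INTO TABLE student_part PARTITION (state = 'CA');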
Dynamic Partition
Dynamic partitions have columns whose values are known only at execution time.
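With dynamic partitioning, Hive creates one partition per distinct value of the partition column. A sketch continuing the hypothetical student_part table above; the staging table student_stage is also hypothetical, and the partition column must come last in the SELECT list:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE student_part PARTITION (state)
SELECT rollno, name, gpa, state FROM student_stage;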
Bucketing
Bucketing is similar to partitioning; however, there is a subtle difference between the two. With
partitioning, you need to create a partition for each unique value of the column. This may lead
to situations where you end up with thousands of partitions. This can be avoided by using
bucketing, in which you can limit the number of buckets that will be created.
A bucket is a file, whereas a partition is a directory.
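A minimal sketch of a bucketed table; the table name and the choice of four buckets are illustrative:
set hive.enforce.bucketing=true;
CREATE TABLE student_bucketed (rollno INT, name STRING, gpa FLOAT)
CLUSTERED BY (rollno) INTO 4 BUCKETS;
INSERT OVERWRITE TABLE student_bucketed
SELECT rollno, name, gpa FROM EXT_STUDENT;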
Views
In Hive, view support is available only from version 0.6 onwards. Views are purely logical
objects.
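A sketch of a view over the STUDENT table; the view name and the GPA filter are illustrative:
CREATE VIEW topper_student AS
SELECT rollno, name, gpa FROM STUDENT WHERE gpa > 4.0;
SELECT * FROM topper_student;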
Sub-Query
In Hive, sub-queries are supported only in the FROM clause (as of Hive 0.12). You need to specify
a name for the sub-query because every table in a FROM clause must have a name. The columns in
the sub-query select list should have unique names. The columns in the sub-query select list are
available to the outer query just like the columns of a table.
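A sketch of a FROM-clause sub-query over the STUDENT table; the alias t and the GPA threshold are illustrative:
SELECT t.name, t.avg_gpa
FROM (SELECT name, AVG(gpa) AS avg_gpa FROM STUDENT GROUP BY name) t
WHERE t.avg_gpa > 4.0;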
Joins
Joins in Hive are similar to joins in SQL.
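A sketch of an inner join, assuming a hypothetical MARKS table keyed on rollno (the MARKS table and its columns are illustrative):
SELECT s.rollno, s.name, m.subject, m.score
FROM STUDENT s JOIN MARKS m ON (s.rollno = m.rollno);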
Aggregation
RCFILE IMPLEMENTATION
RCFile (Record Columnar File) is a data placement structure that determines how to store
relational tables on computer clusters.
USER-DEFINED FUNCTION (UDF)
In Hive, you can use custom functions by defining the User-Defined Function (UDF).
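The UDF itself is written in Java and packaged as a jar; the HiveQL side of registering and calling it might look like the following sketch (the jar path, class name and function name are all hypothetical):
ADD JAR /home/hduser/udfs/myudfs.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';
SELECT to_upper(name) FROM STUDENT;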
Introduction to Pig
WHAT IS PIG?
• Apache Pig is a platform for data analysis.
• It is an alternative to MapReduce Programming.
• Pig was developed as a research project at Yahoo.
Refer Figure
PIG ON HADOOP
• Pig runs on Hadoop.
• Pig uses both Hadoop Distributed File System and MapReduce Programming.
• By default, Pig reads input files from HDFS.
• Pig stores the intermediate data (data produced by MapReduce jobs) and the output in
HDFS.
• However, Pig can also read input from and place output to other sources.
• Pig supports the following:
1. HDFS commands.
2. UNIX shell commands.
3. Relational operators.
4. Positional parameters.
5. Common mathematical functions.
6. Custom functions.
7. Complex data structures.
PIG PHILOSOPHY
Pig philosophy.
1. Pigs Eat Anything: Pig can process different kinds of data such as structured and
unstructured data.
2. Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other
sources such as files in the local file system.
3. Pigs are Domestic Animals: Pig allows you to develop user-defined functions and the
same can be included in the script for complex operations.
4. Pigs Fly: Pig processes data quickly.
PIG LATIN OVERVIEW
1. Pig Latin statements are basic constructs to process data using Pig.
2. Pig Latin statement is an operator.
3. An operator in Pig Latin takes a relation as input and yields another relation as output.
4. Pig Latin statements include schemas and expressions to process data.
5. Pig Latin statements should end with a semi-colon.
The following is a simple Pig Latin script to load, filter, and store "student" data.
A = load 'student' as (rollno, name, gpa);
A = filter A by gpa > 4.0;
A = foreach A generate UPPER(name);
STORE A INTO 'myreport';
1. Keywords are not case sensitive such as LOAD, STORE, GROUP, FOREACH, DUMP,
etc.
2. Relations and paths are case-sensitive.
3. Function names are case sensitive such as PigStorage, COUNT.
Table describes the simple data types supported in Pig. In Pig, fields of unspecified type are
treated as an array of bytes, which is known as bytearray.
Null: In Pig Latin, NULL denotes a value that is unknown or non-existent.
Table:Simple data types supported in Pig
RUNNING PIG
Interactive Mode
Pig can be run in interactive mode by invoking the grunt shell: type pig to get the grunt prompt.
Once you get the grunt prompt, you can type Pig Latin statements directly. Paths refer to HDFS
paths, and DUMP displays the result on the console, as in the sketch below.
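A representative grunt session; the HDFS path reuses the student file from the FILTER example later in these notes:
pig
grunt> A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
grunt> DUMP A;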
Batch Mode
"Pig Script" need to be created to run pig in batch mode. Write Pig Latin statements in a file
and save it with .pig extension.
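For example, the statements could be saved in a file such as student.pig (a hypothetical name, with an illustrative output path) and run as a batch job:
-- student.pig
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = filter A by gpa > 4.0;
STORE B INTO '/pigdemo/highgpa';
The script is then executed with: pig student.pig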
Local Mode
To run pig in local mode, you need to have your files in the local file system.
Syntax:
pig -x local filename
MapReduce Mode
To run Pig in MapReduce mode, you need access to a Hadoop cluster to read/write files. This is
the default mode of Pig.
Syntax:
pig filename
HDFS COMMANDS
We can work with all HDFS commands in Grunt shell. For example, you can create a
directory as shown below.
grunt> fs -mkdir /piglatindemos;
grunt>
RELATIONAL OPERATORS
FILTER
FILTER operator is used to select tuples from a relation based on specified conditions.
Objective: Find the tuples of those students whose GPA is greater than 4.0.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = filter A by gpa > 4.0;
DUMP B;
FOREACH
Use FOREACH when you want to do data transformation based on columns of data.
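A sketch that projects and transforms columns of the student relation (UPPER is a built-in Pig function):
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate rollno, UPPER(name);
DUMP B;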
GROUP
DISTINCT
DISTINCT operator is used to remove duplicate tuples. In Pig, DISTINCT operator works on
the entire tuple and NOT on individual fields.
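A sketch, reusing the relation A loaded in the FILTER example above:
B = distinct A;
DUMP B;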
LIMIT
ORDER BY
JOIN
It is used to join two or more relations based on values in the common field. It always
performs an inner join.
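A sketch of a join in Pig, assuming a hypothetical second relation holding department data keyed on rollno (the department.tsv file and its columns are illustrative):
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = load '/pigdemo/department.tsv' as (rollno:int, deptname:chararray);
C = join A by rollno, B by rollno;
DUMP C;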
UNION
SAMPLE
It is used to select a random sample of data based on the specified sample size.
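For example, to pick roughly 10% of the tuples of A (the sampling fraction is illustrative):
B = sample A 0.1;
DUMP B;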
EVAL FUNCTION
AVG
AVG is used to compute the average of numeric values in a single column bag.
MAX
MAX is used to compute the maximum of numeric values in a single column bag.
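Both functions operate on a bag, so the relation is typically grouped first. A sketch over the student relation:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = group A all;
C = foreach B generate AVG(A.gpa), MAX(A.gpa);
DUMP C;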
COUNT
TUPLE
PIGGY BANK
Pig users can use Piggy Bank functions in their Pig Latin scripts, and they can also share their
own functions via Piggy Bank.
Pig also allows you to create your own functions for complex analysis, as in the sketch below.
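A hedged sketch of registering a jar and calling a function from it; the jar path and alias are hypothetical, and the class path shown is only the typical Piggy Bank location for a string UPPER function, so treat it as an assumption:
REGISTER /home/hduser/udfs/piggybank.jar;
DEFINE MyUpper org.apache.pig.piggybank.evaluation.string.UPPER();
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate MyUpper(name);
DUMP B;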
PIG versus HIVE