
MODULE-4

Introduction to Hive, Introduction to Pig

Introduction to Hive

WHAT IS HIVE?
• Hive is a Data Warehousing tool that sits on top of Hadoop.

Hive - a data warehousing tool


• Hive is used to process structured data in Hadoop.
• The three main tasks performed by Apache Hive are:
1. Summarization
2. Querying
3. Analysis
• Facebook initially created Hive to manage its ever-growing volumes of log data.
• Later the Apache Software Foundation developed it as an open-source project, and it came to be known as Apache Hive.
• Hive makes use of the following:
1. HDFS for Storage.
2. MapReduce for execution.
3. Stores metadata/schemas in an RDBMS.
• Hive provides HQL (Hive Query Language) or HiveQL which is similar to SQL.
• Hive compiles HiveQL queries into MapReduce jobs and then runs them on the Hadoop cluster.
• It is designed to support OLAP (Online Analytical Processing).
• Hive provides extensive data type functions and formats for data summarization and
analysis.
Note:
1. Hive is not an RDBMS.
2. It is not designed to support OLTP (Online Transaction Processing).
3. It is not designed for real-time queries.
4. It is not designed to support row-level updates.

History of Hive and Recent Releases of Hive


• The history of Hive and recent releases of Hive are illustrated pictorially.

History of Hive
Recent releases of Hive

Hive Features
1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs, lists and maps.
4. Hive supports SQL filters, group-by and order-by clauses.
5. Custom Types, Custom Functions can be defined.

Hive Integration and Work Flow


• Figure depicts the flow of log file analysis.
• Explanation of the workflow: Hourly log data can be stored directly into HDFS; data cleansing is then performed on the log file. Finally, Hive table(s) can be created to query the log file.

Flow of log analysis file


Hive Data Units
1. Databases: The namespace for tables.
2. Tables: Set of records that have similar schema.
3. Partitions: Logical separations of data based on classification of the given information as per specific attributes. Once Hive has partitioned the data based on a specified key, it starts to assemble the records into specific folders as and when the records are inserted.
4. Buckets (or Clusters): Similar to partitions, but they use a hash function to segregate data and determine the cluster or bucket into which a record should be placed.
• Partitioning tables changes how Hive structures the data storage.
• Hive will create subdirectories reflecting the partitioning structure like
.../customers/country=ABC
• Although partitioning helps in enhancing performance and is recommended, having too many partitions may prove detrimental for some queries.
• Bucketing is another technique of managing large datasets.
• If we partition the dataset based on customer_ID, we would end up with far too many
partitions.
• Instead, if we bucket the customer table and use customer_id as the bucketing column, the value of this column will be hashed into a user-defined number of buckets.
• Records with the same customer_id will always be placed in the same bucket.
• Assuming we have far more customer_ids than the number of buckets, each bucket
will house many customer_ids.
• While creating the table you can specify the number of buckets that you would like
your data to be distributed in using the syntax "CLUSTERED BY (customer_id)
INTO XX BUCKETS"; here XX is the number of buckets.
When to Use Partitioning/Bucketing?
• Bucketing works well when the field has high cardinality (cardinality is the number of
values a column or field can have) and data is evenly distributed among buckets.
• Partitioning works best when the cardinality of the partitioning field is not too high.
• Partitioning can be done on multiple fields with an order (Year/Month/ Day) whereas
bucketing can be done on only one field.
• Figure shows how these data units are arranged in a Hive Cluster.

Data units as arranged in a Hive


• Figure describes the semblance of Hive structure with database.

Semblance of Hive structure with database


• A database contains several tables.
• Each table is constituted of rows and columns.
• In Hive, a table is stored as a folder and its partitions are stored as sub-directories within it.
• Buckets are stored as files.

HIVE ARCHITECTURE
• Hive Architecture is depicted in Figure.
Hive architecture
• The various parts are as follows:
1. Hive Command-Line Interface (Hive CLI): The most commonly used interface to
interact with Hive.
2. Hive Web Interface: It is a simple Graphical User Interface to interact with Hive and to execute queries.
3. Hive Server: This is an optional server. This can be used to submit Hive Jobs from a
remote client.
4. JDBC/ODBC: Jobs can be submitted from a JDBC Client. One can write a Java code to
connect to Hive and submit jobs on it.
5. Driver: Hive queries are sent to the driver for compilation, optimization and execution.
6. Metastore: Hive table definitions and mappings to the data are stored in a Metastore. A
Metastore consists of the following:
➢ Metastore service: Offers an interface to Hive.
➢ Database: Stores data definitions, mappings to the data and others.
• The metadata which is stored in the metastore includes IDs of Database, IDs of Tables,
IDs of Indexes, etc., the time of creation of a Table, the Input Format used for a Table, the
Output Format used for a Table, etc.
• The metastore is updated whenever a table is created or deleted from Hive.
• There are three kinds of metastore.
1. Embedded Metastore: This metastore is mainly used for unit tests. Here, only one process is allowed to connect to the metastore at a time. This is the default metastore for Hive; it uses the Apache Derby database. In this metastore, both the database and the metastore service run embedded in the main Hive Server process. Figure shows an Embedded Metastore.

Embedded Metastore.
2. Local Metastore: Metadata can be stored in any RDBMS component like MySQL. Local
metastore allows multiple connections at a time. In this mode, the Hive metastore service runs
in the main Hive Server process, but the metastore database runs in a separate process, and
can be on a separate host. Figure shows a Local Metastore.
Local Metastore
3. Remote Metastore: In this, the Hive driver and the metastore interface run on different
JVMs (which can run on different machines as well) as in Figure. This way the database can
be fire-walled from the Hive user and also database credentials are completely isolated from
the users of Hive.

Remote Metastore
HIVE DATA TYPES

Primitive Data Types

Collection Data Types


HIVE FILE FORMAT
The file formats in Hive specify how records are encoded in a file.

Text File
The default file format is text file. In this format, each record is a line in the file. In a text file, different control characters are used as delimiters. The delimiters are ^A (octal 001, separates all fields), ^B (octal 002, separates the elements in an array or struct), ^C (octal 003, separates key-value pairs), and \n (separates records). The keyword FIELDS is used when overriding the default field delimiter. The supported text files are CSV and TSV. JSON or XML documents too can be specified as text file.
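For instance (a hedged sketch; the table name and columns are illustrative), the field delimiter and the file format can be chosen at table-creation time:

CREATE TABLE weblog (ip STRING, url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

STORED AS SEQUENCEFILE or STORED AS RCFILE can be used instead to select the formats described below.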
Sequential File
Sequential files are flat files that store binary key-value pairs. They include compression support, which reduces the CPU and I/O requirements.
RCFile (Record Columnar File)
RCFile stores the data in a column-oriented manner, which ensures that aggregation operations are not expensive. For example, consider a table which contains four columns as shown in the Table below.
Table: A table with four columns

Instead of only partitioning the table horizontally like a row-oriented DBMS (row-store), RCFile partitions the table first horizontally and then vertically to serialize the data. Based on a user-specified value, the table is first partitioned into multiple row groups horizontally. As depicted in the Table below, the above Table is partitioned into two row groups by considering three rows as the size of each row group.
Table: Table with two row groups

Next, in every row group RCFile partitions the data vertically like column-store. So the table
will be serialized as shown in Table below.
Table: Table in RCFile Format

HIVE QUERY LANGUAGE (HQL)


Hive query language provides basic SQL-like operations. Here are a few of the tasks which HQL can do easily.
1. Create and manage tables and partitions.
2. Support various Relational, Arithmetic, and Logical Operators.
3. Evaluate functions.
4. Download the contents of a table to a local directory or result of queries to HDFS directory.
DDL (Data Definition Language) Statements
These statements are used to build and modify the tables and other objects in the database.
The DDL commands are as follows:
1. Create/Drop/Alter Database
2. Create/Drop/Truncate Table
3. Alter Table/Partition/Column
4. Create/Drop/Alter View
5. Create/Drop/Alter Index
6. Show
7. Describe
DML (Data Manipulation Language) Statements
These statements are used to retrieve, store, modify, delete, and update data in the database. The DML commands are as follows:
1. Loading files into table.
2. Inserting data into Hive Tables from queries.
Note: Hive 0.14 supports update, delete, and transaction operations.

Starting Hive Shell


To start Hive, go to the installation path of Hive and type as below:
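The original screenshot is not reproduced here; a minimal sketch, assuming HIVE_HOME points to the Hive installation directory, would be:

$ cd $HIVE_HOME/bin    # path depends on your installation
$ ./hive
hive>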

The sections have been designed as follows:


Objective: What is it that we are trying to achieve here?
Input (optional): What is the input that has been given to us to act upon?
Act: The actual statement/command to accomplish the task at hand.
Outcome: The result/output as a consequence of executing the statement.

Database
A database is like a container for data. It has a collection of tables which houses the data.
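The original example statement is not reproduced in this text; a minimal sketch of it (the database name and the comment text are assumed; the 'Creator'/'JOHN' property is the one discussed below) would be:

CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'Holds all student-related tables'
WITH DBPROPERTIES ('Creator' = 'JOHN');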

Explanation of the syntax:

IF NOT EXISTS: It is an optional clause. The CREATE DATABASE statement with the IF NOT EXISTS clause creates a database only if it does not already exist. However, if the database already exists, it simply notifies the user that a database with the same name already exists and does not show any error message.
COMMENT: This is to provide short description about the database.
WITH DBPROPERTIES: It is an optional clause. It is used to specify any properties of the database in the form of (key, value) pairs. In the above example, "Creator" is the key and "JOHN" is the value.
We can use "SCHEMA" in place of "DATABASE" in this command.
Note: We have not specified the location where the Hive database will be created. By default, all Hive databases will be created under the default warehouse directory (set by the property hive.metastore.warehouse.dir) as /user/hive/warehouse/database_name.db. But if we want to specify our own location, then the LOCATION clause can be specified. This clause is optional.

By default, SHOW DATABASES lists all the databases available in the metastore. We can use "SCHEMAS" in place of "DATABASES" in this command. The command has an optional "LIKE" clause. It can be used to filter the database names using wildcard patterns such as "*" and "?".
SHOW DATABASES LIKE "Stu*";
SHOW DATABASES LIKE "Stud???s";

DESCRIBE DATABASE EXTENDED shows the database's properties given under the DBPROPERTIES argument at the time of creation. We can use "SCHEMA" in place of "DATABASE" and "DESC" in place of "DESCRIBE" in this command.
In the above example, the ALTER DATABASE command is used to assign a new ('edited-by' = 'JAMES') pair into DBPROPERTIES. This can be verified by using the DESCRIBE DATABASE EXTENDED command:
hive> DESCRIBE DATABASE Student EXTENDED;

There is no command to show the current database, but the statement below keeps printing the current database name as a suffix in the command-line prompt.
set hive.cli.print.current.db=true;

Now assume that the database "STUDENTS" has 10 tables within it. How do we delete the
complete database along with the tables contained therein? Use the command:
DROP DATABASE STUDENTS CASCADE;
By default the mode is RESTRICT which implies that the database will NOT be dropped if it
contains tables.
Note: The complete syntax is as follows:
DROP DATABASE [IF EXISTS] database_name [RESTRICT | CASCADE]

Tables
Hive provides two kinds of table:
1. Internal or Managed Table
2. External Table

Managed Table
1. Hive stores managed tables under its warehouse folder.
2. The complete life cycle of table and data is managed by Hive.
3. When the internal table is dropped, it drops the data as well as the metadata.
When you create a table in Hive, by default it is an internal or managed table. If one needs to create an external table, one will have to use the keyword "EXTERNAL".

To check whether an existing table is managed or external, use the below syntax:
DESCRIBE FORMATTED tablename;
It displays complete metadata of a table. You will see one row called table type which will
display either MANAGED_TABLE OR EXTERNAL_TABLE.
DESCRIBE FORMATTED STUDENT;
External or Self-Managed Table
1. When the table is dropped, it retains the data in the underlying location.
2. External keyword is used to create an external table.
3. Location needs to be specified to store the dataset in that particular location.
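A hedged sketch of such a definition (the column list and the HDFS location are assumed; the table name EXT_STUDENT matches the loading example that follows):

CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/external/student';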
Loading Data into Table from File

Let us understand the difference between INTO TABLE and OVERWRITE TABLE with an example. Assume the "EXT_STUDENT" table already has 100 records and the "student.tsv" file has 10 records. After issuing the LOAD DATA statement with the INTO TABLE clause, the table "EXT_STUDENT" will contain 110 records; however, the same LOAD DATA statement with the OVERWRITE clause will wipe out all the former content from the table and then load the 10 records from the data file.
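By way of illustration (the local file path is assumed), the two variants would be written as:

LOAD DATA LOCAL INPATH '/home/hadoop/student.tsv' INTO TABLE EXT_STUDENT;
LOAD DATA LOCAL INPATH '/home/hadoop/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;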
Collection Data Types

Querying Table
Partitions
In Hive, a query reads the entire dataset even though a WHERE clause filter is specified on a particular column. This becomes a bottleneck in most MapReduce jobs as it involves a huge degree of I/O. So it is necessary to reduce the I/O required by the MapReduce job to improve the performance of the query. A very common method to reduce I/O is data partitioning.
Partitions split the larger dataset into more meaningful chunks.
Partition is of two types:
1. STATIC PARTITION: It is up to the user to mention the partition (the segregation unit) into which the data from the file is to be loaded.
2. DYNAMIC PARTITION: The user is required to simply state the column on the basis of which the partitioning will take place. Hive will then create partitions based on the unique values in the column on which partitioning is to be carried out.
Points to consider as you create partitions:
1. STATIC PARTITIONING implies that the user controls everything from defining the
PARTITION column to loading data into the various partitioned folders.
2. DYNAMIC PARTITIONING means Hive will intelligently get the distinct values of the partitioned column and segregate the data into the respective partitions. There is no manual intervention.
By default, dynamic partitioning is enabled in Hive. Also, by default it is strict, implying that one is required to do one level of STATIC partitioning before Hive can perform DYNAMIC partitioning inside this STATIC segregation unit.
In order to go with full dynamic partitioning, we have to set the below property to nonstrict in Hive.
hive> set hive.exec.dynamic.partition.mode=nonstrict;

Static Partition
Static partitions comprise columns whose values are known at compile time.
Dynamic Partition
Dynamic partitions have columns whose values are known only at execution time.
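A hedged sketch of the two approaches (the table STUDENT_PART, the grade column, and the file path are assumed for illustration):

-- Assumed table, partitioned by grade
CREATE TABLE STUDENT_PART (rollno INT, name STRING, gpa FLOAT)
PARTITIONED BY (grade STRING);

-- Static partition: the user names the target partition
LOAD DATA LOCAL INPATH '/home/hadoop/student_grade_A.tsv'
INTO TABLE STUDENT_PART PARTITION (grade = 'A');

-- Dynamic partitions: Hive derives them from the grade column of the source table
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE STUDENT_PART PARTITION (grade)
SELECT rollno, name, gpa, grade FROM STUDENT;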

Bucketing
Bucketing is similar to partitioning. However, there is a subtle difference between partitioning and bucketing. In a partition, you need to create a partition for each unique value of the column. This may lead to situations where you may end up with thousands of partitions. This can be avoided by using bucketing, in which you can limit the number of buckets that will be created. A bucket is a file, whereas a partition is a directory.
Views
In Hive, view support is available only in versions starting from 0.6. Views are purely logical objects.
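For example (a minimal sketch; the underlying STUDENT table and the filter are assumed):

CREATE VIEW STUDENT_TOPPERS AS
SELECT rollno, name FROM STUDENT WHERE gpa > 4.0;

SELECT * FROM STUDENT_TOPPERS;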

Sub-Query
In Hive, sub-queries are supported only in the FROM clause (Hive 0.12). You need to specify a name for the sub-query because every table in a FROM clause must have a name. The columns in the sub-query select list should have unique names. The columns in the sub-query select list are available to the outer query just like the columns of a table.
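For example (hedged; the table and columns are illustrative), a sub-query in the FROM clause must be given an alias such as t:

SELECT t.name, t.gpa
FROM (SELECT name, gpa FROM STUDENT WHERE gpa > 4.0) t;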
Joins
Joins in Hive are similar to SQL joins.
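A hedged sketch of an inner join (the DEPARTMENT table and the deptno column are assumed):

SELECT s.rollno, s.name, d.deptname
FROM STUDENT s JOIN DEPARTMENT d ON (s.deptno = d.deptno);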
Aggregation

Hive supports aggregation functions like avg, count, etc.
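For example (a minimal sketch over the assumed STUDENT table):

SELECT COUNT(*), AVG(gpa) FROM STUDENT;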


Group By and Having
Data in a column or columns can be grouped on the basis of values contained therein by using
"Group By". "Having" clause is used to filter out groups NOT meeting the specified
condition.
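A hedged sketch (the deptno column is assumed): group students by department and keep only departments whose average GPA exceeds 3.5.

SELECT deptno, AVG(gpa)
FROM STUDENT
GROUP BY deptno
HAVING AVG(gpa) > 3.5;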

RCFILE IMPLEMENTATION

RCFile (Record Columnar File) is a data placement structure that determines how to store
relational tables on computer clusters.
USER-DEFINED FUNCTION (UDF)
In Hive, you can use custom functions by defining the User-Defined Function (UDF).
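A hedged sketch of how a UDF is typically registered and used from the Hive shell (the jar path, function name, and Java class are hypothetical):

ADD JAR /path/to/my_hive_udf.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.ToUpper';
SELECT to_upper(name) FROM STUDENT;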

Introduction to Pig

WHAT IS PIG?
• Apache Pig is a platform for data analysis.
• It is an alternative to MapReduce Programming.
• Pig was developed as a research project at Yahoo.

Key Features of Pig


1. It provides an engine for executing data flows (how your data should flow). Pig processes
data in parallel on the Hadoop cluster.
2. It provides a language called "Pig Latin" to express data flows.
3. Pig Latin contains operators for many of the traditional data operations such as join, filter,
sort, etc.
4. It allows users to develop their own functions (User Defined Functions) for reading,
processing, and writing data.
THE ANATOMY OF PIG
The main components of Pig are as follows:
1. Data flow language (Pig Latin).
2. Interactive shell where you can type Pig Latin statements (Grunt).
3. Pig interpreter and execution engine.

Refer Figure

The anatomy of Pig.

PIG ON HADOOP
• Pig runs on Hadoop.
• Pig uses both Hadoop Distributed File System and MapReduce Programming.
• By default, Pig reads input files from HDFS.
• Pig stores the intermediate data (data produced by MapReduce jobs) and the output in
HDFS.
• However, Pig can also read input from and write output to other sources.
• Pig supports the following:
1. HDFS commands.
2. UNIX shell commands.
3. Relational operators.
4. Positional parameters.
5. Common mathematical functions.
6. Custom functions.
7. Complex data structures.

PIG PHILOSOPHY

Figure describes the Pig philosophy.

Pig philosophy.
1. Pigs Eat Anything: Pig can process different kinds of data such as structured and
unstructured data.
2. Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other
sources such as files in the local file system.
3. Pigs are Domestic Animals: Pig allows you to develop user-defined functions and the
same can be included in the script for complex operations.
4. Pigs Fly: Pig processes data quickly.

USE CASE FOR PIG: ETL PROCESSING


Pig is widely used for "ETL" (Extract, Transform, and Load). Pig can extract data from
different sources such as ERP, Accounting, Flat Files, etc. Pig then makes use of various
operators to perform transformation on the data and subsequently loads it into the data
warehouse. Refer Figure.

Pig: ETL Processing


PIG LATIN OVERVIEW

Pig Latin Statements

1. Pig Latin statements are basic constructs to process data using Pig.
2. A Pig Latin statement is an operator.
3. An operator in Pig Latin takes a relation as input and yields another relation as output.
4. Pig Latin statements include schemas and expressions to process data.
5. Pig Latin statements should end with a semi-colon.

Pig Latin Statements are generally ordered as follows:


1. LOAD statement that reads data from the file system.
2. Series of statements to perform transformations.
3. DUMP or STORE to display/store result.

The following is a simple Pig Latin script to load, filter, and store "student" data.
A = load 'student' as (rollno, name, gpa);
A = filter A by gpa > 4.0;
A = foreach A generate UPPER(name);
STORE A INTO 'myreport';

Note: In the above example A is a relation and NOT a variable.

Pig Latin: Keywords

Keywords are reserved. They cannot be used to name things.

Pig Latin: Identifiers

1. Identifiers are names assigned to fields or other data structures.


2. They should begin with a letter and be followed only by letters, numbers, and underscores.

Table describes valid and invalid identifiers.

Pig Latin: Comments

In Pig Latin, two types of comments are supported:


1. Single line comments that begin with "--".
2. Multiline comments that begin with "/* and end with */".

Pig Latin: Case Sensitivity

1. Keywords are not case sensitive such as LOAD, STORE, GROUP, FOREACH, DUMP,
etc.
2. Relations and paths are case-sensitive.
3. Function names are case sensitive such as PigStorage, COUNT.

Operators in Pig Latin

Table describes operators in Pig Latin.

Table:Operators in Pig Latin

DATA TYPES IN PIG

Simple Data Types

Table describes simple data types supported in Pig. In Pig, fields of unspecified types are considered as an array of bytes, which is known as bytearray.
Null: In Pig Latin, NULL denotes a value that is unknown or non-existent.
Table:Simple data types supported in Pig

Complex Data Types

Table describes complex data types in Pig.

Table:Complex data types in Pig

RUNNING PIG

Pig can run in two ways:


1. Interactive Mode
2. Batch Mode

Interactive Mode

Pig can run in interactive mode by invoking the Grunt shell. Type pig to get the Grunt shell, as shown below.

Once you get the grunt prompt, you can type the Pig Latin statement as shown below.

Here, the path refers to an HDFS path and DUMP displays the result on the console, as shown below.
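A hedged sketch of such a session (the file path matches the FILTER example later in this section):

grunt> A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
grunt> DUMP A;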
Batch Mode
"Pig Script" need to be created to run pig in batch mode. Write Pig Latin statements in a file
and save it with .pig extension.

EXECUTION MODES OF PIG

Pig can be executed in two modes:


1. Local Mode
2. MapReduce Mode

Local Mode
To run Pig in local mode, you need to have your files in the local file system.
Syntax:
pig -x local filename

MapReduce Mode
To run Pig in MapReduce mode, you need to have access to a Hadoop cluster to read/write files. This is the default mode of Pig.
Syntax:
pig filename

HDFS COMMANDS

We can work with all HDFS commands in the Grunt shell. For example, you can create a directory as shown below.
grunt> fs -mkdir /piglatindemos;
grunt>

RELATIONAL OPERATORS

FILTER

FILTER operator is used to select tuples from a relation based on specified conditions.

Objective: Find the tuples of those students whose GPA is greater than 4.0.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = filter A by gpa > 4.0;
DUMP B;

FOREACH

Use FOREACH when you want to do data transformation based on columns of data.
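For example (a minimal sketch reusing the student file from the FILTER example):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate rollno, UPPER(name);
DUMP B;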
GROUP

GROUP operator is used to group data.
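A hedged sketch (grouping the student relation by gpa):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = group A by gpa;
DUMP B;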

DISTINCT

DISTINCT operator is used to remove duplicate tuples. In Pig, DISTINCT operator works on
the entire tuple and NOT on individual fields.
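For example (a minimal sketch):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = distinct A;
DUMP B;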
LIMIT

LIMIT operator is used to limit the number of output tuples.
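For example (a minimal sketch returning at most three tuples):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = limit A 3;
DUMP B;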

ORDER BY

ORDER BY is used to sort a relation based on a specific field.
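For example (a minimal sketch sorting students by GPA in descending order):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = order A by gpa desc;
DUMP B;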


JOIN

It is used to join two or more relations based on values in a common field. By default, it performs an inner join.
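A hedged sketch (the department file and its schema are assumed):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = load '/pigdemo/department.tsv' as (rollno:int, deptname:chararray);
C = join A by rollno, B by rollno;
DUMP C;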

UNION

It is used to merge the contents of two relations.
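A hedged sketch (the two input files are assumed to share the same schema):

A = load '/pigdemo/student1.tsv' as (rollno:int, name:chararray, gpa:float);
B = load '/pigdemo/student2.tsv' as (rollno:int, name:chararray, gpa:float);
C = union A, B;
DUMP C;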


SPLIT

It is used to partition a relation into two or more relations.
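For example (a minimal sketch splitting students on GPA):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
SPLIT A INTO X IF gpa >= 4.0, Y IF gpa < 4.0;
DUMP X;
DUMP Y;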

SAMPLE

It is used to select a random sample of data based on the specified sample size.
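For example (a minimal sketch selecting roughly 10% of the tuples):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = sample A 0.1;
DUMP B;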
EVAL FUNCTION

AVG

AVG is used to compute the average of numeric values in a single column bag.
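For example (a minimal sketch; the whole relation is grouped into a single bag before applying AVG):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = group A all;
C = foreach B generate AVG(A.gpa);
DUMP C;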

MAX

MAX is used to compute the maximum of numeric values in a single column bag.
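Similarly (a minimal sketch):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = group A all;
C = foreach B generate MAX(A.gpa);
DUMP C;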

COUNT

COUNT is used to count the number of elements in a bag.
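For example (a minimal sketch counting all student tuples):

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = group A all;
C = foreach B generate COUNT(A);
DUMP C;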


COMPLEX DATA TYPES

TUPLE

A TUPLE is an ordered collection of fields.
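A hedged sketch (the input file and its tuple schema are assumed):

A = load '/pigdemo/studentloc.tsv' as (name:chararray, address:tuple(city:chararray, state:chararray));
B = foreach A generate name, address.city;
DUMP B;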


MAP

MAP represents a key/value pair.
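A hedged sketch (the input file and the map schema are assumed; the # operator looks up a value by key):

A = load '/pigdemo/studentmarks.tsv' as (name:chararray, marks:map[int]);
B = foreach A generate name, marks#'math';
DUMP B;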

PIGGY BANK

Pig users can use Piggy Bank functions in Pig Latin scripts, and they can also share their own functions through Piggy Bank.

USER-DEFINED FUNCTIONS (UDF)

Pig allows you to create your own functions for complex analysis.
PIG versus HIVE
