Unit 4 HIVE - PIG

HIVE

Agenda of Hive
• Introduction: What is HIVE?
• HIVE Architecture
• HIVE data Types
• HIVE File Formats
• HIVE Query Language(HiveQL)
• RCFile implementation
• SerDe,
• User-Defined Functions (UDF)
Introduction: What is HIVE?
• Hive is a data warehouse infrastructure
tool to process structured data in Hadoop.
• It resides on top of Hadoop to summarize
Big Data, and makes querying and
analyzing easy.
• Initially, Hive was developed by Facebook;
later the Apache Software Foundation took it
up and developed it further as an open-source
project under the name Apache Hive.
• It is used by different companies. For
example, Amazon uses it in Amazon
Elastic MapReduce.
Features of HIVE
• It stores the schema in a database and the processed data in
HDFS.
• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL
or HQL.
• It is familiar, fast, scalable, and extensible.
HIVE Architecture
HIVE Architecture
• This component diagram contains different units. The
following describes each unit and its operation:

User Interface: Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).

Metastore: Hive uses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and their HDFS mapping.
HIVE Architecture
HiveQL Process Engine: HiveQL is similar to SQL and is used to query the schema information in the Metastore. It is one of the replacements for the traditional approach of writing MapReduce programs: instead of writing a MapReduce program in Java, we can write a HiveQL query and have it processed as a MapReduce job.

Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results equivalent to MapReduce results, using the MapReduce model.

HDFS or HBASE: The Hadoop Distributed File System (HDFS) or HBase is the data storage layer used to store the data in the file system.
Working of HIVE
• The following diagram depicts the workflow between Hive and Hadoop.
HIVE Data Types
• All the data types in Hive are classified into four types,
given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
HIVE Data Types
Column Types
• Column types are used as the column data types of Hive. They are as follows:

• Integral Types
• Integer type data can be specified using integral data types, such as INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than that of INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
• The following table depicts the various INT data types:
HIVE Data Types
• String Types
• String type data can be specified using single quotes (' ') or double quotes (" "). It contains two data types: VARCHAR and CHAR. Hive follows C-style escape characters.

• The following table depicts the various CHAR data types:
HIVE Data Types
• Timestamp
• It supports the traditional UNIX timestamp with optional nanosecond precision. It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and the format "yyyy-mm-dd hh:mm:ss.ffffffffff".

• Dates
• DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.

• Decimals
• The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary-precision values. The syntax and an example are given below.
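As a hedged illustration (the table and column names are made up for this sketch), the DECIMAL syntax is DECIMAL(precision, scale), and a table using the column types discussed above might be declared as:

-- Hypothetical table illustrating Hive column types
CREATE TABLE employee (
  emp_id     INT,
  emp_name   VARCHAR(50),
  grade      CHAR(1),
  joined_at  TIMESTAMP,
  birth_date DATE,
  salary     DECIMAL(10,2)   -- up to 10 digits, 2 after the decimal point
);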
HIVE Data Types
• Literals

• The following literals are used in Hive:

• Floating Point Types

• Floating-point types are numbers with decimal points. Generally, this type of data is represented by the DOUBLE data type.

• Decimal Type

• Decimal type data is a floating-point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
HIVE Data Types
Null Value

• Missing values are represented by the special value NULL.

Complex Types

• The Hive complex data types are as follows:

• Arrays

• Arrays in Hive are used the same way they are used in Java.

Syntax:
ARRAY<data_type>
HIVE Data Types
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>

Structs
A struct in Hive is a record type that groups a set of named fields, each with its own data type and an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
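A minimal sketch (the table and field names are illustrative, not from the slides) showing the three complex types in a table definition and in a query:

CREATE TABLE customer_profile (
  name    STRING,
  phones  ARRAY<STRING>,
  props   MAP<STRING, STRING>,
  address STRUCT<street : STRING, city : STRING, zip : INT>
);

-- Accessing complex-type columns
SELECT phones[0], props['email'], address.city FROM customer_profile;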
HIVE File Format
• Types of Hadoop File Formats
• Hive tables in HDFS can be created using five different Hadoop file
formats:
• Text files
• Sequence files
• Avro data files
• RCFile format
• Parquet file format

• Let's learn about each Hadoop file format in detail.


HIVE File Format
1. Text files

• The Hive text file format is the default storage format, used to load data
from comma-separated values (CSV), tab-delimited, space-delimited, or
text files delimited by other special characters.

• You can use the text format to interchange data with other client
applications. The text file format is very common for most applications.
Data is stored in lines, with each line being a record.

• Each line is terminated by a newline character (\n). The text file
format storage option is defined by specifying "STORED AS
TEXTFILE" at the end of the table creation.
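For example (the table name and delimiter are illustrative), a text-file table might be created as follows:

CREATE TABLE sales_txt (
  id     INT,
  item   STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;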
HIVE File Format
2. Sequence File
• Sequence files are flat files consisting of binary key-value
pairs.
• When converting queries to MapReduce jobs, Hive chooses to
use the necessary key-value pairs for a given record.
• The key advantage of using a sequence file is that it
merges two or more files into one file.
• The sequence file format storage option is defined by
specifying "STORED AS SEQUENCEFILE" at the end of the
table creation.
HIVE File Format
3. Avro Data Files

(Data serialization is the process of converting complex data structures or objects into a format that can be easily stored, transmitted, or reconstructed later. A Remote Procedure Call (RPC) passes data between different processes or across a network by serializing the data at the sender's end and deserializing it at the receiver's end.)

• Avro is a remote procedure call and data serialization
framework that uses JSON for defining data types and
protocols, and serializes data in a compact binary format
to make it compact and efficient.

• This file format can be used in any of Hadoop's tools,
such as Pig and Hive. Avro is one of the common file formats
in applications based on Hadoop.

• The option to store the data in the Avro file format is
defined by specifying "STORED AS AVRO" at the end of
the table creation.
HIVE File Format
RCFILE FORMAT: (https://www.upsolver.com/blog/the-file-format-fundamentals-of-big-data)
• The Row Columnar (RC) file format is very similar to the
sequence file format.
• It also stores the data as key-value pairs and offers a high
row-level compression rate.
• It is used when there is a requirement to process
multiple rows at a time.
• RCFile format is supported by Hive version 0.6.0 and later.
• The RC file format storage option is defined by specifying
“STORED AS RCFILE” at the end of the table creation.
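A hedged sketch (the table name is illustrative) of creating an RCFile-backed table:

CREATE TABLE sales_rc (
  id     INT,
  item   STRING,
  amount DOUBLE
)
STORED AS RCFILE;

-- Data can then be copied in from an existing table, for example:
-- INSERT OVERWRITE TABLE sales_rc SELECT * FROM sales_txt;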
HIVE File Format
Parquet files :
• Parquet files support complex nested data structures in a flat
format.
• Parquet is broadly accessible. It supports multiple coding
languages, including Java, C++, and Python, to reach a broad
audience. This makes it usable in nearly any big data setting.
• Parquet is also self-describing. It contains metadata that
includes file schema and structure. You can use this to separate
different services for writing, storing, and reading Parquet
files.
• Parquet files are composed of row groups, a header, and a footer.
Within each row group, the values of each column are stored together,
so the same columns are stored contiguously in each row group.
HIVE File Format
HIVE Query Language
• The Hive Query Language (HiveQL) is a SQL-like language used for
querying and managing data in the Apache Hive data warehouse
system. HiveQL provides a familiar interface for users who are
already accustomed to SQL syntax, making it easier to interact with
large-scale datasets stored in Hadoop's distributed file system
(HDFS).
• DDL and DML are the parts of HIVE QL
• Data Definition Language (DDL) is used for creating, altering and
dropping databases, tables, views, functions and indexes.
• Data Manipulation Language (DML) is used to put data into Hive tables,
to extract data to the file system, and to explore and manipulate data
with queries, grouping, filtering, joining, etc., as sketched below.
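As a brief, illustrative sketch (the database, table, and file path are hypothetical), typical DDL and DML statements look like this:

-- DDL: create a database and a table
CREATE DATABASE IF NOT EXISTS retail;
CREATE TABLE retail.orders (order_id INT, product STRING, price DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- DML: load data and query it
LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE retail.orders;
SELECT product, SUM(price) AS total FROM retail.orders GROUP BY product;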
HIVE Query Language
• Key Features of HiveQL:

1. SQL-like Syntax: HiveQL syntax resembles SQL, allowing users familiar with SQL to write queries for data analysis and manipulation.

2. Hive Metastore: HiveQL interacts with the Hive metastore, which stores metadata about tables, partitions, columns, and their corresponding HDFS file locations.
HIVE Query Language
3. Hadoop Ecosystem Integration: HiveQL seamlessly
integrates with various Hadoop ecosystem tools and
technologies, such as MapReduce, HDFS, YARN, etc
4. Table Creation and Manipulation: Users can create tables,
load data into them, alter table structures, and perform other
schema-related operations using HiveQL.
5. Data Querying and Transformation: HiveQL supports
querying, filtering, aggregating, joining, and transforming data
stored in HDFS using Hive's SQL-like syntax.
HIVE Query Language
HIVE Query Language
HIVE Query Language
• HiveQL enables users to perform complex data analysis,
transformations, and querying on large-scale datasets using
SQL-like syntax.

• However, it's important to note that while HiveQL provides
SQL abstraction, the underlying execution often involves
MapReduce or other execution engines, which might have
implications on performance and query optimization
strategies.
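For instance, the execution engine can usually be selected per session through the hive.execution.engine property (which values are available depends on how the Hive installation is configured):

-- Switch the underlying execution engine for the current session
SET hive.execution.engine=tez;   -- or 'mr' for classic MapReduce, 'spark' if configured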
RCFile implementation
RCFile implementation
RCFile implementation
Implementing RCFile (Conceptual Overview):

1. Column-Oriented Data Storage: When implementing RCFile, you need to organize data column-wise. Each column's values are stored together, potentially allowing for more efficient compression and retrieval of specific columns during query processing.

2. Indexing and Metadata: RCFile includes metadata and indexes to facilitate efficient data retrieval. These indexes help in locating the beginning of column chunks, allowing for quicker access to specific rows or columns.
RCFile implementation
Implementing RCFile (Conceptual Overview):

3. Compression Strategies: Implementing RCFile involves choosing and implementing compression algorithms for individual columns based on their data types and characteristics. Compression aims to reduce storage space and enhance I/O performance.

4. Integration with Processing Engines: RCFile integration typically involves working within a framework or system (like Hive) that understands the RCFile format. This involves reading, writing, and processing RCFiles efficiently.
SerDe
• "SerDe" stands for Serializer/Deserializer and refers to a crucial component in
Apache Hive that enables it to interface with various data formats, allowing for the
serialization of data when it's stored in Hive tables and the deserialization when the
data is queried or retrieved.

• Key Functions of SerDe in Hive:

1. Serialization: SerDe helps convert structured data from its internal representation
within Hive into a format suitable for storage in files or databases. This involves
converting data structures into a serialized format that can be written to storage.

2. Deserialization: When Hive reads data from storage (like files in HDFS), SerDe
performs the reverse operation, converting the stored format back into Hive's
internal representation of the data. This allows Hive to interpret and query the
data.
SerDe
3. Format Interpretation: SerDe understands the specifics of
different data formats, including their encoding, data types, delimiters,
and other characteristics. It ensures that data is properly interpreted
and handled according to the format's specifications.
4. Integration with Hive: SerDe integrates closely with Hive's query
engine, allowing Hive to support a wide range of file formats and data
types. SerDe enables Hive to interact with these various formats
seamlessly.
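As an illustrative sketch, a table can be bound to a specific SerDe at creation time; the example below assumes the OpenCSVSerde class shipped with recent Hive versions (the exact class name and properties may vary by distribution):

CREATE TABLE csv_events (event_id STRING, event_ts STRING, payload STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"'
)
STORED AS TEXTFILE;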
SerDe
User-Defined Functions (UDF)
In Hive, User-Defined Functions (UDFs) enable users to extend Hive's functionality by
creating custom functions to perform specific operations that aren't covered by built-in
Hive functions. UDFs allow users to write their own logic in Java, Python, or other
languages and integrate it into Hive queries.

• Types of User-Defined Functions (UDFs) in Hive:

1. UDF (User-Defined Function): These functions take zero or more input parameters
and return a single value. For instance, you might create a UDF to perform custom
string manipulations or mathematical operations.

2. UDTF (User-Defined Table Function): Unlike UDFs, UDTFs can generate multiple
rows and columns as output for a single input row. They are used when the output of a
function needs to be a table-like structure.

3. UDAF (User-Defined Aggregate Function): UDAFs are used to perform custom aggregation operations, such as computing custom aggregates like medians, weighted averages, etc.
User-Defined Functions (UDF)
Steps to Create a User-Defined Function (UDF) in Hive:

• Implement the Function: Write the custom logic for your function in Java,
Python, or another supported language. For Java-based UDFs, you'll typically
extend Hive's UDF or GenericUDF classes and override necessary methods.

• Compile the Code: Compile the code into a JAR file (for Java-based UDFs) or
prepare the script (for Python UDFs).

• Load the Function into Hive: Load the JAR file containing the UDF
implementation into Hive's environment using the ADD JAR command.

• Register the Function: Register the UDF with Hive using the CREATE
FUNCTION command, specifying the function name, class or script path, and
other necessary details.

• Use the Function: Once registered, you can use the UDF in your Hive queries
just like any built-in function.
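A hedged example of the registration and usage steps (the JAR path, class name, and function name are hypothetical):

-- Load the JAR containing the compiled UDF
ADD JAR /user/hive/udfs/my_udfs.jar;

-- Register the UDF under a name usable in queries
CREATE TEMPORARY FUNCTION to_upper_custom AS 'com.example.hive.udf.ToUpperCustom';

-- Use it like any built-in function
SELECT to_upper_custom(emp_name) FROM employee;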
User-Defined Functions (UDF)
PIG
Agenda of Pig
• Introduction: What is Pig?
• The anatomy of Pig
• Pig on Hadoop
• Pig philosophy
• Use Case for Pig: ETL Processing
• Pig Latin overview
• Data types in Pig
• Running Pig, Execution modes of Pig
• HDFS commands
• Relational operators
• Eval functions
• Complex Data Types
• Piggy Bank
• User-defined Functions
• Parameter substitution
• Diagnostic Operator
• Word Count Example using Pig
• When to use and not use Pig?
• Pig at Yahoo
• Pig vs HIVE
What is Apache Pig
• Apache Pig is a high-level data flow platform for executing
MapReduce programs of Hadoop. The language used for Pig is
Pig Latin.

• Pig can handle any type of data, i.e., structured, semi-structured,
or unstructured, and stores the corresponding results in the Hadoop
Distributed File System. Every task that can be achieved using Pig
can also be achieved using Java-based MapReduce programs.
Anatomy of Pig
• The anatomy of Pig involves several key components:

• Pig Latin: This is the scripting language used in Apache Pig. It's a data flow language that describes data transformations such as loading data, processing it, and storing the results.

• Parser: Pig Latin scripts are parsed by the parser, which checks the syntax and translates the scripts into an execution plan.

• Optimizer: Once the script is parsed, an optimizer restructures and optimizes the execution plan for better performance.

• Compiler: The optimized plan is then compiled into a series of MapReduce jobs that can be executed on a Hadoop cluster.
Pig on Hadoop
• Pig runs on Hadoop. It makes use of both the Hadoop Distributed File
System, HDFS, and Hadoop’s processing system, MapReduce.

• HDFS is a distributed filesystem that stores files across all of the nodes in a
Hadoop cluster. It handles breaking the files into large blocks and
distributing them across different machines, including making multiple
copies of each block so that if any one machine fails no data is lost. By
default, Pig reads input files from HDFS, uses HDFS to store intermediate
data between MapReduce jobs, and writes its output to HDFS.

• Pig uses MapReduce to execute all of its data processing. It compiles the Pig
Latin scripts that users write into a series of one or more MapReduce jobs
that it then executes.
Pig philosophy
Pigs eat anything
Pig can operate on data whether it has metadata or not. It can operate on
data that is relational, nested, or unstructured.

Pigs live anywhere


Pig is intended to be a language for parallel data processing. It is not tied
to one particular parallel framework.

Pigs are domestic animals


Pig is designed to be easily controlled and modified by its users.
Use Case for Pig- ETL Processing
Pig is commonly used for ETL (Extract, Transform, Load) processing in big data
scenarios. Here's how it can be applied in an ETL use case:

Scenario: A retail company wants to analyze its sales data, which is stored in
various formats across multiple sources, including CSV files, log files, and a
relational database.

ETL Process with Pig:

1.Extraction (E):
1. Pig can be used to extract data from diverse sources. For instance, it can
load CSV files, parse log files, and connect to a relational database using
Pig's built-in functions or custom loaders.
Use Case for Pig- ETL Processing
2. Transformation (T):
1. Once data is loaded, Pig facilitates transformation tasks. For example:
1.Cleaning data: Removing duplicates, handling missing values, and
standardizing formats.
2.Aggregation: Calculating total sales, average purchase amount, or
other statistical metrics.
3.Joining data: Merging information from different sources based on
common fields.
4.Data enrichment: Adding additional attributes or enriching data based
on business rules.
3. Load (L):
After transformation, Pig allows storing the processed data into various
output formats or systems like HDFS, HBase, relational databases, or even
directly into analytical tools.
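A minimal Pig Latin sketch of such an ETL flow (the file paths, field names, and schema are assumptions made for illustration):

-- Extract: load raw sales records from CSV files in HDFS
raw_sales = LOAD '/data/sales/*.csv' USING PigStorage(',')
            AS (order_id:int, store:chararray, amount:double, dt:chararray);

-- Transform: clean and aggregate
clean_sales  = FILTER raw_sales BY amount IS NOT NULL;
by_store     = GROUP clean_sales BY store;
store_totals = FOREACH by_store GENERATE group AS store, SUM(clean_sales.amount) AS total_sales;

-- Load: store the processed result back into HDFS
STORE store_totals INTO '/output/store_totals' USING PigStorage(',');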
Pig Latin overview
• Pig Latin is the high-level scripting language used in Apache Pig
for expressing data transformation and processing tasks.

• It abstracts the complexities of MapReduce programming and
allows users to write scripts to manipulate and analyze large
datasets on Apache Hadoop.
Data types in Pig
Running Pig & Execution modes of Pig
• You can run Pig locally on your machine, on your Hadoop cluster (grid),
or in the cloud (for example, as part of Amazon's Elastic MapReduce service).

• Apache Pig executes in two modes: Local Mode and MapReduce Mode.

• Local Mode
• In this mode, all the files are installed and run from your local host and
local file system. There is no need of Hadoop or HDFS. This mode is
generally used for testing purpose.

• MapReduce Mode
• MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig.
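For example, the execution mode is typically chosen with the -x flag when starting Pig (the script name is illustrative):

pig -x local wordcount.pig        (runs the script against the local file system; no Hadoop needed)
pig -x mapreduce wordcount.pig    (runs the script on the Hadoop cluster using HDFS)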
HDFS commands
• In Hadoop, HDFS (Hadoop Distributed File System) commands are used to interact with the file
system, manage files and directories, and perform various operations. Here are some fundamental
HDFS commands:
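A few commonly used commands (the paths are illustrative):

hdfs dfs -ls /user/hadoop                          (list files in a directory)
hdfs dfs -mkdir /user/hadoop/input                 (create a directory)
hdfs dfs -put localfile.txt /user/hadoop/input     (copy a local file into HDFS)
hdfs dfs -cat /user/hadoop/input/localfile.txt     (display a file's contents)
hdfs dfs -get /user/hadoop/input/localfile.txt .   (copy a file from HDFS to the local disk)
hdfs dfs -rm /user/hadoop/input/localfile.txt      (delete a file)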
HDFS commands
Relational operators
• In Apache Pig's Pig Latin scripting language, relational operators are used to
perform various transformations and operations on data. These operators help in
manipulating data within relations (bags, tuples) to achieve the desired output.
Here are some key relational operators in Pig Latin:
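A short illustrative script (the field names and file path are assumptions) touching the most common relational operators:

students   = LOAD '/data/students.csv' USING PigStorage(',')
             AS (id:int, name:chararray, dept:chararray, marks:int);
passed     = FILTER students BY marks >= 40;                  -- FILTER selects rows
by_dept    = GROUP passed BY dept;                            -- GROUP collects rows per key
dept_stats = FOREACH by_dept GENERATE group, COUNT(passed);   -- FOREACH...GENERATE projects/transforms
ordered    = ORDER dept_stats BY $1 DESC;                     -- ORDER sorts a relation
top3       = LIMIT ordered 3;                                 -- LIMIT restricts the number of rows
DUMP top3;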
Relational operators
Eval Functions
Apache Pig supports various types of Eval Functions, such as AVG, CONCAT,
COUNT, COUNT_STAR, and so on, to perform different types of operations.
The following is the list of Eval functions supported by Apache Pig.
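For instance (the relation and field names are illustrative):

grades   = LOAD '/data/grades.csv' USING PigStorage(',') AS (name:chararray, score:int);
grouped  = GROUP grades ALL;
summary  = FOREACH grouped GENERATE COUNT(grades), AVG(grades.score), MAX(grades.score);
labelled = FOREACH grades GENERATE CONCAT(name, '_student'), score;
DUMP summary;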
Eval Functions
Complex Types
• Pig has three complex data types: maps, tuples, and bags. All of
these types can contain data of any type, including other complex
types.

Map
• A map in Pig is a chararray to data element mapping, where that
element can be any Pig type, including a complex type. The chararray
is called a key and is used as an index to find the element, referred
to as the value.

• For example, ['name'#'Jacky', 'age'#55] will create a map with two
keys, "name" and "age". The first value is a chararray, and the
second is an integer.
Complex Types
Tuple
• A tuple is a fixed-length, ordered collection of Pig data elements. Tuples
are divided into fields, with each field containing one data element.
These elements can be of any type; they do not all need to be the same type.

• For example, ('Rose', 55) describes a tuple constant with two fields.
Bag
• A bag is a collection of tuples. It's analogous to a set of records, where
each record (tuple) can have multiple fields of different types.

• Bag constants are constructed using braces, with tuples in the bag
separated by commas.
• For example, {('Peter', 55), ('sally', 52), ('john', 25)} constructs a bag
with three tuples, each with two fields.
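A small sketch (the schema is assumed for illustration) showing how the three complex types are declared and accessed in Pig Latin:

people = LOAD '/data/people.txt'
         AS (name:chararray, info:map[], contact:tuple(phone:chararray, email:chararray), friends:bag{t:(fname:chararray)});
picked = FOREACH people GENERATE name, info#'age', contact.phone, FLATTEN(friends);
DUMP picked;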
Piggy Bank

Since Apache Pig has been written in Java, UDFs written in the Java
language work more efficiently compared to other languages. In
Apache Pig, we also have a Java repository for UDFs named
Piggybank. Using Piggybank, we can access Java UDFs written by
other users and contribute our own UDFs.
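A hedged sketch of using Piggybank (the JAR path is an assumption, and the exact class names available depend on the Piggybank version):

REGISTER '/usr/lib/pig/piggybank.jar';
DEFINE PB_UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
students = LOAD '/data/students.csv' USING PigStorage(',') AS (id:int, name:chararray);
names_up = FOREACH students GENERATE PB_UPPER(name);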
User-defined Functions

In addition to the built-in functions, Apache Pig provides extensive
support for User-Defined Functions (UDFs). Using these UDFs, we
can define our own functions and use them. UDF support is
provided in six programming languages: Java, Jython, Python,
JavaScript, Ruby, and Groovy.

For writing UDFs, complete support is provided in Java and limited
support is provided in all the remaining languages. Using Java, you
can write UDFs involving all parts of the processing, such as data
load/store, column transformation, and aggregation. Since Apache
Pig has been written in Java, UDFs written in the Java language
work more efficiently compared to other languages.
User-defined Functions
Parameter substitution & Diagnostic operator

Parameter substitution in Apache Pig refers to the capability of
replacing specific parts of a Pig Latin script with parameterized
values. This feature allows users to define parameters externally
and use them within their scripts, enhancing script flexibility and
reusability.

The Diagnostic operator in Apache Pig is a tool used for
debugging and inspecting data during script development. It allows
users to print information about tuples or bags at different points
within a Pig Latin script to understand the data flow and identify
issues in the data processing pipeline.
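For example (the parameter name, path, and relation name are illustrative), a script can reference $input_path and be invoked with -param, while DUMP, DESCRIBE, EXPLAIN, and ILLUSTRATE act as diagnostic operators:

-- Invocation with parameter substitution:
--   pig -param input_path=/data/sales/2024 etl.pig
sales = LOAD '$input_path' USING PigStorage(',') AS (id:int, amount:double);

-- Diagnostic operators
DESCRIBE sales;    -- prints the schema of the relation
DUMP sales;        -- prints the contents of the relation
EXPLAIN sales;     -- shows the logical, physical, and MapReduce execution plans
ILLUSTRATE sales;  -- shows a sample walk-through of the data flow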
Word Count Example using Pig
Assume you have a text file named input_text.txt containing some
text.
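Based on the explanation on the next slide, the script might look roughly like this (the relation names are illustrative):

-- Load each line of the input file as a single chararray field
text_data  = LOAD 'input_text.txt' USING TextLoader() AS (line:chararray);

-- Split each line into words and flatten them into one word per record
words      = FOREACH text_data GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together
grouped    = GROUP words BY word;

-- Count the occurrences of each word
word_count = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;

-- Store the result as comma-separated values
STORE word_count INTO 'word_count_output' USING PigStorage(',');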
Word Count Example using Pig
Explanation of the Pig Script:

LOAD: Loads the text file as text_data using the TextLoader() function, treating each
line as a character array (line:chararray).

Tokenize Words: Splits each line into individual words using the TOKENIZE() function
and then flattens them into separate records using FLATTEN().

Grouping: Groups the words together based on their occurrences using the GROUP
BY operation.

Word Count: For each word group, the COUNT() function calculates the number of
occurrences of each word.

STORE: Saves the word count result into the word_count_output directory using
PigStorage(','), where each word and its count are stored as a comma-separated
values (CSV) file.
When to use and not use Pig?
Use Pig When:

Ad-Hoc Data Processing: Pig is excellent for ad-hoc data processing tasks where
you need to quickly write scripts to process and analyze data without getting into the
complexities of writing MapReduce jobs.

Data Transformation: It's suitable for data transformation tasks, especially when
dealing with semi-structured or unstructured data. Pig simplifies complex ETL
processes and data cleaning tasks.

Rapid Prototyping: For rapid prototyping and experimentation with data, Pig's high-
level scripting language allows you to iterate quickly.

Ease of Learning: Pig's scripting language, Pig Latin, is relatively easy to learn and
understand, making it accessible to users without extensive programming experience.
When to use and not use Pig?
NOT Use Pig When:

Real-Time Processing: Pig is not designed for real-time processing or low-latency
requirements. For real-time analytics or processing where immediate responses are
necessary, other tools like Apache Storm or Apache Flink might be better choices.

Highly Customized Operations: If your task involves highly customized operations
that cannot be easily expressed in Pig Latin, or you need fine-grained control over
the data flow, writing custom MapReduce programs or using a programming language
like Java might be more appropriate.
Pig at Yahoo
In 2006, Apache Pig was developed as a research project
at Yahoo, especially to create and execute MapReduce
jobs on every dataset. In 2007, Apache Pig was open
sourced via Apache incubator. In 2008, the first release of
Apache Pig came out. In 2010, Apache Pig graduated as
an Apache top-level project.
Pig vs HIVE
