Unit 4 HIVE - PIG

HIVE

Agenda of Hive
• Introduction: What is HIVE?
• HIVE Architecture
• HIVE data Types
• HIVE File Formats
• HIVE Query Language(HiveQL)
• RCFile implementation
• SerDe,
• User-Defined Functions (UDF)
Introduction: What is HIVE?
• Hive is a data warehouse infrastructure
tool to process structured data in Hadoop.
• It resides on top of Hadoop to summarize
Big Data, and makes querying and
analyzing easy.
• Initially, Hive was developed by Facebook;
later the Apache Software Foundation took it
up and developed it further as an open-source
project under the name Apache Hive.
• It is used by different companies. For
example, Amazon uses it in Amazon
Elastic MapReduce.
Features of HIVE
• It stores the schema in a database and the processed data in
HDFS.
• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL
or HQL.
• It is familiar, fast, scalable, and extensible.
HIVE Architecture
HIVE Architecture
• This component diagram contains different units. The
following describes each unit and its operation:

User Interface: Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).

Metastore: Hive uses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and their HDFS mapping.
HIVE Architecture
HiveQL Process Engine: HiveQL is similar to SQL and is used to query the schema information in the Metastore. It is one of the replacements for the traditional approach of writing MapReduce programs: instead of writing a MapReduce program in Java, we can write a HiveQL query and have it processed as a MapReduce job.

Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results equivalent to MapReduce results, using the MapReduce model.

HDFS or HBASE: The Hadoop Distributed File System (HDFS) or HBase is the data storage layer used to store the data in the file system.
Working of HIVE
• The following diagram depicts the workflow between Hive and Hadoop.
HIVE Data Types
• All the data types in Hive are classified into four types,
given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
HIVE Data Types
Column Types
• Column types are used as the column data types of Hive. They are as follows:

• Integral Types
• Integer type data can be specified using integral data types, such as INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than that of INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
• The following table depicts the various INT data types:
HIVE Data Types
• String Types
• String type data can be specified using single quotes (' ') or double quotes (" "). It contains two data types: VARCHAR and CHAR. Hive follows C-style escape characters.

• The following table depicts the various CHAR data types:
HIVE Data Types
• Timestamp
• It supports the traditional UNIX timestamp with optional nanosecond precision. It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and the format "yyyy-mm-dd hh:mm:ss.ffffffffff".

• Dates
• DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.

• Decimals
• The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary-precision values. The syntax and an example are given below.
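As a hedged illustration (the table and column names are made up for this sketch), the DECIMAL syntax is DECIMAL(precision, scale), and a table using the column types discussed above might be declared as:

-- Hypothetical table illustrating Hive column types
CREATE TABLE employee (
  emp_id     INT,
  emp_name   VARCHAR(50),
  grade      CHAR(1),
  joined_at  TIMESTAMP,
  birth_date DATE,
  salary     DECIMAL(10,2)   -- up to 10 digits, 2 after the decimal point
);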
HIVE Data Types
• Literals

• The following literals are used in Hive:

• Floating Point Types

• Floating-point types are numbers with decimal points. Generally, this type of data is represented by the DOUBLE data type.

• Decimal Type

• Decimal type data is a floating-point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
HIVE Data Types
Null Value

• Missing values are represented by the special value NULL.

Complex Types

• The Hive complex data types are as follows:

• Arrays

• Arrays in Hive are used the same way they are used in Java.

Syntax:
ARRAY<data_type>
HIVE Data Types
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>

Structs
A struct in Hive is a record type that groups a set of named fields, each with its own data type and an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
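A minimal sketch (the table and field names are illustrative, not from the slides) showing the three complex types in a table definition and in a query:

CREATE TABLE customer_profile (
  name    STRING,
  phones  ARRAY<STRING>,
  props   MAP<STRING, STRING>,
  address STRUCT<street : STRING, city : STRING, zip : INT>
);

-- Accessing complex-type columns
SELECT phones[0], props['email'], address.city FROM customer_profile;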
HIVE File Format
• Types of Hadoop File Formats
• Hive tables in HDFS can be created using five different Hadoop file
formats:
• Text files
• Sequence files
• Avro data files
• RCFile format
• Parquet file format

• Let's learn about each Hadoop file format in detail.


HIVE File Format
1. Text files

• The Hive text file format is the default storage format, used to load data
from comma-separated values (CSV), tab-delimited, space-delimited, or
text files delimited by other special characters.

• You can use the text format to interchange data with other client
applications. The text file format is very common for most applications.
Data is stored in lines, with each line being a record.

• Each line is terminated by a newline character (\n). The text file
format storage option is defined by specifying "STORED AS
TEXTFILE" at the end of the table creation.
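For example (the table name and delimiter are illustrative), a text-file table might be created as follows:

CREATE TABLE sales_txt (
  id     INT,
  item   STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;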
HIVE File Format
2. Sequence File
• Sequence files are flat files consisting of binary key-value
pairs.
• When converting queries to MapReduce jobs, Hive chooses to
use the necessary key-value pairs for a given record.
• The key advantage of using a sequence file is that it
merges two or more files into one file.
• The sequence file format storage option is defined by
specifying "STORED AS SEQUENCEFILE" at the end of the
table creation.
HIVE File Format
3. Avro Data Files

(Data serialization is the process of converting complex data structures or objects into a format that can be easily stored, transmitted, or reconstructed later. A Remote Procedure Call (RPC) passes data between different processes or across a network by serializing the data at the sender's end and deserializing it at the receiver's end.)

• Avro is a remote procedure call and data serialization
framework that uses JSON for defining data types and
protocols, and serializes data in a compact binary format
to make it compact and efficient.

• This file format can be used in any of Hadoop's tools,
such as Pig and Hive. Avro is one of the common file formats
in applications based on Hadoop.

• The option to store the data in the Avro file format is
defined by specifying "STORED AS AVRO" at the end of
the table creation.
HIVE File Format
RCFILE FORMAT: (https://www.upsolver.com/blog/the-file-format-fundamentals-of-big-data)
• The Row Columnar (RC) file format is very similar to the
sequence file format.
• It also stores the data as key-value pairs and offers a high
row-level compression rate.
• It is used when there is a requirement to process
multiple rows at a time.
• RCFile format is supported by Hive version 0.6.0 and later.
• The RC file format storage option is defined by specifying
“STORED AS RCFILE” at the end of the table creation.
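A hedged sketch (the table name is illustrative) of creating an RCFile-backed table:

CREATE TABLE sales_rc (
  id     INT,
  item   STRING,
  amount DOUBLE
)
STORED AS RCFILE;

-- Data can then be copied in from an existing table, for example:
-- INSERT OVERWRITE TABLE sales_rc SELECT * FROM sales_txt;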
HIVE File Format
Parquet files :
• Parquet files support complex nested data structures in a flat
format.
• Parquet is broadly accessible. It supports multiple coding
languages, including Java, C++, and Python, to reach a broad
audience. This makes it usable in nearly any big data setting.
• Parquet is also self-describing. It contains metadata that
includes file schema and structure. You can use this to separate
different services for writing, storing, and reading Parquet
files.
• Parquet files are composed of row groups, a header, and a footer.
Within each row group, the values of each column are stored together,
so the same columns are stored contiguously in each row group.
HIVE File Format
HIVE Query Language
• The Hive Query Language (HiveQL) is a SQL-like language used for
querying and managing data in the Apache Hive data warehouse
system. HiveQL provides a familiar interface for users who are
already accustomed to SQL syntax, making it easier to interact with
large-scale datasets stored in Hadoop's distributed file system
(HDFS).
• DDL and DML are the parts of HIVE QL
• Data Definition Language (DDL) is used for creating, altering and
dropping databases, tables, views, functions and indexes.
• Data Manipulation Language (DML) is used to put data into Hive tables,
to extract data to the file system, and to explore and manipulate data
with queries, grouping, filtering, joining, etc., as sketched below.
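As a brief, illustrative sketch (the database, table, and file path are hypothetical), typical DDL and DML statements look like this:

-- DDL: create a database and a table
CREATE DATABASE IF NOT EXISTS retail;
CREATE TABLE retail.orders (order_id INT, product STRING, price DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- DML: load data and query it
LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE retail.orders;
SELECT product, SUM(price) AS total FROM retail.orders GROUP BY product;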
HIVE Query Language
• Key Features of HiveQL:

1. SQL-like Syntax: HiveQL syntax resembles SQL, allowing users familiar with SQL to write queries for data analysis and manipulation.

2. Hive Metastore: HiveQL interacts with the Hive metastore, which stores metadata about tables, partitions, columns, and their corresponding HDFS file locations.
HIVE Query Language
3. Hadoop Ecosystem Integration: HiveQL seamlessly
integrates with various Hadoop ecosystem tools and
technologies, such as MapReduce, HDFS, YARN, etc
4. Table Creation and Manipulation: Users can create tables,
load data into them, alter table structures, and perform other
schema-related operations using HiveQL.
5. Data Querying and Transformation: HiveQL supports
querying, filtering, aggregating, joining, and transforming data
stored in HDFS using Hive's SQL-like syntax.
HIVE Query Language
HIVE Query Language
HIVE Query Language
• HiveQL enables users to perform complex data analysis,
transformations, and querying on large-scale datasets using
SQL-like syntax.

• However, it's important to note that while HiveQL provides
SQL abstraction, the underlying execution often involves
MapReduce or other execution engines, which might have
implications on performance and query optimization
strategies.
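For instance, the execution engine can usually be selected per session through the hive.execution.engine property (which values are available depends on how the Hive installation is configured):

-- Switch the underlying execution engine for the current session
SET hive.execution.engine=tez;   -- or 'mr' for classic MapReduce, 'spark' if configured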
RCFile implementation
RCFile implementation
RCFile implementation
Implementing RCFile (Conceptual Overview):

1. Column-Oriented Data Storage: When implementing RCFile, you need to organize data column-wise. Each column's values are stored together, potentially allowing for more efficient compression and retrieval of specific columns during query processing.

2. Indexing and Metadata: RCFile includes metadata and indexes to facilitate efficient data retrieval. These indexes help in locating the beginning of column chunks, allowing for quicker access to specific rows or columns.
RCFile implementation
Implementing RCFile (Conceptual Overview):

3. Compression Strategies: Implementing RCFile involves choosing and implementing compression algorithms for individual columns based on their data types and characteristics. Compression aims to reduce storage space and enhance I/O performance.

4. Integration with Processing Engines: RCFile integration typically involves working within a framework or system (like Hive) that understands the RCFile format. This involves reading, writing, and processing RCFiles efficiently.
SerDe
• "SerDe" stands for Serializer/Deserializer and refers to a crucial component in
Apache Hive that enables it to interface with various data formats, allowing for the
serialization of data when it's stored in Hive tables and the deserialization when the
data is queried or retrieved.

• Key Functions of SerDe in Hive:

1. Serialization: SerDe helps convert structured data from its internal representation
within Hive into a format suitable for storage in files or databases. This involves
converting data structures into a serialized format that can be written to storage.

2. Deserialization: When Hive reads data from storage (like files in HDFS), SerDe
performs the reverse operation, converting the stored format back into Hive's
internal representation of the data. This allows Hive to interpret and query the
data.
SerDe
3. Format Interpretation: SerDe understands the specifics of
different data formats, including their encoding, data types, delimiters,
and other characteristics. It ensures that data is properly interpreted
and handled according to the format's specifications.
4. Integration with Hive: SerDe integrates closely with Hive's query
engine, allowing Hive to support a wide range of file formats and data
types. SerDe enables Hive to interact with these various formats
seamlessly.
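As an illustrative sketch, a table can be bound to a specific SerDe at creation time; the example below assumes the OpenCSVSerde class shipped with recent Hive versions (the exact class name and properties may vary by distribution):

CREATE TABLE csv_events (event_id STRING, event_ts STRING, payload STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"'
)
STORED AS TEXTFILE;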
SerDe
User-Defined Functions (UDF)
In Hive, User-Defined Functions (UDFs) enable users to extend Hive's functionality by
creating custom functions to perform specific operations that aren't covered by built-in
Hive functions. UDFs allow users to write their own logic in Java, Python, or other
languages and integrate it into Hive queries.

• Types of User-Defined Functions (UDFs) in Hive:

1. UDF (User-Defined Function): These functions take zero or more input parameters
and return a single value. For instance, you might create a UDF to perform custom
string manipulations or mathematical operations.

2. UDTF (User-Defined Table Function): Unlike UDFs, UDTFs can generate multiple
rows and columns as output for a single input row. They are used when the output of a
function needs to be a table-like structure.

3. UDAF (User-Defined Aggregate Function): UDAFs are used to perform custom aggregation operations, such as computing custom aggregates like medians, weighted averages, etc.
User-Defined Functions (UDF)
Steps to Create a User-Defined Function (UDF) in Hive:

• Implement the Function: Write the custom logic for your function in Java,
Python, or another supported language. For Java-based UDFs, you'll typically
extend Hive's UDF or GenericUDF classes and override necessary methods.

• Compile the Code: Compile the code into a JAR file (for Java-based UDFs) or
prepare the script (for Python UDFs).

• Load the Function into Hive: Load the JAR file containing the UDF
implementation into Hive's environment using the ADD JAR command.

• Register the Function: Register the UDF with Hive using the CREATE
FUNCTION command, specifying the function name, class or script path, and
other necessary details.

• Use the Function: Once registered, you can use the UDF in your Hive queries
just like any built-in function.
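A hedged example of the registration and usage steps (the JAR path, class name, and function name are hypothetical):

-- Load the JAR containing the compiled UDF
ADD JAR /user/hive/udfs/my_udfs.jar;

-- Register the UDF under a name usable in queries
CREATE TEMPORARY FUNCTION to_upper_custom AS 'com.example.hive.udf.ToUpperCustom';

-- Use it like any built-in function
SELECT to_upper_custom(emp_name) FROM employee;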
User-Defined Functions (UDF)
PIG
Agenda of Pig
• Introduction: What is Pig?
• The anatomy of Pig
• Pig on Hadoop
• Pig philosophy
• Use Case for Pig: ETL Processing
• Pig Latin overview
• Data types in Pig
• Running Pig, Execution modes of Pig
• HDFS commands
• Relational operators
• Eval functions
• Complex Data Types
• Piggy Bank
• User-defined Functions
• Parameter substitution
• Diagnostic Operator
• Word Count Example using Pig
• When to use and not use Pig?
• Pig at Yahoo
• Pig vs HIVE
What is Apache Pig
• Apache Pig is a high-level data flow platform for executing
MapReduce programs of Hadoop. The language used for Pig is
Pig Latin.

• Pig can handle any type of data, i.e., structured, semi-structured,
or unstructured, and stores the corresponding results in the Hadoop
Distributed File System. Every task that can be achieved using Pig
can also be achieved using Java-based MapReduce programs.
Anatomy of Pig
• The anatomy of Pig involves several key components:

• Pig Latin: This is the scripting language used in Apache Pig. It's a data flow language that describes data transformations such as loading data, processing it, and storing the results.

• Parser: Pig Latin scripts are parsed by the parser, which checks the syntax and translates the scripts into an execution plan.

• Optimizer: Once the script is parsed, an optimizer restructures and optimizes the execution plan for better performance.

• Compiler: The optimized plan is then compiled into a series of MapReduce jobs that can be executed on a Hadoop cluster.
Pig on Hadoop
• Pig runs on Hadoop. It makes use of both the Hadoop Distributed File
System, HDFS, and Hadoop’s processing system, MapReduce.

• HDFS is a distributed filesystem that stores files across all of the nodes in a
Hadoop cluster. It handles breaking the files into large blocks and
distributing them across different machines, including making multiple
copies of each block so that if any one machine fails no data is lost. By
default, Pig reads input files from HDFS, uses HDFS to store intermediate
data between MapReduce jobs, and writes its output to HDFS.

• Pig uses MapReduce to execute all of its data processing. It compiles the Pig
Latin scripts that users write into a series of one or more MapReduce jobs
that it then executes.
Pig philosophy
Pigs eat anything
Pig can operate on data whether it has metadata or not. It can operate on
data that is relational, nested, or unstructured.

Pigs live anywhere


Pig is intended to be a language for parallel data processing. It is not tied
to one particular parallel framework.

Pigs are domestic animals


Pig is designed to be easily controlled and modified by its users.
Use Case for Pig- ETL Processing
Pig is commonly used for ETL (Extract, Transform, Load) processing in big data
scenarios. Here's how it can be applied in an ETL use case:

Scenario: A retail company wants to analyze its sales data, which is stored in
various formats across multiple sources, including CSV files, log files, and a
relational database.

ETL Process with Pig:

1.Extraction (E):
1. Pig can be used to extract data from diverse sources. For instance, it can
load CSV files, parse log files, and connect to a relational database using
Pig's built-in functions or custom loaders.
Use Case for Pig- ETL Processing
2. Transformation (T):
1. Once data is loaded, Pig facilitates transformation tasks. For example:
1.Cleaning data: Removing duplicates, handling missing values, and
standardizing formats.
2.Aggregation: Calculating total sales, average purchase amount, or
other statistical metrics.
3.Joining data: Merging information from different sources based on
common fields.
4.Data enrichment: Adding additional attributes or enriching data based
on business rules.
3. Load (L):
After transformation, Pig allows storing the processed data into various
output formats or systems like HDFS, HBase, relational databases, or even
directly into analytical tools.
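A minimal Pig Latin sketch of such an ETL flow (the file paths, field names, and schema are assumptions made for illustration):

-- Extract: load raw sales records from CSV files in HDFS
raw_sales = LOAD '/data/sales/*.csv' USING PigStorage(',')
            AS (order_id:int, store:chararray, amount:double, dt:chararray);

-- Transform: clean and aggregate
clean_sales  = FILTER raw_sales BY amount IS NOT NULL;
by_store     = GROUP clean_sales BY store;
store_totals = FOREACH by_store GENERATE group AS store, SUM(clean_sales.amount) AS total_sales;

-- Load: store the processed result back into HDFS
STORE store_totals INTO '/output/store_totals' USING PigStorage(',');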
Pig Latin overview
• Pig Latin is the high-level scripting language used in Apache Pig
for expressing data transformation and processing tasks.

• It abstracts the complexities of MapReduce programming and
allows users to write scripts to manipulate and analyze large
datasets on Apache Hadoop.
Data types in Pig
Running Pig & Execution modes of Pig
• You can run Pig locally on your machine, on your Hadoop cluster (grid),
or in the cloud (for example, as part of Amazon's Elastic MapReduce service).

• Apache Pig executes in two modes: Local Mode and MapReduce Mode.

• Local Mode
• In this mode, all the files are installed and run from your local host and
local file system. There is no need of Hadoop or HDFS. This mode is
generally used for testing purpose.

• MapReduce Mode
• MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig.
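For example, the execution mode is typically chosen with the -x flag when starting Pig (the script name is illustrative):

pig -x local wordcount.pig        (runs the script against the local file system; no Hadoop needed)
pig -x mapreduce wordcount.pig    (runs the script on the Hadoop cluster using HDFS)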
HDFS commands
• In Hadoop, HDFS (Hadoop Distributed File System) commands are used to interact with the file
system, manage files and directories, and perform various operations. Here are some fundamental
HDFS commands:
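A few commonly used commands (the paths are illustrative):

hdfs dfs -ls /user/hadoop                          (list files in a directory)
hdfs dfs -mkdir /user/hadoop/input                 (create a directory)
hdfs dfs -put localfile.txt /user/hadoop/input     (copy a local file into HDFS)
hdfs dfs -cat /user/hadoop/input/localfile.txt     (display a file's contents)
hdfs dfs -get /user/hadoop/input/localfile.txt .   (copy a file from HDFS to the local disk)
hdfs dfs -rm /user/hadoop/input/localfile.txt      (delete a file)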
HDFS commands
Relational operators
• In Apache Pig's Pig Latin scripting language, relational operators are used to
perform various transformations and operations on data. These operators help in
manipulating data within relations (bags, tuples) to achieve the desired output.
Here are some key relational operators in Pig Latin:
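A short illustrative script (the field names and file path are assumptions) touching the most common relational operators:

students   = LOAD '/data/students.csv' USING PigStorage(',')
             AS (id:int, name:chararray, dept:chararray, marks:int);
passed     = FILTER students BY marks >= 40;                  -- FILTER selects rows
by_dept    = GROUP passed BY dept;                            -- GROUP collects rows per key
dept_stats = FOREACH by_dept GENERATE group, COUNT(passed);   -- FOREACH...GENERATE projects/transforms
ordered    = ORDER dept_stats BY $1 DESC;                     -- ORDER sorts a relation
top3       = LIMIT ordered 3;                                 -- LIMIT restricts the number of rows
DUMP top3;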
Relational operators
Eval Functions
Apache Pig supports various types of Eval Functions, such as AVG, CONCAT,
COUNT, COUNT_STAR, and so on, to perform different types of operations.
The following is the list of Eval functions supported by Apache Pig.
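For instance (the relation and field names are illustrative):

grades   = LOAD '/data/grades.csv' USING PigStorage(',') AS (name:chararray, score:int);
grouped  = GROUP grades ALL;
summary  = FOREACH grouped GENERATE COUNT(grades), AVG(grades.score), MAX(grades.score);
labelled = FOREACH grades GENERATE CONCAT(name, '_student'), score;
DUMP summary;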
Eval Functions
Complex Types
• Pig has three complex data types: maps, tuples, and bags. All of
these types can contain data of any type, including other complex
types.

Map
• A map in Pig is a chararray to data element mapping, where that
element can be any Pig type, including a complex type. The chararray
is called a key and is used as an index to find the element, referred
to as the value.

• For example, ['name'#'Jacky', 'age'#55] will create a map with two
keys, "name" and "age". The first value is a chararray, and the
second is an integer.
Complex Types
Tuple
• A tuple is a fixed-length, ordered collection of Pig data elements. Tuples
are divided into fields, with each field containing one data element.
These elements can be of any type; they do not all need to be the same type.

• For example, ('Rose', 55) describes a tuple constant with two fields.
Bag
• A bag is a collection of tuples. It's analogous to a set of records, where
each record (tuple) can have multiple fields of different types.

• Bag constants are constructed using braces, with tuples in the bag
separated by commas.
• For example, {('Peter', 55), ('sally', 52), ('john', 25)} constructs a bag
with three tuples, each with two fields.
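A small sketch (the schema is assumed for illustration) showing how the three complex types are declared and accessed in Pig Latin:

people = LOAD '/data/people.txt'
         AS (name:chararray, info:map[], contact:tuple(phone:chararray, email:chararray), friends:bag{t:(fname:chararray)});
picked = FOREACH people GENERATE name, info#'age', contact.phone, FLATTEN(friends);
DUMP picked;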
Piggy Bank

Since Apache Pig has been written in Java, UDFs written in the Java
language work more efficiently compared to other languages. In
Apache Pig, we also have a Java repository for UDFs named
Piggybank. Using Piggybank, we can access Java UDFs written by
other users and contribute our own UDFs.
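A hedged sketch of using Piggybank (the JAR path is an assumption, and the exact class names available depend on the Piggybank version):

REGISTER '/usr/lib/pig/piggybank.jar';
DEFINE PB_UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
students = LOAD '/data/students.csv' USING PigStorage(',') AS (id:int, name:chararray);
names_up = FOREACH students GENERATE PB_UPPER(name);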
User-defined Functions

In addition to the built-in functions, Apache Pig provides extensive
support for User-Defined Functions (UDFs). Using these UDFs, we
can define our own functions and use them. UDF support is
provided in six programming languages: Java, Jython, Python,
JavaScript, Ruby, and Groovy.

For writing UDFs, complete support is provided in Java and limited
support is provided in all the remaining languages. Using Java, you
can write UDFs involving all parts of the processing, such as data
load/store, column transformation, and aggregation. Since Apache
Pig has been written in Java, UDFs written in the Java language
work more efficiently compared to other languages.
User-defined Functions
Parameter substitution & Diagnostic operator

Parameter substitution in Apache Pig refers to the capability of
replacing specific parts of a Pig Latin script with parameterized
values. This feature allows users to define parameters externally
and use them within their scripts, enhancing script flexibility and
reusability.

The Diagnostic operator in Apache Pig is a tool used for
debugging and inspecting data during script development. It allows
users to print information about tuples or bags at different points
within a Pig Latin script to understand the data flow and identify
issues in the data processing pipeline.
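For example (the parameter name, path, and relation name are illustrative), a script can reference $input_path and be invoked with -param, while DUMP, DESCRIBE, EXPLAIN, and ILLUSTRATE act as diagnostic operators:

-- Invocation with parameter substitution:
--   pig -param input_path=/data/sales/2024 etl.pig
sales = LOAD '$input_path' USING PigStorage(',') AS (id:int, amount:double);

-- Diagnostic operators
DESCRIBE sales;    -- prints the schema of the relation
DUMP sales;        -- prints the contents of the relation
EXPLAIN sales;     -- shows the logical, physical, and MapReduce execution plans
ILLUSTRATE sales;  -- shows a sample walk-through of the data flow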
Word Count Example using Pig
Assume you have a text file named input_text.txt containing some
text.
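Based on the explanation on the next slide, the script might look roughly like this (the relation names are illustrative):

-- Load each line of the input file as a single chararray field
text_data  = LOAD 'input_text.txt' USING TextLoader() AS (line:chararray);

-- Split each line into words and flatten them into one word per record
words      = FOREACH text_data GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together
grouped    = GROUP words BY word;

-- Count the occurrences of each word
word_count = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;

-- Store the result as comma-separated values
STORE word_count INTO 'word_count_output' USING PigStorage(',');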
Word Count Example using Pig
Explanation of the Pig Script:

LOAD: Loads the text file as text_data using the TextLoader() function, treating each
line as a character array (line:chararray).

Tokenize Words: Splits each line into individual words using the TOKENIZE() function
and then flattens them into separate records using FLATTEN().

Grouping: Groups the words together based on their occurrences using the GROUP
BY operation.

Word Count: For each word group, the COUNT() function calculates the number of
occurrences of each word.

STORE: Saves the word count result into the word_count_output directory using
PigStorage(','), where each word and its count are stored as a comma-separated
values (CSV) file.
When to use and not use Pig?
Use Pig When:

Ad-Hoc Data Processing: Pig is excellent for ad-hoc data processing tasks where
you need to quickly write scripts to process and analyze data without getting into the
complexities of writing MapReduce jobs.

Data Transformation: It's suitable for data transformation tasks, especially when
dealing with semi-structured or unstructured data. Pig simplifies complex ETL
processes and data cleaning tasks.

Rapid Prototyping: For rapid prototyping and experimentation with data, Pig's high-
level scripting language allows you to iterate quickly.

Ease of Learning: Pig's scripting language, Pig Latin, is relatively easy to learn and
understand, making it accessible to users without extensive programming experience.
When to use and not use Pig?
NOT Use Pig When:

Real-Time Processing: Pig is not designed for real-time processing or low-latency
requirements. For real-time analytics or processing where immediate responses are
necessary, other tools like Apache Storm or Apache Flink might be better choices.

Highly Customized Operations: If your task involves highly customized operations
that cannot be easily expressed in Pig Latin, or you need fine-grained control over
the data flow, writing custom MapReduce programs or using a programming language
like Java might be more appropriate.
Pig at Yahoo
In 2006, Apache Pig was developed as a research project
at Yahoo, especially to create and execute MapReduce
jobs on every dataset. In 2007, Apache Pig was open
sourced via Apache incubator. In 2008, the first release of
Apache Pig came out. In 2010, Apache Pig graduated as
an Apache top-level project.
Pig vs HIVE
