Big Data Unit-5

The document provides an overview of three big data tools: Pig, Hive, and HBase. Pig is a high-level platform for processing large datasets using Pig Latin scripts, while Hive is used for managing structured data with a SQL-like language called HiveQL. HBase is a non-relational database designed for real-time updates and handling sparse datasets, making it suitable for high-traffic applications.

Uploaded by

guptaraman600

UNIT 5

Application of Big Data using :

1. Pig :
Pig is a high-level platform or tool which is used to process large datasets.
It provides a high level of abstraction letting you write simple data analysis
code.
It provides a high-level scripting language, known as Pig Latin which is used to
develop the data analysis codes.
Applications :
1. Used for analyzing large datasets by writing Pig Latin scripts
2. Common in web companies for tracking user behavior or errors
3. Removing duplicates, nulls, formatting inconsistencies.
4. Useful for summarizing big data (e.g., total sales per region)
5. Analyzing user interactions, hashtags, or trending topics.
6. Pig simplifies ETL (Extract, Transform, Load) operations.

2. Hive :
Hive is often used to store and manage structured data in data warehouses built on Hadoop.
It makes querying and analyzing easy.
It allows querying and managing large datasets using a SQL-like language called HiveQL.
It translates HiveQL into MapReduce, Tez, or Spark jobs under the hood.
It supports structured and semi-structured data.
It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Applications :

1. Ease of use
2. Streamlined security
3. Low overhead
4. Ideal for batch processing and aggregated data analysis.
5. Performs tasks such as count, max, min, and avg over large datasets, and generates
summary statistics for decision making.
6. BI tools like Tableau, Power BI, or QlikView can connect to Hive for visualization
and reporting.
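
The summary-statistics use case in item 5 can be sketched with a small runnable example. Hive is not assumed to be available here, so Python's built-in sqlite3 module stands in for it; the table name sales and its columns are hypothetical. The query uses only standard aggregate syntax (COUNT, MAX, MIN, AVG with GROUP BY), which HiveQL also accepts.

```python
import sqlite3

# Stand-in for a Hive table: an in-memory SQLite database.
# The table name `sales` and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 300.0), ("south", 50.0)],
)

# Standard aggregate query; the same statement is valid HiveQL.
QUERY = """
SELECT region, COUNT(*), MAX(amount), MIN(amount), AVG(amount)
FROM sales
GROUP BY region
ORDER BY region
"""

summary = conn.execute(QUERY).fetchall()
for row in summary:
    print(row)
```

On a real cluster the same statement would be compiled into batch jobs, which is why Hive suits aggregated reporting rather than low-latency lookups.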

3. HBase :
HBase is a column-oriented non-relational database management system that runs on
top of the Hadoop Distributed File System (HDFS).
HBase provides a fault-tolerant way of storing sparse data sets, which are common in
many big data use cases.
HBase supports writing applications using Apache Avro, REST, and Thrift.
Application :

1. Used when you need real-time updates, unlike Hive, which is batch-oriented.
2. Perfect for storing sensor data, logs, or metrics with timestamps.
3. Stores user profiles, posts, likes, shares, and comments.
4. Handles fast reads/writes for high-traffic platforms like Facebook or Twitter.
5. Stores chat history, delivery receipts, and notification logs.
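
The sparse, timestamped storage described above can be pictured as a nested map: row key, then column (family:qualifier), then timestamp, then value. The following is a minimal in-memory sketch of that idea only. It does not use the HBase client API, and the class and method names (SparseTable, put, get) are hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of HBase-style storage: row -> column -> timestamp -> value."""

    def __init__(self):
        # Only cells that are actually written take up space (sparse storage).
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        # A new timestamp adds a version instead of overwriting the old one.
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column):
        # Return the most recent version of the cell, or None if it is absent.
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]


table = SparseTable()
table.put("sensor-1", "metrics:temp", 21.5, timestamp=1000)
table.put("sensor-1", "metrics:temp", 22.0, timestamp=2000)
table.put("sensor-2", "metrics:humidity", 40, timestamp=1500)

print(table.get("sensor-1", "metrics:temp"))  # latest version of the cell
print(table.get("sensor-2", "metrics:temp"))  # cell was never written
```

Each row stores only the columns it actually has, which is what makes this layout a good fit for sensor readings and logs where most cells would otherwise be empty.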
PIG
Introduction to PIG :
• Pig is a high-level platform for processing large datasets. It provides a high
level of abstraction over MapReduce: you do not write the low-level execution
logic yourself. Instead, you write short, SQL-like statements and Pig
translates them into parallel jobs.
• It provides a high-level scripting language, Pig Latin, which is used to
write the data analysis code.
• Pig Latin and the Pig Engine are the two main components of Apache Pig.
When Pig runs on a Hadoop cluster, its results are stored in HDFS.
• One limitation of MapReduce is its long development cycle: writing the
mapper and reducer, compiling and packaging the code, submitting the job,
and retrieving the output are all time-consuming.
• Apache Pig shortens development time through its multi-query approach.
• Pig is helpful for programmers who do not have a Java background. Roughly
200 lines of Java code can often be expressed in about 10 lines of Pig Latin.
• Programmers who know SQL need less effort to learn Pig Latin.

Execution Modes of Pig :

Apache Pig scripts can be executed in three ways :
Interactive Mode (Grunt shell) :
You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you
can enter the Pig Latin statements and get the output (using the Dump operator).
Batch Mode (Script) :
You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file
with the .pig extension.

Embedded Mode (UDF) :

Apache Pig lets us define our own functions (User Defined Functions) in
programming languages such as Java and use them in our scripts.

Comparison of Pig with Databases :

Pig | SQL
Pig Latin is a procedural language. | SQL is a declarative language.
Works well with semi-structured and unstructured data. | Supports strictly structured data.
The schema is optional. Data can be stored without designing a schema (fields are then referenced by position as $0, $1, etc.). | Schema is mandatory in SQL.
The data model is nested relational. | The data model is flat relational.
Provides limited opportunity for query optimization. | There is more opportunity for query optimization.
Not suitable for real-time querying. | Designed for real-time querying.

Grunt :

• The Grunt shell is the interactive command-line interface of Apache Pig.
• It is the native shell provided by Apache Pig and is mainly used to write and
run Pig Latin statements and scripts.
• When you start Pig (by running the pig command in a terminal), it opens a shell.
That shell is called the Grunt shell.

The prompt you see is:

grunt>

• This is where you can type commands like LOAD, DUMP, DESCRIBE,
ILLUSTRATE, etc.
• You can also run shell commands with sh and file system commands with fs.

Syntax of sh command :
grunt> sh ls

Syntax of fs command :
grunt> fs -ls

Pig Latin :
Pig Latin is a data flow language used by Apache Pig to analyze data in Hadoop.
It is a textual language that abstracts the programming from the Java MapReduce
idiom into a higher-level notation.
Pig Latin statements are used to process the data. Each statement is an operator
that accepts a relation as input and generates another relation as output.
· A statement can span multiple lines.
· Each statement must end with a semicolon.
· A statement may include expressions and schemas.
· By default, statements are processed using multi-query execution.

User-Defined Functions :
• Apache Pig provides extensive support for User Defined Functions (UDFs).
• Using UDFs, we can define our own functions and use them. UDF support is
provided in six programming languages:
· Java
· Jython
· Python
· JavaScript
· Ruby
· Groovy
• Complete support for writing UDFs is provided in Java; the remaining
languages have limited support.
• Using Java, you can write UDFs for all parts of the processing, such as
data load/store, column transformation, and aggregation.
• Since Apache Pig is written in Java, UDFs written in Java run more
efficiently than those written in other languages.
Types of UDFs in Java :
Filter Functions :
• Filter functions are used as conditions in FILTER statements.
• These functions accept a Pig value as input and return a Boolean value.

Eval Functions :
• Eval functions are used in FOREACH-GENERATE statements.
• These functions accept a Pig value as input and return a Pig result.

Algebraic Functions :
• Algebraic functions act on inner bags in a FOREACH-GENERATE statement.
• These functions are used to perform full MapReduce operations on an inner bag.
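
To make the three categories concrete, here is a plain-Python sketch of what each kind of function computes. It does not use Pig's Java UDF base classes; the function names are hypothetical, and bags are modeled as Python lists of tuples. The algebraic example splits a count into initial, intermediate, and final steps, which is what allows such a function to run partly on the map side and partly on the reduce side.

```python
def is_adult(age):
    """Filter function: takes a value, returns a Boolean for FILTER."""
    return age is not None and age >= 18


def upper_name(name):
    """Eval function: takes a value, returns a new value for FOREACH-GENERATE."""
    return None if name is None else name.upper()


# Algebraic function (a count), split into three stages.
def count_initial(record):
    # Called per input tuple: each tuple contributes 1.
    return 1


def count_intermediate(partial_counts):
    # Combines partial results; may be applied more than once.
    return sum(partial_counts)


def count_final(partial_counts):
    # Produces the final result from the combined partials.
    return sum(partial_counts)


people = [("ann", 30), ("bob", 12), ("cy", 45)]

adults = [p for p in people if is_adult(p[1])]
names = [upper_name(p[0]) for p in adults]

# Simulate two map-side chunks feeding one reduce-side final step.
chunks = [people[:2], people[2:]]
partials = [count_intermediate([count_initial(t) for t in c]) for c in chunks]
total = count_final(partials)

print(adults, names, total)
```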

Data Processing Operators :

Apache Pig operators belong to Pig Latin, a high-level procedural language for querying
large datasets using Hadoop and the MapReduce platform.
A Pig Latin statement is an operator that takes a relation as input and produces another
relation as output.
These operators are the main tools Pig Latin provides to operate on the data. They
allow you to transform it by sorting, grouping, joining, projecting, and filtering. The
Apache Pig operators can be classified as :
Relational Operators :
Some of the Relational Operators are :
LOAD: The LOAD operator is used to load data from the file system or HDFS
storage into a Pig relation.
FOREACH: This operator generates data transformations based on columns of data. It
is used to add or remove fields from a relation.
FILTER: This operator selects tuples from a relation based on a condition.
JOIN: The JOIN operator performs an inner equijoin of two or more relations
based on common field values.
ORDER BY: ORDER BY is used to sort a relation based on one or more fields in either
ascending or descending order using the ASC and DESC keywords.
GROUP: The GROUP operator groups together the tuples with the same group key
(key field).
COGROUP: COGROUP works the same way as the GROUP operator. For readability,
programmers usually use GROUP when only one relation is involved and COGROUP
when multiple relations are involved.
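
As a rough illustration of how these operators chain together, the sketch below mimics LOAD, FILTER, GROUP, FOREACH, and ORDER BY on in-memory Python data. It simulates the semantics only and is not Pig itself; the sample records and field names are made up. Note how GROUP yields one entry per key holding a bag of the original tuples.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input; in Pig this would come from a LOAD statement.
RAW = "north,100\nsouth,50\nnorth,300\nsouth,-1\n"

# LOAD: parse each line into a (region, amount) tuple.
sales = [(region, int(amount)) for region, amount in csv.reader(io.StringIO(RAW))]

# FILTER: keep tuples that satisfy a condition.
valid = [t for t in sales if t[1] >= 0]

# GROUP: one entry per key, holding a bag (list) of the matching tuples.
grouped = defaultdict(list)
for t in valid:
    grouped[t[0]].append(t)

# FOREACH ... GENERATE: project each group to (key, SUM(amount)).
totals = [(key, sum(amount for _, amount in bag)) for key, bag in grouped.items()]

# ORDER BY total DESC.
ordered = sorted(totals, key=lambda t: t[1], reverse=True)

print(ordered)
```

Each step takes a relation (here, a list of tuples) and produces another one, which matches the statement model described above.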
Diagnostic Operator :
The LOAD statement simply loads the data into the specified relation in Apache Pig.
To verify the execution of the LOAD statement, you have to use the diagnostic
operators.
Some Diagnostic Operators are :
DUMP: The DUMP operator is used to run Pig Latin statements and display the
results on the screen.
DESCRIBE: Use the DESCRIBE operator to review the schema of a particular
relation. It is useful when debugging a script.
ILLUSTRATE: This operator is used to review how data is transformed through a
sequence of Pig Latin statements. It is especially helpful when debugging a script.
EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and
MapReduce execution plans of a relation.

Hive
Apache Hive Architecture :
The above figure shows the architecture of Apache Hive and its major components.
The major components of Apache Hive are :
1. Hive Client
2. Hive Services
3. Processing and Resource Management
4. Distributed Storage
HIVE CLIENT :
Hive supports applications written in languages such as Python, Java, C++, and Ruby,
using JDBC, ODBC, and Thrift drivers to run queries on Hive. Hence, one can write a
Hive client application in a language of one's choice.
Hive clients are categorized into three types :
1. Thrift clients : The Hive server is based on Apache Thrift, so it can serve
requests from Thrift clients.
2. JDBC client : Hive allows Java applications to connect to it using the JDBC
driver. The JDBC driver uses Thrift to communicate with the Hive server.
3. ODBC client : The Hive ODBC driver allows applications based on the ODBC
protocol to connect to Hive. Similar to the JDBC driver, the ODBC driver uses Thrift
to communicate with the Hive server.
HIVE SERVICE :
To perform all queries, Hive provides various services such as HiveServer2, Beeline,
etc.
The various services offered by Hive are :
1. Beeline
2. Hive Server 2
3. Hive Driver
4. Hive Compiler
5. Optimizer
6. Metastore

PROCESSING AND RESOURCE MANAGEMENT :

Hive internally uses the MapReduce framework as its de facto engine for executing
queries.
MapReduce is a software framework for writing applications that process massive
amounts of data in parallel on large clusters of commodity hardware.
A MapReduce job works by splitting the data into chunks, which are processed by map
tasks; the map output is then processed by reduce tasks.
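
The split, map, and reduce flow can be sketched in a few lines of single-process Python. This is a toy word count for illustration, not Hadoop code; the function names and the two-chunk input are assumptions.

```python
from collections import defaultdict


def map_phase(chunk):
    # Emit a (word, 1) pair for every word in this chunk of input.
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    # Group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Combine the values for each key into a single result.
    return {key: sum(values) for key, values in groups.items()}


# Input split into chunks; on a cluster each chunk would go to a separate map task.
chunks = [["big data", "pig hive"], ["hive hbase", "big hive"]]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))

print(counts)
```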

DISTRIBUTED STORAGE :
Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File
System for the distributed storage.

Hive Shell :
• The Hive shell is a primary way to interact with Hive.
• It is a default service in Hive.
• It is also called the CLI (command line interface).
• The Hive shell is similar to the MySQL shell.
• Hive users can run HQL queries in the Hive shell.
• In the Hive shell, the up and down arrow keys scroll through previous
commands.
• HiveQL is case-insensitive (except for string comparisons).
• The Tab key autocompletes Hive keywords and functions (it offers suggestions
as you type).
Hive Shell can run in two modes :
Non-Interactive mode :
In non-interactive mode, the Hive shell runs the statements in a script file instead of
opening a prompt. This mode is selected with the -f option.
Example:
$ hive -f script.q
where script.q is a file.
Interactive mode :
Hive runs in interactive mode when you type the command "hive" directly in the
terminal.
Example:
$ hive
hive> show databases;

Hive Services :
The following are the services provided by Hive :
• Hive CLI (Beeline): The Hive CLI (Command Line Interface) is a shell where we
can execute Hive queries and commands.

• Hive Web User Interface: The Hive Web UI is an alternative to the Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.

• Hive metastore: It is a central repository that stores the structure information of
the tables and partitions in the warehouse. It also includes metadata about columns and
their types, the serializers and deserializers used to read and write data, and the
corresponding HDFS files where the data is stored.

• Hive Server: It is referred to as the Apache Thrift Server. It accepts requests from
different clients and passes them to the Hive Driver.

• Hive Driver: It receives queries from different sources such as the web UI, CLI,
Thrift, and JDBC/ODBC drivers. It transfers the queries to the compiler.

• Hive Compiler: The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts HiveQL
statements into MapReduce jobs.
