Unit 5-1
UNIT V – FRAMEWORKS
QUESTION BANK
PART A (2 Marks each)
1. What is Pig?
2. What are the advantages of Pig over MapReduce?
3. What are the disadvantages of Pig over MapReduce?
4. List out the applications on Big data using Pig.
5. What is Hive?
6. How is data queried in Hive?
NOTES
WHAT IS PIG?
▪ It is a high-level platform or tool which is used to process large datasets.
▪ It provides a high level of abstraction for processing over MapReduce.
o With MapReduce, working out how to fit data processing into the
MapReduce pattern, which often requires multiple MapReduce stages, can
be a challenge.
▪ The data structures in Pig are much richer, typically being multivalued and
nested.
▪ The set of transformations you can apply to the data is much more powerful;
it includes joins, for example.
▪ Pig is made up of two pieces:
• A high-level scripting language, known as Pig Latin, which is used to
develop data analysis programs.
• The execution environment to run Pig Latin programs.
▪ There are currently two environments:
o Local execution in a single JVM
o Distributed execution on a Hadoop cluster.
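▪ As an illustration, here is a minimal Pig Latin sketch (the input file and
its schema are assumptions for this example); it loads a tab-delimited
file, groups the records by year, and computes a maximum per group:

    -- minimal Pig Latin sketch; file name and schema are illustrative
    records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int);
    grouped = GROUP records BY year;
    max_temp = FOREACH grouped GENERATE group, MAX(records.temperature);
    DUMP max_temp;

▪ The same script runs in either environment: pig -x local runs it in a
single JVM, while pig -x mapreduce (the default) runs it on a Hadoop
cluster.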
HIVE
▪ One of the biggest ingredients in the Information Platform built by Jeff
Hammerbacher’s team at Facebook was Hive, a framework for data warehousing
on top of Hadoop.
▪ Hive grew from a need to manage and learn from the huge volumes of data.
▪ After trying a few different systems, the team chose Hadoop for storage and
processing, since it was cost-effective and met their scalability needs.
▪ Hive was created to make it possible for analysts with strong SQL skills (but meager
Java programming skills) to run queries on the huge volumes of data that Facebook
stored in HDFS.
▪ Today, Hive is a successful Apache project used by many organizations as a general-
purpose, scalable data processing platform.
HIVE SERVICES
▪ You can specify the service to run using the --service option.
▪ Type hive --service help to get a list of available service names; the most useful are
described below.
1) cli
• The command line interface to Hive. This is the default service.
2) hiveserver
• Runs Hive as a server, enabling access from a range of clients written in
different languages. Applications using the JDBC and ODBC connectors need
to run a Hive server to communicate with Hive.
3) hwi
• The Hive Web Interface.
4) jar
• The Hive equivalent to hadoop jar, a convenient way to run Java applications
that include both Hadoop and Hive classes on the classpath.
5) metastore
• By default, the metastore is run in the same process as the Hive service.
Using this service, it is possible to run the metastore as a standalone
(remote) process.
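▪ For example, every service is started through the same launcher (a usage
sketch; only the service argument changes):

    $ hive --service help         # list the available service names
    $ hive                        # the cli service is the default
    $ hive --service hiveserver   # run Hive as a server for JDBC/ODBC clients
    $ hive --service metastore    # run the metastore as a standalone process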
HIVE ARCHITECTURE
▪ If you run Hive as a server (hive --service hiveserver), then there are a number of
different mechanisms for connecting to it from applications.
▪ The relationship between Hive clients and Hive services is illustrated in the following
figure.
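▪ As a sketch of one such mechanism, the Java snippet below connects over
JDBC. It assumes a HiveServer2 endpoint on localhost at the default port
10000 (the older hiveserver service uses a different driver and URL
scheme), and the table name records is illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class HiveJdbcClient {
        public static void main(String[] args) throws SQLException {
            // URL assumes HiveServer2 listening on localhost:10000
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM records LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }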
HIVE QL
▪ Hive’s SQL dialect is called HiveQL.
▪ The table below provides a high-level comparison of SQL and HiveQL.
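▪ To give a feel for the dialect, here is a small HiveQL sketch (table,
column, and file names are illustrative):

    -- define a table over tab-delimited text, load data, and query it
    CREATE TABLE records (year STRING, temperature INT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    LOAD DATA LOCAL INPATH 'input/sample.txt' INTO TABLE records;
    SELECT year, MAX(temperature) FROM records GROUP BY year;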
MAPREDUCE SCRIPTS
▪ Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and REDUCE
clauses make it possible to invoke an external script or program from Hive.
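▪ For example, a query can stream its rows through a user-supplied script
(is_good_quality.py here is an assumed filter that keeps only good-quality
readings; the table and columns are illustrative):

    ADD FILE is_good_quality.py;   -- ship the script with the job
    SELECT TRANSFORM(year, temperature, quality)
    USING 'is_good_quality.py'
    AS year, temperature
    FROM records;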
JOINS
▪ The simplest kind of join is the inner join, where each match in the input tables
results in a row in the output.
▪ Outer joins allow you to find nonmatches in the tables being joined.
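▪ A brief HiveQL sketch of both cases, on two assumed tables sales(id, name)
and things(id, name):

    -- inner join: only ids present in both tables produce output rows
    SELECT sales.*, things.*
    FROM sales JOIN things ON (sales.id = things.id);

    -- left outer join: every sales row appears, with NULLs for nonmatches
    SELECT sales.*, things.*
    FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);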
FUNDAMENTALS OF HBASE
▪ HBase is a distributed column-oriented database built on top of HDFS.
▪ HBase is the Hadoop application to use when you require real-time read/write
random-access to very large datasets.
▪ HBase comes at the scaling problem from the opposite direction to a
traditional RDBMS: it is built from the ground up to scale linearly just by
adding nodes.
▪ HBase is not relational and does not support SQL, but given the proper problem
space, it is able to do what an RDBMS cannot: host very large, sparsely populated
tables on clusters made from commodity hardware.
▪ Production users of HBase include Adobe, StumbleUpon, Twitter, and groups at
Yahoo!.
▪ The below figure shows HBase cluster members.
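▪ To make the random read/write path concrete, here is a minimal Java client
sketch using the HBase client API; it assumes an existing table named test
with a column family named data:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWrite {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("test"))) {
                // random write: one cell, addressed by row key and family:qualifier
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("col1"),
                        Bytes.toBytes("value1"));
                table.put(put);
                // random read of the same cell
                Get get = new Get(Bytes.toBytes("row1"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("data"),
                                Bytes.toBytes("col1"))));
            }
        }
    }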
CHARACTERISTICS OF HBASE
1) No real indexes: Rows are stored sequentially, as are the columns within each row.
Therefore there is no index bloat, and insert performance is independent of table size.
2) Automatic partitioning: As your tables grow, they will automatically be split into
regions and distributed across all available nodes.
3) Scale linearly and automatically with new nodes: Add a node, point it to the existing
cluster, and run the region server. Regions will automatically rebalance and load will
spread evenly.
4) Commodity hardware: Clusters can be built on low-cost commodity machines.
RDBMSs are I/O hungry, requiring more costly hardware.
5) Fault tolerance: No need to worry about individual node downtime.
6) Batch processing: MapReduce integration allows fully parallel, distributed jobs
against your data with locality awareness.
FUNDAMENTALS OF ZOOKEEPER
▪ ZooKeeper is Hadoop’s distributed coordination service, used for building
general distributed applications.
▪ ZooKeeper can’t make partial failures go away.
▪ ZooKeeper gives you a set of tools to build distributed applications that can
safely handle partial failures.
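▪ A minimal sketch of one such tool in Java (the connection string and group
name are assumptions); it connects, waits for the session to be
established, and creates a persistent znode:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class CreateGroup implements Watcher {
        private static final int SESSION_TIMEOUT = 5000;
        private ZooKeeper zk;
        private final CountDownLatch connectedSignal = new CountDownLatch(1);

        public void connect(String hosts) throws Exception {
            zk = new ZooKeeper(hosts, SESSION_TIMEOUT, this);
            connectedSignal.await(); // block until the session is up
        }

        @Override
        public void process(WatchedEvent event) { // Watcher callback
            if (event.getState() == Event.KeeperState.SyncConnected) {
                connectedSignal.countDown();
            }
        }

        public void create(String groupName) throws Exception {
            // znodes form a filesystem-like hierarchy rooted at "/"
            zk.create("/" + groupName, null /* no data */,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        public static void main(String[] args) throws Exception {
            CreateGroup app = new CreateGroup();
            app.connect("localhost:2181"); // assumed ZooKeeper address
            app.create("zoo");             // assumed group name
            app.zk.close();
        }
    }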