Unit 5-1
UNIT V – FRAMEWORKS
QUESTION BANK
PART A (2 Marks each)
1. What is Pig?
2. What are the advantages of Pig over MapReduce?
3. What are the disadvantages of Pig over MapReduce?
4. List out the applications on Big data using Pig.
5. What is Hive?
6. How is data queried in Hive?
NOTES
WHAT IS PIG?
▪ It is a high-level platform or tool which is used to process large datasets.
▪ It provides a high level of abstraction for processing over MapReduce.
o With MapReduce, working out how to fit data processing into the
MapReduce pattern, which often requires multiple MapReduce stages, can
be a challenge.
▪ The data structures in Pig are much richer, typically being multivalued and
nested.
▪ The set of transformations you can apply to the data is much more powerful;
it includes joins, for example.
▪ Pig is made up of two pieces:
• A high-level scripting language, known as Pig Latin, which is used to
develop data analysis programs.
• The execution environment to run Pig Latin programs.
▪ There are currently two environments:
o Local execution in a single JVM
o Distributed execution on a Hadoop cluster.
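▪ As an illustration, here is a minimal Pig Latin sketch (the input file and
its schema are assumptions for this example); it loads a tab-delimited
file, groups the records by year, and computes a maximum per group:

    -- minimal Pig Latin sketch; file name and schema are illustrative
    records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int);
    grouped = GROUP records BY year;
    max_temp = FOREACH grouped GENERATE group, MAX(records.temperature);
    DUMP max_temp;

▪ The same script runs in either environment: pig -x local runs it in a
single JVM, while pig -x mapreduce (the default) runs it on a Hadoop
cluster.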
HIVE
▪ One of the biggest ingredients in the Information Platform built by Jeff
Hammerbacher’s team at Facebook was Hive, a framework for data warehousing
on top of Hadoop.
▪ Hive grew from a need to manage and learn from the huge volumes of data.
▪ After trying a few different systems, the team chose Hadoop for storage and
processing, since it was cost-effective and met their scalability needs.
▪ Hive was created to make it possible for analysts with strong SQL skills (but meager
Java programming skills) to run queries on the huge volumes of data that Facebook
stored in HDFS.
▪ Today, Hive is a successful Apache project used by many organizations as a general-
purpose, scalable data processing platform.
HIVE SERVICES
▪ You can specify the service to run using the --service option.
▪ Type hive --service help to get a list of available service names; the most useful are
described below.
1) cli
• The command line interface to Hive. This is the default service.
2) hiveserver
• Runs Hive as a server, enabling access from a range of clients written in
different languages. Applications using the JDBC and ODBC connectors need
to run a Hive server to communicate with Hive.
3) hwi
• The Hive Web Interface.
4) jar
• The Hive equivalent to hadoop jar, a convenient way to run Java applications
that include both Hadoop and Hive classes on the classpath.
5) metastore
• By default, the metastore is run in the same process as the Hive service.
Using this service, it is possible to run the metastore as a standalone
(remote) process.
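▪ For example, every service is started through the same launcher (a usage
sketch; only the service argument changes):

    $ hive --service help         # list the available service names
    $ hive                        # the cli service is the default
    $ hive --service hiveserver   # run Hive as a server for JDBC/ODBC clients
    $ hive --service metastore    # run the metastore as a standalone process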
HIVE ARCHITECTURE
▪ If you run Hive as a server (hive --service hiveserver), then there are a number of
different mechanisms for connecting to it from applications.
▪ The relationship between Hive clients and Hive services is illustrated in the following
figure.
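▪ As a sketch of one such mechanism, the Java snippet below connects over
JDBC. It assumes a HiveServer2 endpoint on localhost at the default port
10000 (the older hiveserver service uses a different driver and URL
scheme), and the table name records is illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class HiveJdbcClient {
        public static void main(String[] args) throws SQLException {
            // URL assumes HiveServer2 listening on localhost:10000
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM records LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }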
HIVE QL
▪ Hive’s SQL dialect is called HiveQL.
▪ The table below provides a high-level comparison of SQL and HiveQL.
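▪ To give a feel for the dialect, here is a small HiveQL sketch (table,
column, and file names are illustrative):

    -- define a table over tab-delimited text, load data, and query it
    CREATE TABLE records (year STRING, temperature INT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    LOAD DATA LOCAL INPATH 'input/sample.txt' INTO TABLE records;
    SELECT year, MAX(temperature) FROM records GROUP BY year;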
MAPREDUCE SCRIPTS
▪ Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and REDUCE
clauses make it possible to invoke an external script or program from Hive.
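▪ For example, a query can stream its rows through a user-supplied script
(is_good_quality.py here is an assumed filter that keeps only good-quality
readings; the table and columns are illustrative):

    ADD FILE is_good_quality.py;   -- ship the script with the job
    SELECT TRANSFORM(year, temperature, quality)
    USING 'is_good_quality.py'
    AS year, temperature
    FROM records;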
JOINS
▪ The simplest kind of join is the inner join, where each match in the input tables
results in a row in the output.
▪ Outer joins allow you to find nonmatches in the tables being joined.
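▪ A brief HiveQL sketch of both cases, on two assumed tables sales(id, name)
and things(id, name):

    -- inner join: only ids present in both tables produce output rows
    SELECT sales.*, things.*
    FROM sales JOIN things ON (sales.id = things.id);

    -- left outer join: every sales row appears, with NULLs for nonmatches
    SELECT sales.*, things.*
    FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);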
FUNDAMENTALS OF HBASE
▪ HBase is a distributed column-oriented database built on top of HDFS.
▪ HBase is the Hadoop application to use when you require real-time read/write
random-access to very large datasets.
▪ HBase comes at the scaling problem from the opposite direction to a
traditional RDBMS: it is built from the ground up to scale linearly just by
adding nodes.
▪ HBase is not relational and does not support SQL, but given the proper problem
space, it is able to do what an RDBMS cannot: host very large, sparsely populated
tables on clusters made from commodity hardware.
▪ Production users of HBase include Adobe, StumbleUpon, Twitter, and groups at
Yahoo!.
▪ The below figure shows HBase cluster members.
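▪ To make the random read/write path concrete, here is a minimal Java client
sketch using the HBase client API; it assumes an existing table named test
with a column family named data:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWrite {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("test"))) {
                // random write: one cell, addressed by row key and family:qualifier
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("col1"),
                        Bytes.toBytes("value1"));
                table.put(put);
                // random read of the same cell
                Get get = new Get(Bytes.toBytes("row1"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("data"),
                                Bytes.toBytes("col1"))));
            }
        }
    }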
CHARACTERISTICS OF HBASE
1) No real indexes: Rows are stored sequentially, as are the columns within each row.
Therefore there is no index bloat, and insert performance is independent of table size.
2) Automatic partitioning: As your tables grow, they will automatically be split into
regions and distributed across all available nodes.
3) Scale linearly and automatically with new nodes: Add a node, point it to the existing
cluster, and run the region server. Regions will automatically rebalance and load will
spread evenly.
4) Commodity hardware: Clusters can be built on low-cost commodity machines.
RDBMSs are I/O hungry, requiring more costly hardware.
5) Fault tolerance: No need to worry about individual node downtime.
6) Batch processing: MapReduce integration allows fully parallel, distributed jobs
against your data with locality awareness.
FUNDAMENTALS OF ZOOKEEPER
▪ ZooKeeper is Hadoop’s distributed coordination service, used for building
general distributed applications.
▪ ZooKeeper can’t make partial failures go away.
▪ ZooKeeper gives you a set of tools to build distributed applications that can
safely handle partial failures.
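▪ A minimal sketch of one such tool in Java (the connection string and group
name are assumptions); it connects, waits for the session to be
established, and creates a persistent znode:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class CreateGroup implements Watcher {
        private static final int SESSION_TIMEOUT = 5000;
        private ZooKeeper zk;
        private final CountDownLatch connectedSignal = new CountDownLatch(1);

        public void connect(String hosts) throws Exception {
            zk = new ZooKeeper(hosts, SESSION_TIMEOUT, this);
            connectedSignal.await(); // block until the session is up
        }

        @Override
        public void process(WatchedEvent event) { // Watcher callback
            if (event.getState() == Event.KeeperState.SyncConnected) {
                connectedSignal.countDown();
            }
        }

        public void create(String groupName) throws Exception {
            // znodes form a filesystem-like hierarchy rooted at "/"
            zk.create("/" + groupName, null /* no data */,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        public static void main(String[] args) throws Exception {
            CreateGroup app = new CreateGroup();
            app.connect("localhost:2181"); // assumed ZooKeeper address
            app.create("zoo");             // assumed group name
            app.zk.close();
        }
    }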