
Big Data Unit 5 - Big Data Notes of Unit 5
Big Data (Dr. A.P.J. Abdul Kalam Technical University)

These notes provide an overview of Apache Pig, Hive, and HBase, tools used for processing large datasets in Hadoop. They cover Pig's features, including its high-level scripting language Pig Latin, execution modes, and user-defined functions; Hive's architecture, services, and query language HiveQL; and a comparison of these tools with traditional databases.

PIG

Introduction to PIG:

 Pig is a high-level platform or tool used to process large datasets.
 It provides a high level of abstraction over MapReduce.
 It provides a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.
 Pig Latin and the Pig Engine are the two main components of the Apache Pig tool.
 The result of Pig is always stored in HDFS.
One limitation of MapReduce is that the development cycle is very long: writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is time-consuming.
 Apache Pig reduces development time by using the multi-query approach.
 Pig is beneficial for programmers who do not come from a Java background.
 A task that takes about 200 lines of Java code can often be expressed in roughly 10 lines of Pig Latin.
 Programmers with SQL knowledge need less effort to learn Pig Latin.
Execution Modes of Pig :
Apache Pig scripts can be executed in three ways :
Interactive Mode (Grunt shell) :
You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter Pig Latin statements and get the output (using the DUMP operator).
Batch Mode (Script) :
You can run Apache Pig in batch mode by writing the Pig Latin script in a single file with the .pig extension.
Embedded Mode (UDF) :
Apache Pig lets us define our own functions (User Defined Functions) in programming languages such as Java and use them in our scripts.
Comparison of Pig with Databases :
 Pig Latin is a procedural language, whereas SQL is a declarative language.
 In Apache Pig, the schema is optional; we can store data without designing a schema (values are then referenced positionally as $0, $1, etc.). In SQL, the schema is mandatory.
 The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
 Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for query optimization in SQL.

Grunt :
The Grunt shell is Pig's interactive command shell.
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
Pig scripts can be executed in the Grunt shell, which is a native shell provided by Apache Pig to execute Pig queries.
We can also invoke shell and HDFS commands from Grunt using sh and fs.
Syntax of the sh command :
grunt> sh ls

Syntax of the fs command :
grunt> fs -ls

Pig Latin :
Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop.
It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.
Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as input and generates another relation as output.
· A statement can span multiple lines.
· Each statement must end with a semicolon.
· Statements may include expressions and schemas.
· By default, these statements are processed using multi-query execution, as shown in the short example below.
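A minimal Pig Latin sketch of such statements (the file path, field names, and threshold are illustrative, not taken from the notes):

-- load a comma-separated file from HDFS into a relation (the schema clause is optional)
students = LOAD '/pig_data/students.txt' USING PigStorage(',')
           AS (id:int, name:chararray, marks:int);
-- keep only the tuples that satisfy the condition
passed = FILTER students BY marks >= 40;
-- print the resulting relation on the screen
DUMP passed;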
User-Defined Functions :
Apache Pig provides extensive support for User Defined Functions (UDFs).
Using these UDFs, we can define our own functions and use them in Pig scripts.
UDF support is provided in six programming languages:
· Java
· Jython
· Python
· JavaScript
· Ruby
· Groovy


For writing UDFs, complete support is provided in Java, and limited support is provided in all the remaining languages.
Using Java, you can write UDFs involving all parts of the processing, such as data load/store, column transformation, and aggregation.
Since Apache Pig itself is written in Java, UDFs written in Java work more efficiently than those written in other languages.
Types of UDFs in Java (a Java filter-function sketch follows this list) :
Filter Functions :
 Filter functions are used as conditions in FILTER statements.
 These functions accept a Pig value as input and return a Boolean value.
Eval Functions :
 Eval functions are used in FOREACH...GENERATE statements.
 These functions accept a Pig value as input and return a Pig result.
Algebraic Functions :
 Algebraic functions act on inner bags in a FOREACH...GENERATE statement.
 These functions are used to perform full MapReduce operations on an inner bag.
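As a hedged illustration of a filter function (not part of the original notes), the sketch below defines a Java UDF that keeps only tuples whose first field is at least 18. The class name IsAdult and the age field are hypothetical, and the Pig and Hadoop client jars must be on the classpath.

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// Illustrative filter UDF: returns true only for tuples whose first field (an int) is >= 18.
public class IsAdult extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        // Filter out empty tuples and null values
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        int age = (Integer) input.get(0);  // the first field is assumed to hold an age
        return age >= 18;
    }
}

After packaging the class into a jar, it would be used from Pig Latin roughly as: REGISTER myudfs.jar; adults = FILTER people BY IsAdult(age);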
Data Processing Operators :
The Apache Pig operators form a high-level procedural language for querying large data sets using Hadoop and the MapReduce platform. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
These operators are the main tools Pig Latin provides to operate on the data. They allow you to transform the data by sorting, grouping, joining, projecting, and filtering.
The Apache Pig operators can be classified as :
Relational Operators :
Relational operators are the main tools Pig Latin provides to operate on the data.
Some of the relational operators are listed below (a combined example follows the list) :
LOAD: The LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.
FOREACH: This operator generates data transformations based on columns of data. It is used to add or remove fields from a relation.
FILTER: This operator selects tuples from a relation based on a condition.
JOIN: The JOIN operator is used to perform an inner equijoin of two or more relations based on common field values.
ORDER BY: ORDER BY is used to sort a relation on one or more fields in either ascending or descending order using the ASC and DESC keywords.
GROUP: The GROUP operator groups together the tuples that have the same group key (key field).
COGROUP: COGROUP is the same as the GROUP operator. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.
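A small Pig Latin sketch combining these operators; the file path, field names, and the salary threshold are illustrative, not taken from the notes:

-- LOAD: read an illustrative comma-separated employee file into a relation
emp = LOAD '/data/employees.txt' USING PigStorage(',')
      AS (id:int, name:chararray, dept:chararray, salary:float);
-- FILTER: keep only the well-paid employees
high_paid = FILTER emp BY salary > 50000.0;
-- GROUP: group the filtered tuples by department
by_dept = GROUP high_paid BY dept;
-- FOREACH: project the group key and count the tuples in each group
dept_count = FOREACH by_dept GENERATE group AS dept, COUNT(high_paid) AS cnt;
-- ORDER BY: sort the departments by count, descending
sorted_depts = ORDER dept_count BY cnt DESC;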
Diagnostic Operators :
The LOAD statement simply loads the data into the specified relation in Apache Pig.
To verify the execution of the LOAD statement, you have to use the diagnostic operators listed below (a short example follows the list).
Some Diagnostic Operators are :
DUMP: The DUMP operator is used to run Pig Latin statements and display the results on the
screen.
DESCRIBE: Use the DESCRIBE operator to review the schema of a particular relation. The
DESCRIBE operator is best used for debugging a script.
ILLUSTRATE: ILLUSTRATE operator is used to review how data is transformed through a
sequence of Pig Latin statements. ILLUSTRATE command is your best friend when it comes to
debugging a script.
EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and MapReduce
execution plans of a relation.
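Applied to the relations built in the relational-operator sketch above, the diagnostic operators would be used roughly like this (a sketch; the actual output depends on the data):

DUMP sorted_depts;        -- execute the pipeline and print the resulting relation
DESCRIBE dept_count;      -- print the schema of the dept_count relation
ILLUSTRATE sorted_depts;  -- show sample data at each step of the pipeline
EXPLAIN sorted_depts;     -- show the logical, physical, and MapReduce execution plans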
Hive:
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data and makes querying and analyzing easy. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Benefits:

1. Ease of use
2. Accelerated initial insertion of data
3. Superior scalability, flexibility, and cost-efficiency
4. Streamlined security
5. Low overhead
6. Exceptional working capacity


The major components of the Apache Hive architecture are :
1. Hive Client
2. Hive Services
3. Processing and Resource Management
4. Distributed Storage
HIVE CLIENT :
Hive supports applications written in languages such as Python, Java, C++, and Ruby using JDBC, ODBC, and Thrift drivers for running queries against Hive. Hence, one can easily write a Hive client application in the language of their choice.
Hive clients are categorized into three types (a JDBC example follows this list) :
1. Thrift Clients : The Hive server is based on Apache Thrift, so it can serve requests from any Thrift client.
2. JDBC Client : Hive allows Java applications to connect to it using the JDBC driver. The JDBC driver uses Thrift to communicate with the Hive Server.
3. ODBC Client : The Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive Server.
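A hedged sketch of a JDBC client (not part of the original notes): it assumes a HiveServer2 instance running on localhost:10000, the hive-jdbc driver jar on the classpath, and credentials that depend on the cluster's authentication setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (ships with hive-jdbc)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to the default database; URL, user, and password are illustrative
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));  // print each table name
            }
        }
    }
}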


HIVE SERVICE :
To perform all queries, Hive provides various services like the Hive server2, Beeline, etc.
The various services offered by Hive are :
1. Beeline
2. Hive Server 2
3. Hive Driver
4. Hive Compiler
5. Optimizer
6. Execution Engine
7. Metastore
8. HCatalog
9. WebHCat
PROCESSING AND RESOURCE MANAGEMENT :
Hive internally uses the MapReduce framework as its de facto engine for executing queries.
MapReduce is a software framework for writing applications that process massive amounts of data in parallel on large clusters of commodity hardware.
A MapReduce job works by splitting data into chunks, which are processed by map and reduce tasks.

DISTRIBUTED STORAGE :
Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File System for the
distributed storage.
Hive Shell :
The Hive shell is the primary way to interact with Hive. It is the default service in Hive and is also called the CLI (command line interface). The Hive shell is similar to the MySQL shell. Hive users can run HiveQL queries in the Hive shell. In the Hive shell, the up and down arrow keys are used to scroll through previous commands. HiveQL is case-insensitive (except for string comparisons).
The Tab key autocompletes (provides suggestions while you type) Hive keywords and functions. The Hive shell can run in two modes :
Non-Interactive mode :
In non-interactive mode, the shell executes the statements contained in a script file. The Hive shell is run in non-interactive mode with the -f option. Example: $ hive -f script.q, where script.q is the script file.
Interactive mode :
Hive works in interactive mode when you type the command "hive" directly in the terminal.
Example: $ hive
hive> show databases;
Hive Services :


The following are the services provided by Hive :

Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
Hive Web User Interface: The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
Hive Metastore: The metastore is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
Hive Server: Also referred to as the Apache Thrift Server, it accepts requests from different clients and forwards them to the Hive Driver.
Hive Driver: It receives queries from sources such as the web UI, CLI, Thrift, and JDBC/ODBC drivers, and transfers the queries to the compiler.
Hive Compiler: The compiler parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. The execution engine then executes these tasks in the order of their dependencies.

MetaStore :

The Hive metastore (HMS) is a service that stores Apache Hive and other metadata in a backend RDBMS, such as MySQL or PostgreSQL. Impala, Spark, Hive, and other services share the metastore. The connections to and from HMS include HiveServer, Ranger, and the NameNode, which represents HDFS. Beeline, Hue, JDBC, and Impala shell clients make requests through Thrift or JDBC to HiveServer. The HiveServer instance reads/writes data to HMS. By default, redundant HMS instances operate in active/active mode. The physical data resides in a backend RDBMS dedicated to HMS, and all connections are routed to a single RDBMS service at any given time. HMS talks to the NameNode over Thrift and functions as a client to HDFS. HMS connects directly to Ranger and the NameNode (HDFS), and so does HiveServer. One or more HMS instances on the backend can talk to other services, such as Ranger.

Comparison with Traditional Databases :

 An RDBMS is used to maintain a database, whereas Hive is used to maintain a data warehouse.
 An RDBMS uses SQL (Structured Query Language), whereas Hive uses HQL (Hive Query Language).
 The schema is fixed in an RDBMS, whereas it can vary in Hive.
 An RDBMS stores normalized data, whereas Hive can store both normalized and de-normalized data.
 Tables in an RDBMS are sparse, whereas tables in Hive are dense.
 An RDBMS doesn't support partitioning, whereas Hive supports automatic partitioning.
 No partitioning method is used in an RDBMS, whereas Hive uses the sharding method for partitioning.

HiveQL :
Even though it is based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT. HiveQL historically lacked support for transactions and materialized views, and offered only limited subquery support. Support for INSERT, UPDATE, and DELETE with full ACID functionality was made available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
Example :
DROP TABLE IF EXISTS docs;
CREATE TABLE docs
(line STRING);

Checks if the table docs exists and drops it if it does. Creates a new table called docs with a single column of type STRING called line.
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

Loads the specified file or directory (in this case "input_file") into the table.
OVERWRITE specifies that the target table into which the data is being loaded is to be re-written; otherwise, the data would be appended.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

The query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a table called word_counts with two columns: word and count.
This query draws its input from the inner query (SELECT explode(split(line, '\s')) AS word FROM docs) temp.
The inner query serves to split the input words into different rows of a temporary table aliased as temp.
The GROUP BY word clause groups the results based on their keys.
This results in the count column holding the number of occurrences for each word of the word column.
The ORDER BY word clause sorts the words alphabetically.
Tables :
Here are the types of tables in Apache Hive:
Managed Tables :

In a managed table, both the table data and the table schema are managed by Hive.
The data will be located in a folder named after the table within the Hive data warehouse, which
is essentially just a file location in HDFS.
By managed or controlled we mean that if you drop (delete) a managed table, then Hive will
delete both the Schema (the description of the table) and the data files associated with the table.
Default location is /user/hive/warehouse.
The syntax for a managed table :
CREATE TABLE IF NOT EXISTS stocks
(exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

External Tables :
An external table is one where only the table schema is controlled by Hive. In most cases, the
user will set up the folder location within HDFS and copy the data file(s) there. This location is included as part of the table definition statement. When an external table is deleted, Hive will only delete the schema associated with the table. The data files are not affected.
Syntax for External Tables :
CREATE EXTERNAL TABLE IF NOT EXISTS stocks
(exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Querying Data :
A query is a request for data or information from a database table or a combination of tables. This data may be generated as results returned by Structured Query Language (SQL) or as pictorials, graphs, or complex results, e.g., trend analyses from data-mining tools. One of several different query languages may be used to perform a range of simple to complex database queries. SQL, the most well-known and widely-used query language, is familiar to most database administrators (DBAs). In Hive, such queries are written in HiveQL.
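A small illustrative HiveQL query (the sales table and its columns are hypothetical, not from the notes):

SELECT item, SUM(amount) AS total_amount
FROM sales
WHERE amount > 100
GROUP BY item
ORDER BY total_amount DESC
LIMIT 10;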

User-Defined Functions :
In Hive, users can define their own functions to meet certain client requirements. These are known as UDFs in Hive. User-defined functions are written in Java for specific modules. Some UDFs are specifically designed for the reusability of code in application frameworks. The developer writes these functions in Java and integrates the UDFs with Hive.
During query execution, the developer can directly use the code, and the UDFs will return outputs according to the user-defined tasks.
UDFs provide high performance in terms of coding and execution.
The general type of UDF accepts a single input value and produces a single output value.
We can use two different interfaces for writing Apache Hive User-Defined Functions, as sketched after this list :
1. Simple API
2. Complex API
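A hedged sketch of a UDF written with the Simple API; the class name ToLowerUDF and its behaviour (lower-casing a string) are illustrative, and the hive-exec jar must be on the classpath:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Illustrative Simple API UDF: takes a single string value and returns its lower-case form.
public class ToLowerUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;  // propagate NULLs unchanged
        }
        return new Text(input.toString().toLowerCase());
    }
}

It would then be registered and called roughly as: ADD JAR my-udfs.jar; CREATE TEMPORARY FUNCTION my_lower AS 'ToLowerUDF'; SELECT my_lower(name) FROM some_table;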
Sorting And Aggregating :
Sorting data in Hive can be achieved by use of a standard ORDER BY clause, but there is a catch. ORDER BY produces a result that is totally sorted, as expected, but to do so it sets the number of reducers to one, making it very inefficient for large datasets. When a globally sorted result is not required, and in many cases it isn't, you can use Hive's nonstandard extension SORT BY instead. SORT BY produces a sorted file per reducer.
If you want to control which reducer a particular row goes to, typically so that you can perform some subsequent aggregation, you can use Hive's DISTRIBUTE BY clause.
Example :
· To sort the weather dataset by year and temperature, in such a way as to ensure that all the rows for a given year end up in the same reducer partition :
hive> FROM records2
    > SELECT year, temperature
    > DISTRIBUTE BY year
    > SORT BY year ASC, temperature DESC;

· Output :
1949 111
1949 78
1950 22
1950 0
1950 -11
MapReduce Scripts in Hive / Hive Scripts :
As with any other scripting language, Hive scripts are used to execute a set of Hive commands collectively. Hive scripting helps us reduce the time and effort invested in writing and executing the individual commands manually. Hive scripting is supported in Hive 0.10.0 and higher versions. A small sketch is shown below.
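A minimal sketch of a Hive script; the file name sample.hql and the database and table names are illustrative, not from the notes:

-- contents of sample.hql
CREATE DATABASE IF NOT EXISTS demo;
USE demo;
CREATE TABLE IF NOT EXISTS visits (ip STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
SELECT COUNT(*) FROM visits;

The script's commands are then executed collectively with: $ hive -f sample.hql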
Joins and SubQueries :
JOINS :
Join queries can be performed on two tables present in Hive. Joins are of 4 types, as illustrated after this list :
· Inner join: Only the records common to both tables are retrieved by the inner join.
· Left outer join: Returns all the rows from the left table, even if there are no matches in the right table.
· Right outer join: Returns all the rows from the right table, even if there are no matches in the left table.
· Full outer join: It combines the records of both tables based on the JOIN condition given in the query. It returns all records from both tables and fills in NULL values for the columns that have no match on either side.
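Hedged HiveQL sketches of the first two join types; the customers and orders tables and their columns are illustrative:

-- inner join: only customers that have at least one matching order
SELECT c.id, c.name, o.amount
FROM customers c JOIN orders o ON (c.id = o.customer_id);

-- left outer join: every customer, with NULL order columns where there is no match
SELECT c.id, c.name, o.amount
FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.customer_id);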
SUBQUERIES :

A Query present within a Query is known as a subquery. The main query will depend on the
values returned by the subqueries. Subqueries can be classified into two types :
· Subqueries in FROM clause
· Subqueries in WHERE clause
When to use subqueries :
· To get a particular value combined from two column values from different tables.
· When the values of one table depend on another table.
· For comparative checking of a column's values against another table.
Syntax :

Subquery in FROM clause :
SELECT <column names 1, 2...n> FROM (SubQuery) <TableName_Main>;

Subquery in WHERE clause :
SELECT <column names 1, 2...n> FROM <TableName_Main> WHERE col1 IN (SubQuery);
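Two hedged examples using the same illustrative customers and orders tables as in the join sketch above:

-- subquery in the FROM clause: total order amount per customer
SELECT t.customer_id, t.total
FROM (SELECT customer_id, SUM(amount) AS total
      FROM orders
      GROUP BY customer_id) t;

-- subquery in the WHERE clause: customers that have placed at least one order
SELECT name
FROM customers
WHERE id IN (SELECT customer_id FROM orders);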

Hadoop Random Access Databases

HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access the data in a random manner.

What is HBase?

HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.

HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in
the Hadoop File System.

One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses
the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and
provides read and write access.


HBase and HDFS

 HDFS is a distributed file system suitable for storing large files, whereas HBase is a database built on top of HDFS.
 HDFS does not support fast individual record lookups, whereas HBase provides fast lookups for larger tables.
 HDFS provides high latency batch processing, whereas HBase provides low latency access to single rows from billions of records (random access).
 HDFS provides only sequential access to data, whereas HBase internally uses hash tables and provides random access, storing the data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase

HBase is a column-oriented database, and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on disk. Each cell value of the table has a timestamp. In short, in HBase:

 Table is a collection of rows.

 Row is a collection of column families.

 Column family is a collection of columns.

 Column is a collection of key value pairs.

Given below is an example schema of a table in HBase. Each row is identified by a Rowid, and the table has four column families, each containing its own columns:

Rowid | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3)
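A hedged HBase shell sketch that creates such a table and reads it back; the table name employee and the column family, column, and row names are illustrative:

hbase> create 'employee', 'personal', 'professional'       # table with two column families
hbase> put 'employee', 'row1', 'personal:name', 'Alice'     # write a cell in the personal family
hbase> put 'employee', 'row1', 'professional:dept', 'Engineering'
hbase> get 'employee', 'row1'                               # random read of a single row
hbase> scan 'employee'                                      # scan the whole table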


Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data. In short, they have column families.

 A row-oriented database is suitable for Online Transaction Processing (OLTP), whereas a column-oriented database is suitable for Online Analytical Processing (OLAP).
 Row-oriented databases are designed for a small number of rows and columns, whereas column-oriented databases are designed for huge tables.

HBase and RDBMS

 HBase is schema-less; it does not have the concept of a fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
 HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
 There are no transactions in HBase, whereas an RDBMS is transactional.
 HBase has de-normalized data, whereas an RDBMS has normalized data.
 HBase is good for semi-structured as well as structured data, whereas an RDBMS is good for structured data.

Features of HBase

 HBase is linearly scalable.

 It has automatic failure support.

 It provides consistent read and writes.

 It integrates with Hadoop, both as a source and a destination.

 It has an easy Java API for clients.

 It provides data replication across clusters.

Where to Use HBase

 Apache HBase is used to have random, real-time read/write access to Big Data.

 It hosts very large tables on top of clusters of commodity hardware.

 Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts on top of the Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase

 It is used whenever there is a need for write-heavy applications.

 HBase is used whenever we need to provide fast random access to available data.

 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase History

Year Event

Nov 2006 Google released the paper on BigTable.

Feb 2007 Initial HBase prototype was created as a Hadoop contribution.

Oct 2007 The first usable HBase along with Hadoop 0.15.0 was released.

Jan 2008 HBase became the sub project of Hadoop.

Oct 2008 HBase 0.18.1 was released.

Jan 2009 HBase 0.19.0 was released.

Sept 2009 HBase 0.20.0 was released.

May 2010 HBase became Apache top-level project.
