
Big Data Analytics

Unit 4
SURESH BABU M
ASST PROF
IT DEPT
UNIT-IV
Hadoop Eco System-I
 Pig: Introduction to PIG, Execution Modes of Pig,
Comparison of Pig with Databases, Grunt, Pig Latin, User
Defined Functions, Data Processing operators.
 Hive: Hive Shell, Hive Services, Hive Metastore,
Comparison with Traditional Databases, HiveQL, Tables,
Querying Data and User Defined Functions.
4.1 Introduction to Pig
 Apache Pig raises the level of abstraction for processing large
datasets.
 With Pig, the data structures are much richer, typically being
multivalued and nested, and the transformations you can
apply to the data are much more powerful.
 Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
 Pig is a scripting language for exploring large
datasets.
 Pig was designed to be extensible.
 As another benefit, UDFs tend to be more
reusable than the libraries developed for writing
MapReduce programs.
 Pig is an open-source high level data flow system.
It provides a simple language called Pig Latin, for
queries and data manipulation, which are then
compiled into MapReduce jobs that run on
Hadoop.
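As a quick illustration, here is a minimal Pig Latin sketch of such a data flow; the file name, field names, and the 9999 sentinel value are hypothetical:
-- load a tab-separated file of (year, temperature, quality) records
records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999;
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;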
4.2 Execution Types
 Pig has two execution types or modes: local mode and MapReduce
mode.
1) Local mode
• In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets and for trying out Pig.
% pig -x local
2) MapReduce mode
• The MapReduce mode is also known as Hadoop mode.
• It is the default mode. In this mode, Pig translates Pig Latin into MapReduce jobs and executes them on the cluster.
• It can be run against a pseudo-distributed or fully distributed Hadoop installation. Here, the input and output data are present on HDFS.
$ pig
4.3 Comparison of Pig with Databases
4.4 Grunt
 Grunt is an interactive shell for running Pig commands. Grunt
is started when no file is specified for Pig to run and the -e
option is not used. It is also possible to run Pig scripts from
within Grunt using run and exec.
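For example, assuming a script file named script.pig (a hypothetical name), run executes it within the current Grunt session so its aliases remain accessible afterwards, while exec runs it in a separate batch-style session:
grunt> run script.pig
grunt> exec script.pig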
4.5 Pig Latin
4.5.1 Structures
4.5.2 Statements
4.5.3 Expressions
4.5.4 Types
4.5.5 Schemas
4.5.6 Functions
4.5.7 Macros
4.5.1 Structures
• A Pig Latin program consists of a collection of statements.
• For example, a GROUP operation is a type of statement:
grouped_records = GROUP records BY year;
• Statements are usually terminated with a semicolon.
 Pig Latin has mixed rules on case sensitivity. Operators
and commands are not case sensitive (to make interactive
use more forgiving); however, aliases and function names
are case sensitive.
4.5.2 Statements
• When the Pig Latin interpreter sees the first line containing the LOAD statement, it confirms that it is syntactically and semantically correct and adds it to the logical plan, but it does not load the data from the file.
 Pig validates the GROUP and FOREACH...GENERATE
statements, and adds them to the logical plan without
executing them. The trigger for Pig to start execution is
the DUMP statement. At that point, the logical plan is
compiled into a physical plan and executed.
The physical plan that Pig prepares is a series of MapReduce jobs, which Pig runs in the local JVM in local mode and on a Hadoop cluster in MapReduce mode.
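To inspect these plans without triggering execution, the EXPLAIN command can be applied to a relation (max_temp here is a hypothetical alias); it prints the logical, physical, and MapReduce plans:
grunt> EXPLAIN max_temp;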
Multiquery Execution
• Consider a script in which relations B and C are both derived from A. To save reading A twice, Pig can run the script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C. This feature is called multiquery execution.
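A minimal sketch of such a script, with hypothetical file names and filter conditions; because both outputs derive from A, Pig reads A only once:
A = LOAD 'input/data.txt' AS (name:chararray, fruit:chararray);
B = FILTER A BY fruit == 'banana';
C = FILTER A BY fruit != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';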
4.5.3 Expressions
4.5.4 Types
4.5.5 Schemas
• A LOAD statement can be used to attach a schema to a relation:
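(A sketch with a hypothetical file and field names; DESCRIBE prints the schema that the AS clause attached, though the exact output format varies slightly between Pig versions.)
grunt> records = LOAD 'input/sample.txt' AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}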
4.5.6 Functions
Functions in Pig come in four types:
1) Eval function: A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag.
2) Filter function: A special type of eval function that returns a Boolean result. An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.
3) Load function: A function that specifies how to load data into a relation from external storage.
4) Store function: A function that specifies how to save the contents of a relation to external storage.
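A brief sketch of an eval function and a filter function in use, assuming the grouped_records relation from the earlier GROUP example and a temperature field in records:
max_temp = FOREACH grouped_records GENERATE group, MAX(records.temperature);
non_empty = FILTER grouped_records BY NOT IsEmpty(records);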
4.5.7 Macros
 Macros provide a way to package reusable pieces of Pig Latin
code from within Pig Latin itself.
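A sketch of a macro that packages the group-and-aggregate pattern used earlier; the macro name and parameter names are illustrative, and the records relation is assumed to have year and temperature fields:
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
  A = GROUP $X BY $group_key;
  $Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
max_temp = max_by_group(records, year, temperature);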
4.6 User-Defined Functions
4.7 Data Processing Operators
Data Processing Operators
4.7.1 Loading and Storing Data
4.7.2 Filtering Data
4.7.3 Grouping and Joining Data
4.7.4 Sorting Data
4.7.5 Combining and Splitting Data
4.7.1 Loading and Storing Data
• Storing the results is straightforward, too. Here's an example of using PigStorage to store tuples as plain-text values separated by a colon character:
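(A minimal sketch, assuming a relation A has already been defined and writing to a hypothetical output directory named out.)
grunt> STORE A INTO 'out' USING PigStorage(':');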
4.7.2 Filtering Data
 Once you have some data loaded into a relation,
often the next step is to filter it to remove the
data that you are not interested in.
 By filtering early in the processing pipeline, you
minimize the amount of data flowing through the
system, which can improve efficiency.
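For example, assuming a records relation with a quality field, rows with unwanted quality codes can be dropped early in the pipeline (the codes shown are hypothetical):
good_records = FILTER records BY quality == 0 OR quality == 1;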
FOREACH...GENERATE
 The FOREACH...GENERATE operator is used
to act on every row in a relation. It can be used to
remove fields or to generate new ones.
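A small sketch, assuming a relation prices whose third field is numeric; the statement projects the first field, derives a new field from the third, and adds a constant:
grunt> B = FOREACH prices GENERATE $0, $2 + 1, 'Constant';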
4.7.3 Grouping and Joining Data
• Pig has very good built-in support for join operations.
• JOIN: Let's look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
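Joining A on its first field and B on its second field (assuming these numeric IDs are the intended join keys) produces the following tuples; the order of the output rows may differ:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)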
COGROUP: The COGROUP statement is similar to JOIN, but
instead creates a nested set of output tuples. COGROUP
generates a tuple for each unique grouping key.
GROUP: Where COGROUP groups the data in two or more relations, the GROUP statement groups the data in a single relation. GROUP supports grouping by more than equality of keys: you can use an expression or user-defined function as the group key.
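For example, relation A from above can be grouped by an expression rather than a plain key, here the number of characters in the item name (using the built-in SIZE function and assuming the second field is a chararray):
grunt> D = GROUP A BY SIZE($1);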
4.7.4 Sorting Data
 Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
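A can be sorted with ORDER; for example, ordering on the first field ascending and the second field descending gives:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)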
4.7.5 Combining and Splitting Data
Hive: Hive Shell, Hive Services,
Hive Metastore, Comparison with
Traditional Databases, HiveQL,
Tables, Querying Data and User
Defined Functions.
4.8 HIVE
 Hive was created to make it possible for analysts with strong
SQL skills (but meager Java programming skills) to run
queries on the huge volumes of data that Facebook stored in
HDFS.
 Today, Hive is a successful Apache project used by many
organizations as a general-purpose, scalable data processing
platform.
• Of course, SQL isn't ideal for every big data problem—it's not a good fit for building complex machine-learning algorithms, for example.
What is HIVE
 Hive is a data warehouse system which is used to
analyze structured data. It is built on the top of
Hadoop. It was developed by Facebook.
 Hive provides the functionality of reading,
writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
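For example, a query over a hypothetical records table is written in ordinary SQL style, and Hive turns it into one or more MapReduce jobs behind the scenes:
hive> SELECT year, MAX(temperature)
    > FROM records
    > GROUP BY year;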
Features of Hive
These are the following features of Hive:
 Hive is fast and scalable.
 It provides SQL-like queries (i.e., HQL) that are implicitly
transformed to MapReduce or Spark jobs.
 It is capable of analyzing large datasets stored in HDFS.
 It allows different storage types such as plain text, RCFile, and
HBase.
 It uses indexing to accelerate queries.
 It can operate on compressed data stored in the Hadoop
ecosystem.
• It supports user-defined functions (UDFs), through which users can plug in their own functionality.
4.8 The Hive Shell
 The shell is the primary way that we will interact with Hive,
by issuing commands in HiveQL. HiveQL is Hive’s query
language, a dialect of SQL.
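For example, the shell is started with the hive command, and HiveQL statements are terminated with a semicolon:
% hive
hive> SHOW TABLES;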
4.9 Hive Services
The Hive shell is only one of several services that you can
run using the hive command.
1) cli: The command-line interface to Hive (the shell). This is the default service.
2) hiveserver2: HiveServer2 improves on the original HiveServer by supporting authentication and multiuser concurrency.
3) beeline: A command-line interface to Hive that works in embedded mode (like the regular CLI), or by connecting to a HiveServer2 process using JDBC.
4) hwi: The Hive Web Interface, a simple web interface that can be used as an alternative to the CLI without having to install any client software.
5) jar: The Hive equivalent of hadoop jar, a convenient way to run Java applications that include both Hadoop and Hive classes on the classpath.
6) metastore: By default, the metastore is run in the same process as the Hive service. Using this service, it is possible to run the metastore as a standalone (remote) process. Set the METASTORE_PORT environment variable to specify the port the server will listen on.
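Each of these services is started with hive --service, for example:
% hive --service hiveserver2
% hive --service metastore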
Hive clients
 Thrift Client The Hive server is exposed as a Thrift service, so
it’s possible to interact with it using any programming
language that supports Thrift. There are third-party projects
providing clients for Python and Ruby.
 JDBC driver: a Java application will connect to a Hive server
running in a separate process at the given host and port.
 ODBC driver:An ODBC driver allows applications that
support the ODBC protocol (such as business intelligence
software) to connect to Hive.
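For example, Beeline can connect to a HiveServer2 process through the JDBC driver mentioned above, using a URL of the form jdbc:hive2://host:port (10000 is the conventional default port; the host name below is a placeholder):
% beeline -u jdbc:hive2://localhost:10000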
4.10 Hive Metastore
 The metastore is the central repository of Hive metadata.
 The metastore is divided into two pieces: a service and the
backing store for the data.
1) Embedded metastore: By default, the metastore service runs in the same JVM as the Hive service and uses an embedded Derby database. Only one embedded Derby database can access the database files on disk at any one time, which means you can have only one Hive session open at a time that accesses the same metastore.
2) Local metastore: The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database. The metastore service still runs in the same process as the Hive service but connects to a database running in a separate process, either on the same machine or on a remote machine.
3) Remote metastore: One or more metastore servers run in separate processes to the Hive service. This brings better manageability and security because the database tier can be completely firewalled off, and the clients no longer need the database credentials.
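A sketch of the client-side setting for a remote metastore in hive-site.xml; the host name is a placeholder and 9083 is the conventional metastore port:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>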
4.11 Comparison with Traditional
Databases
Schema on Read Versus Schema on Write
 schema on write :The data is checked against the schema when
it is written into the database.
 schema on read :Hive, on the other hand, doesn’t verify the
data when it is loaded, but rather when a query is issued.
 Schema on read makes for a very fast initial load, since the
data does not have to be read, parsed, and serialized to disk
in the database’s internal format.
 Schema on write makes query time performance faster
because the database can index columns and perform
compression on the data. The trade-off, however, is that it
takes longer to load data into the database.
Updates, Transactions, and Indexes
 Hive has long supported adding new rows in bulk to an
existing table by using INSERT INTO to add new data
files to a table.
 HDFS does not provide in-place file updates, so changes
resulting from inserts, updates, and deletes are stored in
small delta files.
 Delta files are periodically merged into the base table
files by MapReduce jobs that are run in the background
by the metastore.
 Hive also has support for table- and partition-level
locking.
 There are currently two index types: compact and bitmap.
 Compact indexes store the HDFS block numbers of each
value, rather than each file offset, so they don’t take up
much disk space but are still effective for the case where
values are clustered together in nearby rows.
 Bitmap indexes use compressed bitsets to efficiently
store the rows that a particular value appears in, and they
are usually appropriate for low-cardinality columns (such
as gender or country).
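As a hedged illustration of the DDL (valid in Hive releases prior to 3.0, which still supported indexing; the table and column names are hypothetical):
CREATE INDEX records_year_idx
ON TABLE records (year)
AS 'COMPACT'
WITH DEFERRED REBUILD;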
4.12 HiveQL
Operators and Functions
 The usual set of SQL operators is provided by Hive:
relational operators ,arithmetic operators and logical
operators.
 Hive comes with a large number of built-in functions—
too many to list here—divided into categories that
include mathematical and statistical functions, string
functions, date functions (for operating on string
representations of dates), conditional functions,
aggregate functions, and functions for working with
XML (using the xpath function) and JSON.
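A few representative built-ins, shown here in a quick query with no FROM clause (supported in recent Hive versions):
hive> SELECT round(3.14159, 2), concat('big', ' ', 'data'), year('2024-01-15');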
Conversions:
Hive performs some implicit conversions between types. For example, a TINYINT will be converted to an INT if an expression expects an INT; however, the reverse (narrowing) conversion will not occur, and Hive will return an error unless the CAST operator is used.
 Any numeric type can be implicitly converted to a wider
type, or to a text type (STRING,VARCHAR, CHAR).
 All the text types can be implicitly converted to another text
type.
 TIMESTAMP and DATE can be implicitly converted to a text
type.
 BOOLEAN types cannot be converted to any other type, and
they cannot be implicitly converted to any other type in
expressions.
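For example, CAST requests an explicit conversion; casting a non-numeric string such as 'X' to INT yields NULL rather than an error:
hive> SELECT CAST('1' AS INT) + 2;
hive> SELECT CAST('X' AS INT);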
4.13 Tables
4.13.1 Managed Tables and External Tables
4.13.2 Partitions and Buckets
4.13.3 Storage Formats
4.13.4 Importing Data
4.13.5 Altering Tables
4.13.6 Dropping Tables
4.13.1 Managed Tables and External
Tables
 A Hive table is logically made up of the data being stored and
the associated metadata describing the layout of the data in
the table.
Managed Tables and External Tables
 Managed Tables : When you create a table in Hive, by default
Hive will manage the data, which means that Hive moves the
data into its warehouse directory.
• External table: an external table tells Hive to refer to data that is at an existing location outside the warehouse directory.
 When you load data into a managed table, it is moved into
Hive’s warehouse directory.
 If the table is later dropped, using:
DROP TABLE managed_table;
Managed table: the table, including its metadata and its data, is
deleted.
 External Table: When you drop an external table, Hive will
leave the data untouched and only delete the metadata.
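A hedged sketch of both table types (paths and names are placeholders); note the EXTERNAL keyword and the LOCATION clause:
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/hive/data.txt' INTO TABLE managed_table;

CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/data/external_table';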
4.13.2 Partitions and Buckets
Partitions and Buckets
 Hive organizes tables into partitions—a way of dividing a table
into coarse-grained parts based on the value of a partition column,
such as a date. Using partitions can make it faster to do queries
on slices of the data.
 Tables or partitions may be subdivided further
into buckets to give extra structure to the data that
may be used for more efficient queries.
 For example, bucketing by user ID means we can
quickly evaluate a user-based query by running it
on a randomized sample of the total set of users.
 A table may be partitioned in multiple
dimensions. For example, in addition to
partitioning logs by date, we might also
subpartition each date partition by country to permit
efficient queries by location.
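A sketch of such a doubly partitioned table (names are illustrative); the partition columns are declared separately from the data columns, and each load names its target partition:
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

LOAD DATA LOCAL INPATH 'input/file1'
INTO TABLE logs
PARTITION (dt='2024-01-01', country='GB');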
Buckets
 There are two reasons why you might want to organize
your tables (or partitions) into buckets. The first is to
enable more efficient queries.
 Bucketing imposes extra structure on the table, which
Hive can take advantage of when performing certain
queries.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here we are using the user ID to determine the bucket
4.13.3 Storage Formats
Storage Formats
 There are two dimensions that govern table storage in Hive:
the row format and the file format.
 In Hive parlance, the row format is defined by a SerDe, a
portmanteau word for a Serializer-Deserializer.
 When acting as a deserializer, which is the case when
querying a table, a SerDe will deserialize a row of data from
the bytes in the file to objects used internally by Hive to
operate on that row of data.
 When used as a serializer, which is the case when performing
an INSERT or CTAS (see “Importing Data” on page 500), the
table’s SerDe will serialize Hive’s internal representation of a
row of data into the bytes that are written to the output file.
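A typical delimited-text declaration, which selects the default text SerDe with explicit delimiters (the table name and delimiters are illustrative):
CREATE TABLE my_table (id INT, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;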
4.13.4 Importing Data
• You can populate a table with data files using the LOAD DATA statement, with the results of a query on another Hive table using an INSERT statement, or at creation time using the CTAS construct, which is an abbreviation for CREATE TABLE...AS SELECT.
Multitable insert
 multitable insert is more efficient than multiple INSERT
statements because the source table needs to be scanned
only once to produce the multiple disjoint outputs.
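A hedged sketch with hypothetical source and target tables (the targets must already exist); the single FROM clause feeds both INSERT clauses in one scan of records:
FROM records
INSERT OVERWRITE TABLE records_by_year
  SELECT year, COUNT(1)
  GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
  SELECT year, COUNT(1)
  WHERE quality = 0
  GROUP BY year;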
CREATE TABLE...AS SELECT
CREATE TABLE target AS SELECT col1, col2 FROM source;
 A CTAS operation is atomic, so if the SELECT query fails for
some reason, the table is not created.
4.13.5 Altering Tables
 You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;
 Hive allows you to change the definition for columns, add
new columns, or even replace all existing columns in a table
with a new set.
ALTER TABLE target ADD COLUMNS (col3 STRING);
4.13.6 Dropping Tables
 The DROP TABLE statement deletes the data and metadata
for a table. In the case of external tables, only the metadata is
deleted; the data is left untouched.
 If you want to delete all the data in a table but keep the table
definition, use TRUNCATE
TABLE. For example:
TRUNCATE TABLE my_table;
 In a similar vein, if you want to create a new, empty table
with the same schema as another table, then use the LIKE
keyword:
CREATE TABLE new_table LIKE existing_table;
4.14 Querying Data
• 4.14.1 Sorting and Aggregating
• 4.14.2 MapReduce Scripts
• 4.14.3 Joins
• 4.14.4 Subqueries
• 4.14.5 Views
4.14.1 Sorting and Aggregating
• Sorting data in Hive can be achieved by using a standard ORDER BY clause.
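For example, over the hypothetical records table used earlier:
hive> SELECT year, temperature
    > FROM records
    > ORDER BY year ASC, temperature DESC;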
4.14.2 MapReduce Scripts
 Using an approach like Hadoop Streaming, the
TRANSFORM, MAP, and REDUCE clauses make it possible
to invoke an external script or program from Hive.
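A hedged sketch: the script name and what it does (filtering rows and emitting two columns) are hypothetical, but ADD FILE followed by TRANSFORM ... USING ... AS is the standard pattern:
hive> ADD FILE /path/to/is_good_quality.py;
hive> FROM records
    > SELECT TRANSFORM(year, temperature, quality)
    > USING 'is_good_quality.py'
    > AS year, temperature;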
4.14.3 Joins
Inner joins
Outer joins
 Outer joins allow you to find nonmatches in the tables being
joined. In the current example, when we performed an inner
join, the row for Ali did not appear in the output, because the
ID of the item she purchased was not present in the things
table.
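With a left outer join, Ali's row is returned with NULLs for the missing things columns (this sketch assumes both tables carry an id join column):
hive> SELECT sales.*, things.*
    > FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);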
Semi joins
4.14.4 Subqueries
 A subquery is a SELECT statement that is embedded in
another SQL statement.
 Hive has limited support for subqueries, permitting a
subquery in the FROM clause of a SELECT statement, or in
the WHERE clause in certain cases.
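A sketch of a subquery in the FROM clause (the records table and its columns are hypothetical); the inner query computes a per-station maximum and the outer query averages it by year:
SELECT mt.year, AVG(mt.max_temperature)
FROM (
  SELECT year, station, MAX(temperature) AS max_temperature
  FROM records
  GROUP BY year, station
) mt
GROUP BY mt.year;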
4.14.5 Views
 A view is a sort of “virtual table” that is defined by a SELECT
statement.
 Views in Hive are read-only, so there is no way to load or
insert data into an underlying base table via a view.
 Views can be used to present data to users in a way that
differs from the way it is actually stored on disk.
 Views may also be used to restrict users’ access to particular
subsets of tables that they are authorized to see.
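For example, a view can hide a data-quality filter from users (the table, columns, and filter condition are illustrative):
CREATE VIEW valid_records AS
SELECT * FROM records
WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9);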
4.15 User-Defined Functions
 UDFs have to be written in Java, the language that Hive itself
is written in.
 There are three types of UDF in Hive: (regular) UDFs,
user-defined aggregate functions (UDAFs), and user-defined
table-generating functions (UDTFs).
 They differ in the number of rows that they accept as input
and produce as output.
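A minimal sketch of a regular UDF in Java, modelled on the common "strip whitespace" example; the package name, class name, and jar path below are hypothetical:
package com.example.hive;   // hypothetical package

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A regular UDF: it operates on a single row and returns a single value.
public class Strip extends UDF {
  private Text result = new Text();

  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }
}

Once the class is compiled and packaged into a jar, it is registered and used from the Hive shell:
hive> ADD JAR /path/to/hive-udfs.jar;
hive> CREATE TEMPORARY FUNCTION strip AS 'com.example.hive.Strip';
hive> SELECT strip('  banana  ');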
Writing a UDF
Writing a UDAF
An evaluator must implement five methods:
1) init(): The init() method initializes the evaluator and resets its internal state.
2) iterate(): The iterate() method is called every time there is a new value to be aggregated. The evaluator should update its internal state with the result of performing the aggregation.
3) terminatePartial(): The terminatePartial() method is called when Hive wants a result for the partial aggregation.
4) merge(): The merge() method is called when Hive decides to combine one partial aggregation with another.
5) terminate(): The terminate() method is called when the final result of the aggregation is needed.