Unit 5 Handouts
HIVE
• Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was originally developed by Facebook.
• Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, written in HQL (Hive Query Language), which are internally converted to MapReduce jobs (a short example follows these bullets).
• Using Hive, we avoid the traditional approach of writing complex MapReduce programs by hand. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).
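As a quick illustration (the table name is hypothetical, not from the handout), a single HQL statement such as the following is compiled into one or more MapReduce jobs behind the scenes:

SELECT dept, COUNT(*)
FROM employee
GROUP BY dept;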
Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types such as plain text, RCFile, and HBase.
• It uses indexing to accelerate queries.
• It can operate on compressed data stored in the Hadoop ecosystem.
• It supports user-defined functions (UDFs), so users can plug in their own functionality.
Hive Architecture
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:-
• Thrift Server - It is a cross-language service provider platform that serves requests from all programming languages that support Thrift.
• JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver.
• ODBC Driver - It allows applications that support the ODBC protocol to connect to Hive.
Hive Services
The following are the services provided by Hive:-
• Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
• Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
• Hive MetaStore - It is a central repository that stores the structure information of the various tables and partitions in the warehouse, including column and type metadata, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
• Hive Server - It is referred to as the Apache Thrift Server. It accepts requests from different clients and passes them to the Hive Driver.
• Hive Driver - It receives queries from different sources such as the web UI, CLI, Thrift, and the JDBC/ODBC drivers, and transfers the queries to the compiler.
• Hive Compiler - The compiler parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
• Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. The execution engine then executes the incoming tasks in the order of their dependencies.
Hive follows SQL-like queries (HQL), whereas Pig follows a data-flow language (Pig Latin).
2. SHOW DATABASES
The SHOW DATABASES statement lists all the databases present in Hive.
Syntax:
SHOW (DATABASES|SCHEMAS);

3. DESCRIBE DATABASE
• The DESCRIBE DATABASE statement in Hive shows the name of a database, its comment (if set), and its location on the file system.
• EXTENDED can be used to also get the database properties.
Syntax:
DESCRIBE (DATABASE|SCHEMA) [EXTENDED] db_name;

4. USE DATABASE
The USE statement in Hive selects the specific database for a session; all subsequent HiveQL statements are executed against it.
Syntax:
USE database_name;

5. DROP DATABASE
• The DROP DATABASE statement in Hive is used to drop (delete) a database.
• The default behavior is RESTRICT, which means the database is dropped only when it is empty. To drop a database together with its tables, we can use CASCADE.
Syntax:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

6. ALTER DATABASE
The ALTER DATABASE statement in Hive is used to change the metadata associated with a database.
Syntax:
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);

7. CREATE TABLE
The CREATE TABLE statement in Hive is used to create a table with the given name. If a table or view with the same name already exists, an error is thrown. We can use IF NOT EXISTS to skip the error.
Syntax:
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path];

8. SHOW TABLES
The SHOW TABLES statement in Hive lists all the base tables and views in the current database.
Syntax:
SHOW TABLES [IN database_name];

9. DESCRIBE TABLE
The DESCRIBE statement in Hive shows the list of columns for the specified table.
Syntax:
DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name[.col_name ( [.field_name])];

10. TRUNCATE TABLE
The TRUNCATE TABLE statement removes all rows from a table.
Syntax:
TRUNCATE TABLE table_name;

11. LOAD DATA
The LOAD DATA statement loads data from a file into a table (optionally into a specific partition).
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];
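A short worked sequence tying the statements above together (the database, table, and file names are made up for illustration):

CREATE DATABASE IF NOT EXISTS company;
USE company;
CREATE TABLE IF NOT EXISTS emp (id INT, name STRING, salary FLOAT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/emp.csv' OVERWRITE INTO TABLE emp;
SHOW TABLES;
DESCRIBE emp;
TRUNCATE TABLE emp;
DROP DATABASE company CASCADE;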
HBASE
Why HBase
HDFS is used to store, manage, and access data in Hadoop, but it can access data only in a sequential manner and performs only batch processing; hence we use HBase to access data more efficiently.
Features
• Horizontally scalable
• Integration with MapReduce
• Column-family-oriented database
• Automatic failure support
• Flexible schema
Data Models
Tables: Data is stored in a table format in HBase, but here the tables are in column-oriented format.
Row Key: Row keys are used to search records, which makes searches fast.
Column Qualifiers: Each column's name is known as its column qualifier.
Cell: Data is stored in cells. The data is dumped into cells that are uniquely identified by a row key and a column qualifier.
Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is stored with its timestamp. This makes it easy to search for a particular version of the data.
Data Operations
1. Get
The Get operation is similar to the SELECT statement of a relational database. It is used to fetch the content of an HBase table. We can execute the get command on the HBase shell as below.
Syntax: get 'table name', 'row key' <filters>
e.g.: get 'my_table', 'row1', {COLUMN=>'cf1:col1', TIMESTAMP=>ts}
2. Put
The Put operation is used to store data in an HBase table. It inserts (or overwrites) the value of a single cell, identified by a row key, column family, and column qualifier (see the shell example after this list).
3. Scan
The Scan operation is used to read multiple rows of a table. It is different from Get, in which we need to specify a set of rows to read. Using Scan we can iterate through a range of rows or all the rows in a table.
e.g.: scan 'table_name' [, {OPTIONS}]
You can add additional options to the scan command, such as specifying a range of rows or columns, using filters, and setting other scan parameters.
4. Delete
The Delete operation is used to delete a row or a set of rows from an HBase table.
HBase Architecture
HBase has four major components:
• HMaster Server
• HBase Region Server
• Regions
• Zookeeper
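For completeness (the table, column family, and values below are illustrative, reusing 'my_table' from the Get syntax above), the corresponding shell commands for Put and Delete look like this:

put 'my_table', 'row1', 'cf1:col1', 'value1'
delete 'my_table', 'row1', 'cf1:col1'
deleteall 'my_table', 'row1'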
HMaster
• HBase HMaster performs DDL operations (create and delete tables) and assigns regions to the Region Servers.
• It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
• It assigns regions to the Region Servers on startup and re-assigns regions during recovery and load balancing.
• It monitors all the Region Server instances in the cluster (with the help of Zookeeper) and performs recovery activities whenever any Region Server is down.
• It provides an interface for creating, deleting and updating tables.
Region Servers
• Each Region Server hosts various regions.
• The Region Server communicates with the client whenever the client makes a request.
Regions
• These are the tables that are split up and spread across the Region Servers.
Zookeeper
• Zookeeper is the coordination service that helps the HMaster monitor the Region Server instances in the cluster.
• To interact with HBase, users and applications can leverage a variety of client interfaces. The most common are the Java API, the REST API, and the Thrift API.
• These clients provide different levels of functionality, flexibility, and ease of use, allowing developers to choose the best fit for their specific needs and use cases, and they allow developers to perform CRUD (Create, Read, Update, Delete) operations on HBase tables.
UPDATE
• Modify existing data in HBase tables through put operations. HBase's versioning capabilities allow you to track changes over time (a shell illustration follows below).
DELETE
• Remove rows, columns, or entire tables from HBase as needed. Utilize administrative commands to manage the lifecycle of your data.
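A small sketch of update-via-put and a versioned read (reusing the 'emp' table and 'personal data' column family from the shell examples later in this handout; the values are made up):

put 'emp', '1', 'personal data:city', 'Delhi'
get 'emp', '1', {COLUMN => 'personal data:city', VERSIONS => 2}

The put overwrites the current value of the cell, and the get with VERSIONS => 2 returns the new value together with the previous one (provided the column family keeps more than one version).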
(i) Class HTable
HTable is an HBase client class that represents an HBase table.
Constructors:
1. HTable()
2. HTable(TableName tableName, ClusterConnection connection, ExecutorService pool) - Using this constructor, you can create an object to access an HBase table.
Methods:
4. HTableDescriptor getTableDescriptor() - Returns the table descriptor for this table.
5. byte[] getTableName() - Returns the name of this table.
(ii) Class Put
This class is used to perform Put operations for a single row. It belongs to the org.apache.hadoop.hbase.client package.
(iii) Class Get
This class is used to perform Get operations on a single row. This class belongs to the org.apache.hadoop.hbase.client package. (A short Java sketch using these classes follows below.)
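A minimal Java sketch showing these classes in use, assuming an older HBase client release in which the HTable(Configuration, String) constructor and Put.add(...) are still available (newer releases use Connection.getTable() and Put.addColumn() instead); the table, row, and column names reuse the 'emp' shell examples below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "emp");                  // open the 'emp' table

        // Put: write one cell (row "1", column family "personal data", qualifier "city")
        Put put = new Put(Bytes.toBytes("1"));
        put.add(Bytes.toBytes("personal data"), Bytes.toBytes("city"), Bytes.toBytes("hyderabad"));
        table.put(put);

        // Get: read the same cell back
        Get get = new Get(Bytes.toBytes("1"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("city"));
        System.out.println(Bytes.toString(value));

        table.close();
    }
}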
Reading data (shell) - the get command fetches a row or specific cells, for example:
hbase(main):012:0> get 'emp', '1'
hbase> get 'table name', 'rowid', {COLUMN => 'column family:column name'}
Deleting data (shell) - the delete command removes a specific cell, optionally at a given timestamp, for example:
hbase(main):006:0> delete 'emp', '1', 'personal data:city', 1417521848375
0 row(s) in 0.0060 seconds
Syntax for deleting all cells in a row:
deleteall '<table name>', '<row>'
(vi) VIEW TABLE:
The scan command is used to view the data in HTable. Using the scan command, you can get the table data.
2.) HBase REST Client
1. HTTP-based Interface
2. Asynchronous Operations
Example:
Retrieving personal data for an employee with ID 001 from the employee table.
get 'employee', '001', {COLUMN => 'personal_data'}
3.) Batch Processing (Apache Spark)
Example:
Using Apache Spark to count the number of employees in the employee table.
4.) Coprocessors
HBase's coprocessor framework allows you to extend its functionality by running custom code directly on the RegionServer, enabling advanced data processing and analysis.
Example:
Implementing a simple coprocessor to log every Put operation performed on the employee table.
Explanation:
Table Creation: Demonstrates how to create a table with specified column families.
Data Retrieval: Shows how to retrieve specific data from a table.
Batch Processing: Utilizes Apache Spark to process data in parallel from HBase.
Coprocessors: Illustrates a simple coprocessor that performs custom actions on Put operations.
PIG
• Pig is made up of two pieces:
1. The language used to express data flows, called Pig Latin.
2. The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
• Pig Latin offers the best of both SQL and MapReduce: high-level declarative querying combined with low-level procedural programming.
Pig vs. RDBMS (SQL)
• A Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation. SQL statements, by contrast, are a set of constraints that, taken together, define the output.
• Pig will operate on any source of tuples. The most common representation is a text file with tab-separated fields, and Pig provides a built-in load function for this format. We can define a schema at runtime, but it is optional. RDBMSs store data in tables, with tightly predefined schemas.
• In Pig there is no data import process: the data is loaded from the filesystem (usually HDFS) as the first step in the processing. An RDBMS requires a data import process to load the data.
• Pig Latin does not support random reads or queries in the order of tens of milliseconds, nor does it support random writes to update small portions of data; all writes are bulk. SQL, in contrast, supports random reads and queries.
• Pig Latin's ability to use UDFs (User Defined Functions) and streaming operators, together with Pig's nested data structures, makes Pig Latin more customizable than most SQL dialects, which are not very customizable.
Example query - find, for each sufficiently large category, the average pagerank of high-pagerank URLs in that category.
SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
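For comparison, a Pig Latin version of the same query written as a step-by-step data flow (a sketch; the relation and field names follow the SQL query above and are not taken from the handout):

good_urls  = FILTER urls BY pagerank > 0.2;
groups     = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output     = FOREACH big_groups GENERATE group, AVG(good_urls.pagerank);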
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:
Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line (an invocation example appears at the end of this section).
Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run and the -e option is not used. It is also possible to run Pig scripts from within Grunt using run and exec.
Embedded
You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java. For programmatic access to Grunt, use PigRunner.
Grunt
Grunt has line-editing facilities like those found in the bash shell. For instance, the Ctrl-E key combination will move the cursor to the end of the line. Grunt remembers command history, too, and you can recall lines in the history buffer using Ctrl-P or Ctrl-N (for previous and next) or, equivalently, the up or down cursor keys.
Another handy feature is Grunt's completion mechanism, which will try to complete Pig Latin keywords and functions when you press the Tab key. For example, consider the following incomplete line:
grunt> a = foreach b ge
If you press the Tab key at this point, ge will expand to generate, a Pig Latin keyword:
grunt> a = foreach b generate
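A concrete illustration of script-mode invocation (the -x flag selects the execution mode; the script name reuses max_temp.pig from the worked example below, and the -e one-liner is illustrative):

% pig -x local max_temp.pig
% pig -e "A = LOAD 'input/ncdc/micro-tab/sample.txt'; DUMP A;"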
An Example
Let's look at a simple example by writing a program in Pig Latin to calculate the maximum recorded temperature by year for the weather dataset.

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;

To explore what's going on, we'll use Pig's Grunt interpreter, which allows us to enter lines and interact with the program to understand what it's doing. Start up Grunt in local mode, then enter the first line of the Pig script:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);
For simplicity, the program assumes that the input is tab-delimited text, with each line having just year, temperature, and quality fields. The result of the LOAD operator is a relation, which is just a set of tuples. A tuple is just like a row of data in a database table, with multiple fields in a particular order. In this example, the LOAD function produces a set of (year, temperature, quality) tuples that are present in the input file. We write a relation with one tuple per line, where tuples are represented as comma-separated items in parentheses:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
Relations are given names, or aliases, so they can be referred to. This relation is given the records alias. We can examine the contents of an alias using the DUMP operator:
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
We can also see the structure of a relation - the relation's schema - using the DESCRIBE operator on the relation's alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
This tells us that records has three fields, with aliases year, temperature, and quality, which are the names we gave them in the AS clause. The fields have the types given to them in the AS clause, too. For this small dataset, no records are filtered out:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
The third statement uses the GROUP function to group the records relation by the year field. Let's use DUMP to see what it produces:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
We now have two rows, or tuples, one for each year in the input data. The first field in each tuple is the field being grouped by (the year), and the second field is a bag of tuples for that year. A bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.
By grouping the data in this way, we have created a row per year, so now all that remains is to find the maximum temperature for the tuples in each bag. Before we do this, let's understand the structure of the grouped_records relation:
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {year: chararray,temperature: int,quality: int}}
This tells us that the grouping field is given the alias group by Pig, and the second field has the same structure as the filtered_records relation that was being grouped. With this information, we can try the fourth transformation:
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row. In this example, the first field is group, which is just the year. The second field is a little more complex. The filtered_records.temperature reference is to the temperature field of the filtered_records bag in the grouped_records relation. MAX is a built-in function for calculating the maximum value of fields in a bag. In this case, it calculates the maximum temperature for the tuples in each filtered_records bag. Let's check the result:
grunt> DUMP max_temp;
(1949,111)
(1950,22)
So we've successfully calculated the maximum temperature for each year.

Pig Data Model
• Atom - Simple atomic value (i.e., a number or string)
• Tuple - Sequence of fields; each field can be of any type
• Bag - Collection of tuples
  - Duplicates possible
  - Tuples in a bag can have different field lengths and field types
• Map - Collection of key-value pairs. The key is an atom; the value can be of any type (a small example follows this list)
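A small illustration of these types written as Pig Latin constants (the relation names are hypothetical; the constant syntax follows Pig's tuple, bag, and map literal forms):

grunt> X = FOREACH A GENERATE ('hadoop', 1), {('a'),('b')}, ['lang'#'pig'];

Here ('hadoop', 1) is a tuple of two atoms, {('a'),('b')} is a bag of two tuples, and ['lang'#'pig'] is a map whose key is an atom.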
Compilation
The Pig system does two tasks:
• Builds a Logical Plan from a Pig Latin script (this supports execution platform independence).
• Compiles the Logical Plan into a Physical Plan and executes it.

Building a Logical Plan
• Verify that the input files and bags referred to are valid.
• Create a logical plan for each bag defined.
• No processing of data is performed at this stage.

Building a Logical Plan - Example
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city == 'kitchener' OR city == 'waterloo';
STORE D INTO 'local_user_count.dat';
(Logical plan diagram: Load(user.dat) -> Filter -> Group -> Foreach)

Building a Physical Plan
Step 1: Create a map-reduce job.
Step 2: Push commands into the map and reduce functions where possible.
Building a physical plan happens only when output is specified by STORE or DUMP.
(Physical plan diagram: Load(user.dat) -> Map: Filter -> Group -> Reduce: Foreach)
Compilation and Execution of a Pig Latin Script
• When the Pig Latin interpreter sees the first line containing the LOAD statement, it confirms that it is syntactically and semantically correct and adds it to the logical plan, but it does not load the data from the file (or even check whether the file exists).
• Pig validates the GROUP and FOREACH...GENERATE statements, and adds them to the logical plan without executing them.
• The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed.
• The physical plan that Pig prepares is a series of MapReduce jobs, which in local mode Pig runs in the local JVM, and in MapReduce mode Pig runs on a Hadoop cluster.
• We can see the logical and physical plans created by Pig using the EXPLAIN command on a relation (EXPLAIN max_temp; for example).
• EXPLAIN will also show the MapReduce plan, which shows how the physical operators are grouped into MapReduce jobs.
• This is a good way to find out how many MapReduce jobs Pig will run for your query.

Syntax and Semantics of Pig Latin
Pig Latin Program Structure
• A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command, e.g.:
grouped_records = GROUP records BY year;
ls /
• Statements are usually terminated with a semicolon.
• Pig Latin has two forms of comments. Double hyphens introduce single-line comments:
-- My program
DUMP A; -- What's in A?
• C-style comments are more flexible since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:
/*
 * Description of my program spanning
 * multiple lines.
 */
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
DUMP C;
Pig provides built-in functions TOTUPLE, TOBAG and TOMAP, which are used for turning expressions into tuples,
bags and maps.
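For instance (the relation and field names are hypothetical), these functions can be used inside a FOREACH:

grunt> B = FOREACH A GENERATE TOTUPLE(name, age), TOBAG(name, age), TOMAP('name', name);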
Schemas
A relation in Pig may have an associated schema, which gives the fields in the relation names and types. The AS clause in a LOAD statement is used to attach a schema to a relation:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}
It's possible to omit type declarations completely, too:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature, quality);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: bytearray,quality: bytearray}
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: int,quality: int}
Fields in a relation with no schema can be referenced only using positional notation: $0 refers to the first field in a relation, $1 to the second, and so on. Their types default to bytearray:
grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
grunt> DUMP projected_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE projected_records;
projected_records: {bytearray,bytearray,bytearray}
Eval function
A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an example of an aggregate function. (An example combining an eval function with a filter function follows these definitions.)
Filter function
A special type of eval function that returns a logical boolean result. Filter functions are used in the FILTER operator to remove unwanted
rows. They can also be used in other relational operators that take boolean conditions and, in general, expressions using boolean or
conditional expressions. An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.
Load function
A function that specifies how to load data into a relation from external storage.
Store function
A function that specifies how to save the contents of a relation to external storage.
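A brief sketch combining an eval function and a filter function, reusing the relations from the max_temp walkthrough above (the non-empty-bag filter is added here purely for illustration):

grunt> non_empty = FILTER grouped_records BY NOT IsEmpty(filtered_records);
grunt> yearly_max = FOREACH non_empty GENERATE group, MAX(filtered_records.temperature);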
JOIN - Let's look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
An outer join keeps rows that have no match; here a left outer join keeps the unmatched row from A:
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
(1,Scarf,,)
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
GROUP
The GROUP statement groups the data in a single relation. GROUP supports grouping by more than equality of keys: you can use an expression or user-defined function as the group key. For example, consider the following relation A:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
Let's group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(Ali,apple),(Eve,apple)})
(6,{(Joe,cherry),(Joe,banana)})
GROUP creates a relation whose first field is the grouping field, which is given the alias group. The second field is a bag containing the grouped fields with the same schema as the original relation (in this case, A).
Sorting Data
Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
There is no guarantee which order the rows will be processed in. In particular, when retrieving the contents of A using DUMP or STORE, the rows may be written in any order. If we want to impose an order on the output, we can use the ORDER operator to sort a relation by one or more fields. The default sort order compares fields of the same type using the natural ordering, and different types are given an arbitrary, but deterministic, ordering (a tuple is always "less than" a bag, for example). The following example sorts A by the first field in ascending order and by the second field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
Any further processing on a sorted relation is not guaranteed to retain its order. For example:
grunt> C = FOREACH B GENERATE *;
Even though relation C has the same contents as relation B, its tuples may be emitted in any order by a DUMP or a STORE. It is for this reason that it is usual to perform the ORDER operation just before retrieving the output.
The SPLIT operator is the opposite of UNION; it partitions a relation into two or more relations, as in the sketch below.
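A short sketch, reusing the small (2,3)/(1,2)/(2,4) relation from the sorting example above (the split condition is chosen arbitrarily):

grunt> SPLIT A INTO ones IF $0 == 1, others IF $0 != 1;
grunt> DUMP ones;
(1,2)
grunt> DUMP others;
(2,3)
(2,4)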