Unit 5 Handouts
CCS334 Big Data Analytics


HIVE
• Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was originally developed by Facebook.
• Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
• Using Hive, we can avoid the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).

Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types such as plain text, RCFile, and HBase.
• It uses indexing to accelerate queries.
• It can operate on compressed data stored in the Hadoop ecosystem.
• It supports user-defined functions (UDFs), through which users can plug in their own functionality.

Hive Architecture

Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients such as:
• Thrift Server - A cross-language service provider platform that serves requests from all programming languages that support Thrift.
• JDBC Driver - Used to establish a connection between Hive and Java applications. The JDBC driver is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver. (A usage sketch is given after the list of Hive services below.)
• ODBC Driver - Allows applications that support the ODBC protocol to connect to Hive.

Hive Services
The following are the services provided by Hive:
• Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
• Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
• Hive MetaStore - A central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata for each column and its type, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
• Hive Server - Also referred to as Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
• Hive Driver - Receives queries from different sources such as the web UI, CLI, Thrift, and JDBC/ODBC drivers, and transfers them to the compiler.
• Hive Compiler - Parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
• Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of MapReduce tasks and HDFS tasks. The execution engine then executes the tasks in the order of their dependencies.
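To illustrate the JDBC client above, here is a minimal Java sketch. It assumes the legacy driver class named in the handout, a Hive server listening on localhost:10000, and an existing employee table; HiveServer2 installations use org.apache.hive.jdbc.HiveDriver with a jdbc:hive2:// URL instead.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Register the legacy Hive JDBC driver class named in the handout.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        // Host, port, and database are illustrative assumptions.
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Run a simple HQL query; the table name is illustrative.
        ResultSet rs = stmt.executeQuery("SELECT name, salary FROM employee LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
        con.close();
    }
}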

Limitations of Hive
• Hive is not capable of handling real-time data.
• It is not designed for online transaction processing.
• Hive queries have high latency.

Differences between Hive and SQL

On the basis of | SQL | HiveQL
Update commands in table structure | UPDATE, DELETE, INSERT | UPDATE, DELETE, INSERT
Manages | Relational data | Data structures
Transactions | Supported | Limited support
Indexes | Supported | Supported
Data types | Contains a total of five data types: integral, floating-point, fixed-point, text and binary strings, temporal | Contains Boolean, integral, floating-point, fixed-point, timestamp (nanosecond precision), Date, text and binary strings, temporal, array, map, struct, union
Functions | Hundreds of built-in functions | Hundreds of built-in functions
MapReduce | Not supported | Supported
Multi-table inserts | Not supported | Supported
SELECT command | Supported | Supported, with a SORT BY clause for partial ordering and LIMIT to restrict the number of rows returned
Joins | Supported | Inner joins, outer joins, semi joins, map joins, cross joins
Subqueries | Supported | Only used in FROM, WHERE, or HAVING clauses
Views | Can be updated | Read-only (cannot be updated)

Differences between Hive and Pig

Hive | Pig
Hive is commonly used by data analysts. | Pig is commonly used by programmers.
It follows SQL-like queries. | It follows a data-flow language.
It can handle structured data. | It can handle semi-structured data.
It works on the server side of an HDFS cluster. | It works on the client side of an HDFS cluster.
Hive is slower than Pig. | Pig is comparatively faster than Hive.


HIVE Data Types
Hive data types are categorized into numeric types, string types, misc types, and complex types. A list of Hive data types is given below.

Integer Types
Type | Size | Range
TINYINT | 1-byte signed integer | -128 to 127
SMALLINT | 2-byte signed integer | -32,768 to 32,767
INT | 4-byte signed integer | -2,147,483,648 to 2,147,483,647
BIGINT | 8-byte signed integer | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Type
Type | Size | Description
FLOAT | 4-byte | Single-precision floating-point number
DOUBLE | 8-byte | Double-precision floating-point number

Date/Time Types
TIMESTAMP
• It supports the traditional UNIX timestamp with optional nanosecond precision.
• As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
• As a floating-point numeric type, it is interpreted as a UNIX timestamp in seconds with decimal precision.
• As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal places of precision).

DATES
The DATE value is used to specify a particular year, month, and day in the form YYYY-MM-DD; it does not store the time of day. The range of the DATE type lies between 0000-01-01 and 9999-12-31.

String Types
STRING
A string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
VARCHAR
The varchar is a variable-length type whose length lies between 1 and 65535, which specifies the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length type whose maximum length is fixed at 255.

Complex Types
Struct - Similar to a C struct or an object, where fields are accessed using "dot" notation. Example: struct('James','Roy')
Map - Contains key-value tuples, where the fields are accessed using array notation. Example: map('first','James','last','Roy')
Array - A collection of values of the same type that are indexable using zero-based integers. Example: array('James','Roy')

Hive Different File Formats
Different file formats and compression codecs work better for different data sets in Apache Hive. The following are the Apache Hive file formats:
• Text File
• Sequence File
• RC File (Row Columnar)
• AVRO File
• ORC File (Optimized Row Columnar)
• Parquet File (Column Oriented)

Hive Text File Format
• The Hive text file format is the default storage format.
• You can use the text format to interchange data with other client applications.
• The text file format is very common in most applications.
• Data is stored in lines, with each line being a record.
• Each line is terminated by a newline character (\n).

Hive Sequence File Format
• Sequence files are Hadoop flat files that store values in binary key-value pairs.
• Sequence files are in binary format, and they are splittable.
• The main advantage of using sequence files is the ability to merge two or more files into one file.

Hive RC File Format
• RC File stands for Row Columnar file format.
• This is another form of Hive file format, which offers a high row-level compression rate.
• If we have a requirement to process multiple rows at a time, then the RC File format can be used.

Hive AVRO File Format
• AVRO is an open source framework that provides data serialization and data exchange services for Hadoop.
• We can exchange data between the Hadoop ecosystem and programs written in any programming language.
• Avro is one of the popular file formats in Big Data Hadoop based applications.

Hive ORC File Format
• ORC stands for Optimized Row Columnar file format.
• The ORC file format provides a highly efficient way to store data in Hive tables.
• This file format was designed to overcome the limitations of the other Hive file formats.
• The use of ORC files improves performance when Hive is reading, writing, and processing data from large tables.

Hive Parquet File Format
• Parquet is a column-oriented binary file format.
• Parquet is highly efficient for large-scale queries.
• Parquet is especially good for queries scanning particular columns within a particular table.
• Parquet tables use Snappy or gzip compression; currently Snappy by default.

Hive DDL commands
Hive DDL commands are the statements used for defining and changing the structure of a table or database in Hive. They are used to build or modify the tables and other objects in the database.

DDL Command | Used With
CREATE | Database, Table
SHOW | Databases, Tables, Table Properties, Partitions, Functions, Index
DESCRIBE | Database, Table, View
USE | Database
DROP | Database, Table
ALTER | Database, Table
TRUNCATE | Table

1. CREATE DATABASE
The CREATE DATABASE statement is used to create a database in Hive. DATABASE and SCHEMA are interchangeable; we can use either.
Syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
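Example (a minimal sketch of the syntax above; the database name, comment, and property are illustrative, not from the handout):
CREATE DATABASE IF NOT EXISTS sales_db
COMMENT 'Retail sales data'
WITH DBPROPERTIES ('owner'='analytics');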


2. SHOW DATABASES
The SHOW DATABASES statement lists all the databases present in Hive.
Syntax:
SHOW (DATABASES|SCHEMAS);

3. DESCRIBE DATABASE
• The DESCRIBE DATABASE statement in Hive shows the name of the database, its comment (if set), and its location on the file system.
• EXTENDED can be used to get the database properties.
Syntax:
DESCRIBE (DATABASE|SCHEMA) [EXTENDED] db_name;

4. USE DATABASE
The USE statement in Hive selects the specific database for a session; all subsequent HiveQL statements are executed against it.
Syntax:
USE database_name;

5. DROP DATABASE
• The DROP DATABASE statement in Hive is used to drop (delete) the database.
• The default behavior is RESTRICT, which means that the database is dropped only when it is empty. To drop a database together with its tables, we can use CASCADE.
Syntax:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

6. ALTER DATABASE
The ALTER DATABASE statement in Hive is used to change the metadata associated with a database.
Syntax:
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);

7. CREATE TABLE
The CREATE TABLE statement in Hive is used to create a table with the given name. If a table or view with the same name already exists, an error is thrown. We can use IF NOT EXISTS to skip the error. (A combined example is given after item 9 below.)
Syntax:
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path];

8. SHOW TABLES
The SHOW TABLES statement in Hive lists all the base tables and views in the current database.
Syntax:
SHOW TABLES [IN database_name];

9. DESCRIBE TABLE
The DESCRIBE statement in Hive shows the list of columns for the specified table.
Syntax:
DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name[.col_name ([.field_name])];
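Example (a hedged sketch combining items 7-9; the database, table, and column names are illustrative, and STORED AS ORC or PARQUET could be used in place of TEXTFILE for the columnar formats described earlier):
CREATE TABLE IF NOT EXISTS sales_db.employee (
  emp_id INT COMMENT 'Employee id',
  name STRING,
  salary DOUBLE,
  skills ARRAY<STRING>
)
COMMENT 'Employee master table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
SHOW TABLES IN sales_db;
DESCRIBE FORMATTED sales_db.employee;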

10. DROP TABLE
• The DROP TABLE statement in Hive deletes the data for a particular table and removes all the metadata associated with it from the Hive metastore.
• If PURGE is not specified, then the data is moved to the .Trash/current directory. If PURGE is specified, then the data is lost completely.
Syntax:
DROP TABLE [IF EXISTS] table_name [PURGE];

11. ALTER TABLE
The ALTER TABLE statement in Hive enables you to change the structure of an existing table. Using the ALTER TABLE statement we can rename the table, add columns to the table, change the table properties, and so on.
Syntax to rename a table:
ALTER TABLE table_name RENAME TO new_table_name;

12. TRUNCATE TABLE
The TRUNCATE TABLE statement in Hive removes all the rows from the table or partition.
Syntax:
TRUNCATE TABLE table_name;

Hive DML commands
• Hive DML (Data Manipulation Language) commands are used to insert, update, retrieve, and delete data from a Hive table once the table and database schema have been defined using Hive DDL commands.
• The various Hive DML commands are:
1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT

1. LOAD Command
• The LOAD statement in Hive is used to move data files into the locations corresponding to Hive tables.
• If the LOCAL keyword is specified, the LOAD command will look for the file path in the local filesystem.
• If the LOCAL keyword is not specified, then Hive will need the absolute URI of the file.
• If the OVERWRITE keyword is specified, the contents of the target table/partition are deleted and replaced by the files referred to by filepath.
• If the OVERWRITE keyword is not specified, the files referred to by filepath are appended to the table.
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];
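Example (a hedged sketch of the LOAD syntax; the local file path and table are illustrative):
LOAD DATA LOCAL INPATH '/tmp/employee.tsv' OVERWRITE INTO TABLE sales_db.employee;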

2. SELECT Command
The SELECT statement in Hive is similar to the SELECT statement in SQL and is used for retrieving data from the database.
Syntax:
SELECT col1, col2 FROM tablename;

3. INSERT Command
The INSERT command in Hive loads data into a Hive table. We can insert into either a Hive table or a partition.

a. INSERT INTO
The INSERT INTO statement appends data to the existing data in the table or partition. INSERT INTO works from Hive version 0.8.
Syntax:
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

b. INSERT OVERWRITE
The INSERT OVERWRITE statement overwrites the existing data in the table or partition.
Syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, ...) [IF NOT EXISTS]] select_statement FROM from_statement;

4. DELETE Command
• The DELETE statement in Hive deletes the table data. If the WHERE clause is specified, then it deletes the rows that satisfy the condition in the WHERE clause.
• The DELETE statement can only be used on Hive tables that support ACID.
Syntax:
DELETE FROM tablename [WHERE expression];

5. UPDATE Command
• The UPDATE statement can be performed only on Hive tables that support ACID.
• The UPDATE statement in Hive updates the table data. If the WHERE clause is specified, then it updates the columns of the rows that satisfy the condition in the WHERE clause.
• Partitioning and bucketing columns cannot be updated.
Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression];

6. EXPORT Command
• The Hive EXPORT statement exports the table or partition data along with the metadata to the specified output location in HDFS.
• Metadata is exported in a _metadata file, and data is exported in a 'data' subdirectory.
Syntax:
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
TO 'export_target_path' [FOR replication('eventid')];
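Example (a hedged sketch of the INSERT, UPDATE, and DELETE statements above; table and column names are illustrative, and UPDATE/DELETE assume the target table was created as a transactional (ACID) table):
INSERT INTO TABLE sales_db.employee SELECT emp_id, name, salary, skills FROM sales_db.employee_staging;
UPDATE sales_db.employee SET salary = salary * 1.1 WHERE emp_id = 101;
DELETE FROM sales_db.employee WHERE emp_id = 999;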


HiveQL Queries
• HiveQL has a SQL-like syntax used for summarizing and querying large chunks of data through the Hadoop environment.
• Hive is widely used by big data professionals to alter, create, and drop tables, databases, views, or user-defined functions.
• Some of the data definition language (DDL) statements used to load data and modify it in the database are CREATE, ALTER, SHOW, DESCRIBE, DESCRIBE FORMATTED, DROP, and TRUNCATE.

1. HiveQL query for the information_schema database
• Hive queries can be written to get information about Hive privileges, tables, views, or columns.
• information_schema data is a read-only and user-friendly way to know the state of the system, similar to the sys database data.
Code:
Select * from information_schema.columns where table_schema = 'database_name'

2. Creation and loading of data into a table
The bulk load operation is used to insert data into managed tables, as Hive does not support row-level insert, delete, or update.
Code:
LOAD DATA LOCAL INPATH '$Home/students_address' OVERWRITE INTO TABLE students
PARTITION (class = "12", section = "science");

3. Merging data in tables
Data can be merged from tables using classic SQL joins like inner, full outer, left, and right join.
Code:
Select a.roll_number, class, section from students as a
inner join pass_table as b
on a.roll_number = b.roll_number

4. Ordering a table
The ORDER BY clause enables total ordering of the data set by passing all data through one reducer. This may take a long time for large data tables, so the SORT BY clause can be used instead to achieve partial sorting, by sorting the output of each reducer.
Code:
Select customer_id, spends from customer as a order by spends DESC limit 100

5. Aggregation of data in a table
• Aggregation is done using aggregate functions, which return a single value after doing computation on many rows. These include count(col), sum(col), avg(col), min(col), max(col), stddev_pop(col), percentile_approx(int_expr, P, NB) (where NB is the number of histogram bins used for estimation), and collect_set(col), which returns a set of the values in the column with duplicate elements removed.
• The set property that helps in improving the performance of aggregation is hive.map.aggr = true.
• The GROUP BY clause is used with an aggregate function.
Code:
Select year(date_yy), avg(spends) from customer_spends where merchant = "Retail" group by year(date_yy)

6. Conditional statements
The CASE...WHEN...THEN clause is similar to if-else statements and performs a conditional operation on any column in a query.
Code:
Select customer,
Case when percentage < 40 then "Fail"
When percentage >= 40 and percentage < 80 then "Average"
Else "Excellent"
End as rank From students;

7. Filtering of data
The WHERE clause is used to filter data in HiveQL. LIKE is used along with the WHERE clause as a predicate operator to match a regular expression in a record.

HBase
HBase is a data model used to store large amounts of structured data. It is an open source, distributed database developed by the Apache Software Foundation and written in Java.
HBase is an essential part of the Hadoop ecosystem. HBase runs on top of HDFS (Hadoop Distributed File System) and stores huge amounts of data in a tabular format.

Why HBase
HDFS is used to store, manage, and access data in Hadoop, but it can access data only in a sequential manner and performs only batch processing; hence we use HBase to access data more efficiently.

Features
• Horizontally scalable
• Integration with MapReduce
• Column-family-oriented database
• Automatic failure support
• Flexible schema


Data Models
Tables: Data is stored in table format in HBase, but here tables are in column-oriented format.

Row Key: Row keys are used to search records, which makes searches fast.

Column Families: Various columns are combined into a column family. These column families are stored together, which makes the searching process faster, because data belonging to the same column family can be accessed together in a single seek.

Column Qualifiers: Each column's name is known as its column qualifier.

Cell: Data is stored in cells. The data is dumped into cells, which are specifically identified by row key and column qualifiers.

Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is stored with its timestamp. This makes it easy to search for a particular version of the data.

Data model operations
The major data model operations are Get, Put, Scan, and Delete. Using these operations we can read, write, and delete records from a table.

1. Get
The Get operation is similar to the SELECT statement of a relational database. It is used to fetch the content of an HBase table.
We can execute the Get command on the HBase shell as below.
Syntax: get 'table name', 'row key' <filters>
e.g.: get 'my_table', 'row1', {COLUMN=>'cf1:col1', TIMESTAMP=>ts}

2. Put
The Put operation is used to write data into a row of an HBase table, identified by its row key, column family, and column qualifier.
e.g.: put 'table_name', 'row_key', 'column_family:column_qualifier', 'value'

3. Scan
The Scan operation is used to read multiple rows of a table. It is different from Get, in which we specify a single row to read; using Scan we can iterate through a range of rows or all the rows in a table.
e.g.: scan 'table_name' [, {OPTIONS}]
You can add additional options to the scan command, such as specifying a range of rows or columns, using filters, and setting other scan parameters.

4. Delete
The Delete operation is used to delete a row or a set of rows from an HBase table.
e.g.: delete 'table_name', 'row_key', 'column_family:column_qualifier' [, timestamp]

The various types of internal delete markers are as below:
Delete: used for a specific version of a column.
Delete column: used for all versions of a column.
Delete family: used for all columns of a particular column family.

HBase Architecture
HBase has four major components:
• HMaster Server
• HBase Region Server
• Regions
• Zookeeper


Zookeeper
• Zookeeper acts as a coordinator inside the HBase distributed environment.
• It helps in maintaining server state inside the cluster by communicating through sessions.
• Every Region Server, along with the HMaster Server, sends a continuous heartbeat at regular intervals to Zookeeper, which checks which servers are alive and available.
• It also provides server failure notifications so that recovery measures can be executed.

HMaster
• HBase HMaster performs DDL operations (create and delete tables) and assigns regions to the Region Servers.
• It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
• It assigns regions to the Region Servers on startup and re-assigns regions to Region Servers during recovery and load balancing.
• It monitors all the Region Server instances in the cluster (with the help of Zookeeper) and performs recovery activities whenever any Region Server is down.
• It provides an interface for creating, deleting and updating tables.

Region Servers
• Each Region Server has various regions in it.
• The Region Server communicates with the client whenever the client makes a request.

Regions
Regions are tables that are split up and spread across the servers.

Introduction to HBase Clients
• HBase is a powerful NoSQL database that is part of the Apache Hadoop ecosystem.
• To interact with HBase, users and applications can leverage a variety of client interfaces.
• These clients provide different levels of functionality, flexibility, and ease of use, allowing developers to choose the best fit for their specific needs and use cases.

HBase Clients
• HBase provides several client interfaces to interact with the database.
• The most common are the Java API, the REST API, and the Thrift API.
• These clients allow developers to perform CRUD (Create, Read, Update, Delete) operations on HBase tables.
• Some of the HBase clients are:
1. Java client
2. REST client
3. Thrift client

HBase Operations (Create, Read, Update, Delete)
CREATE - Define the table schema, including column families and data types. Use the HBase shell or API to create new tables and insert data.
READ - Perform scans and get operations to retrieve data from HBase. Leverage filters and other query options to refine your searches.
UPDATE - Modify existing data in HBase tables through put operations. HBase's versioning capabilities allow you to track changes over time.
DELETE - Remove rows, columns, or entire tables from HBase as needed. Utilize administrative commands to manage the lifecycle of your data.
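A short HBase shell sketch of the four operations above (the table, row key, column family, and values are illustrative, not from the handout):
create 'customer', 'info'
put 'customer', 'row1', 'info:name', 'Asha'
get 'customer', 'row1'
scan 'customer'
delete 'customer', 'row1', 'info:name'
disable 'customer'
drop 'customer'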


1.) Java Client
This section describes the Java client API for HBase, which is used to perform CRUD operations on HBase tables. HBase is written in Java and has a Java native API. Therefore it provides programmatic access to Data Manipulation Language (DML).

(i) Class HTable
HTable is an HBase internal class that represents an HBase table. It is an implementation of Table that is used to communicate with a single HBase table. This class belongs to the org.apache.hadoop.hbase.client package.

Constructors:
S.No. | Constructors and Description
1 | HTable()
2 | HTable(TableName tableName, ClusterConnection connection, ExecutorService pool) - Using this constructor, you can create an object to access an HBase table.

Methods and description:
S.No. | Methods and Description
1 | void close() - Releases all the resources of the HTable.
2 | void delete(Delete delete) - Deletes the specified cells/row.
3 | void put(Put put) - Using this method, you can insert data into the table.
4 | HTableDescriptor getTableDescriptor() - Returns the table descriptor for this table.
5 | byte[] getTableName() - Returns the name of this table.

(ii) Class Put
This class is used to perform Put operations for a single row. It belongs to the org.apache.hadoop.hbase.client package.

Constructors:
S.No. | Constructors and Description
1 | Put(byte[] row) - Using this constructor, you can create a Put operation for the specified row.
2 | Put(byte[] rowArray, int rowOffset, int rowLength) - Using this constructor, you can make a copy of the passed-in row key to keep local.

Methods and description:
S.No. | Methods and Description
1 | Put add(byte[] family, byte[] qualifier, byte[] value) - Adds the specified column and value to this Put operation.
2 | Put add(byte[] family, byte[] qualifier, long ts, byte[] value) - Adds the specified column and value, with the specified timestamp.

(iii) Class Get
This class is used to perform Get operations on a single row. It belongs to the org.apache.hadoop.hbase.client package.

Constructors:
S.No. | Constructors and Description
1 | Get(byte[] row) - Using this constructor, you can create a Get operation for the specified row.
2 | Get(Get get)

Methods and description:
S.No. | Methods and Description
1 | Get addColumn(byte[] family, byte[] qualifier) - Retrieves the column from the specific family with the specified qualifier.
2 | Get addFamily(byte[] family) - Retrieves all columns from the specified family.
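A hedged Java sketch that ties the HTable, Put, and Get classes above together, using the emp table and 'personal data' column family from the shell examples that follow. It assumes an older HBase release in which the HTable constructor and Put.add() shown above are available (newer releases use Connection/Table and Put.addColumn()), and that hbase-site.xml is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration.
        Configuration conf = HBaseConfiguration.create();
        // Open the (assumed existing) 'emp' table.
        HTable table = new HTable(conf, "emp");
        // Insert one cell: row '1', column family 'personal data', qualifier 'city'.
        Put put = new Put(Bytes.toBytes("1"));
        put.add(Bytes.toBytes("personal data"), Bytes.toBytes("city"), Bytes.toBytes("hyderabad"));
        table.put(put);
        // Read the same cell back.
        Get get = new Get(Bytes.toBytes("1"));
        get.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("city"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("city"))));
        // Release the resources held by the HTable.
        table.close();
    }
}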

(iv) CREATE DATA:
To create data in an HBase table, the following commands and methods are used:
• the put command,
• the add() method of the Put class, and
• the put() method of the HTable class.

Syntax for put:
put '<table name>','<row>','<colfamily:colname>','<value>'

Example for put:
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds

(v) UPDATE DATA:
To update an existing cell value, the put command is used with the following syntax.

Syntax for update using put:
put '<table name>','<row>','<column family:column name>','<new value>'

Example for update using put:
hbase(main):002:0> put 'emp','row1','personal:city','Delhi'
0 row(s) in 0.0400 seconds


(vi) READ DATA:
The get command and the get() method of the HTable class are used to read data from a table in HBase. Using the get command, you can get a single row of data at a time.

Syntax for get:
get '<table name>','<row>'

Example for get:
hbase(main):012:0> get 'emp', '1'

Syntax for reading a specific column using get:
hbase> get '<table name>', '<rowid>', {COLUMN => '<column family:column name>'}

(vii) DELETE DATA:
Using the delete command, you can delete a specific cell in a table.

Syntax for delete command:
delete '<table name>', '<row>', '<column name>', '<time stamp>'

Example for delete command:
hbase(main):006:0> delete 'emp', '1', 'personal data:city', 1417521848375
0 row(s) in 0.0060 seconds

Syntax for deleting all cells in a row:
deleteall '<table name>', '<row>'

(viii) VIEW TABLE:
The scan command is used to view the data in an HTable. Using the scan command, you can get the table data.

Syntax for scan command:
scan '<table name>'

Example for scan command:
hbase(main):010:0> scan 'emp'

2.) HBase REST Client
1. HTTP-based interface
2. Flexibility and interoperability
3. Ease of use

3.) HBase Thrift Client
1. Language-independent
2. Asynchronous operations
3. Security and authentication

Choosing the Right HBase Client

Client | Use Cases | Strengths | Limitations
Java Client | Custom, high-performance applications that require precise control over HBase operations. | Low-level API, batch processing, high performance. | Requires Java expertise; may be more complex for some use cases.
REST Client | Web-based applications, mobile apps, and other systems that can communicate over HTTP. | Flexibility, interoperability, ease of use. | May have higher latency compared to the Java client; limited to HTTP-based interactions.
Thrift Client | Polyglot systems that need to leverage HBase's capabilities from various programming languages. | Language-independent, asynchronous operations, security features. | May have slightly higher latency compared to the Java client; requires Thrift protocol expertise.

HBase Examples

1) Table Creation
HBase provides a flexible data model that allows you to create tables with dynamic column families and store both structured and unstructured data.
Example:
Creating a table named employee in HBase with two column families: (i) personal_data and (ii) work_data.
create 'employee', 'personal_data', 'work_data'


2) Data Retrieval
HBase supports efficient data scanning and querying through its row-oriented storage and powerful APIs for reading, writing, and deleting data.
Example:
Retrieving personal data for an employee with ID 001 from the employee table.
get 'employee', '001', {COLUMN => 'personal_data'}

3) Batch Processing
HBase's batch processing capabilities enable high-throughput data ingestion and processing, making it well-suited for big data workloads.
Example:
Using Apache Spark to count the number of employees in the employee table.

4) Coprocessors
HBase's coprocessor framework allows you to extend its functionality by running custom code directly on the RegionServer, enabling advanced data processing and analysis.
Example:
Implementing a simple coprocessor to log every Put operation performed on the employee table (a hedged code sketch follows the explanation below).

Explanation:
Table Creation: Demonstrates how to create a table with specified column families.
Data Retrieval: Shows how to retrieve specific data from a table.
Batch Processing: Utilizes Apache Spark to process data in parallel from HBase.
Coprocessors: Illustrates a simple coprocessor that performs custom actions on Put operations.
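A hedged sketch of the Put-logging coprocessor mentioned in item 4, assuming the HBase 1.x coprocessor API (BaseRegionObserver); HBase 2.x replaces it with default methods on the RegionObserver interface. The class name is illustrative, and the class would typically be packaged in a jar and attached to the employee table with the shell's alter command.

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Logs every Put applied to the table this coprocessor is attached to.
public class PutLoggingObserver extends BaseRegionObserver {
    private static final Log LOG = LogFactory.getLog(PutLoggingObserver.class);

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // Runs on the RegionServer before each Put is applied.
        LOG.info("Put received for row " + Bytes.toStringBinary(put.getRow()));
    }
}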

Pig
A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

• Pig is made up of two pieces:
1. The language used to express data flows, called Pig Latin.
2. The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.

• Pig Latin offers the best of both SQL and MapReduce, combining high-level declarative querying with low-level procedural programming.

Differences between Pig Latin and RDBMS

Pig Latin | RDBMS
Pig Latin is a data flow programming language. | SQL is a declarative programming language.
A Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation. | SQL statements are a set of constraints that, taken together, define the output.
It will operate on any source of tuples. The most common representation is a text file with tab-separated fields, and Pig provides a built-in load function for this format. We can define a schema at runtime, but it's optional. | RDBMSs store data in tables, with tightly predefined schemas.
There is no data import process to load the data. The data is loaded from the filesystem (usually HDFS) as the first step in the processing. | There is a data import process to load the data into the RDBMS.


Differences between Pig Latin and RDBMS

Pig Latin | RDBMS
Pig Latin supports complex, nested data structures. | Operates on flatter data structures.
Pig Latin does not support random reads or queries in the order of tens of milliseconds. Nor does it support random writes to update small portions of data; all writes are bulk. | Supports random reads and queries.
Pig Latin's ability to use UDFs (User Defined Functions) and streaming operators, together with Pig's nested data structures, makes Pig Latin more customizable than most SQL versions. | Not very customizable.
Pig Latin does not have features to support online, low-latency queries such as transactions and indexes. | RDBMSs have features to support online, low-latency queries such as transactions and indexes.

Pig Latin Example
Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-pagerank URLs in that category.

SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6

Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

Pig Components
• Pig Latin (The Language)
  - Data Structures
  - Commands
• Pig (The Compiler)
  - Logical & Physical Plans
  - Optimization
  - Efficiency
• Grunt (The Interpreter)
• Pig Pen (The Debugger)

[Diagram: a Pig script and user-defined functions are expressed as Pig Latin statements, which Pig compiles and optimizes into Map-Reduce jobs that read data and write results.]


Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:

Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.

Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run and the -e option is not used. It is also possible to run Pig scripts from within Grunt using run and exec.

Embedded
You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java. For programmatic access to Grunt, use PigRunner. (A hedged PigServer sketch is given after the Grunt notes below.)

Grunt
Grunt has line-editing facilities like those found in the bash shell. For instance, the Ctrl-E key combination will move the cursor to the end of the line.
Grunt remembers command history, too, and you can recall lines in the history buffer using Ctrl-P or Ctrl-N (for previous and next) or, equivalently, the up or down cursor keys.
Another handy feature is Grunt's completion mechanism, which will try to complete Pig Latin keywords and functions when you press the Tab key. For example, consider the following incomplete line:
grunt> a = foreach b ge
If you press the Tab key at this point, ge will expand to generate, a Pig Latin keyword:
grunt> a = foreach b generate
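A hedged sketch of the embedded approach using the PigServer class; local mode is assumed, and the input path and schema reuse the weather example introduced in the next section.

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigSketch {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; ExecType.MAPREDUCE would target a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Register Pig Latin statements one at a time.
        pig.registerQuery("records = LOAD 'input/ncdc/micro-tab/sample.txt' "
                + "AS (year:chararray, temperature:int, quality:int);");
        pig.registerQuery("grouped_records = GROUP records BY year;");
        pig.registerQuery("max_temp = FOREACH grouped_records GENERATE group, MAX(records.temperature);");
        // Iterate over the result, much as DUMP does in Grunt.
        Iterator<Tuple> it = pig.openIterator("max_temp");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}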


An Example
Let's look at a simple example by writing a program in Pig Latin to calculate the maximum recorded temperature by year for the weather dataset.

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;

To explore what's going on, we'll use Pig's Grunt interpreter, which allows us to enter lines and interact with the program to understand what it's doing. Start up Grunt in local mode, then enter the first line of the Pig script:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);

For simplicity, the program assumes that the input is tab-delimited text, with each line having just year, temperature, and quality fields. The result of the LOAD operator is a relation, which is just a set of tuples. A tuple is just like a row of data in a database table, with multiple fields in a particular order. In this example, the LOAD function produces a set of (year, temperature, quality) tuples that are present in the input file. We write a relation with one tuple per line, where tuples are represented as comma-separated items in parentheses:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

Relations are given names, or aliases, so they can be referred to. This relation is given the records alias. We can examine the contents of an alias using the DUMP operator:
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

We can also see the structure of a relation, the relation's schema, using the DESCRIBE operator on the relation's alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}

This tells us that records has three fields, with aliases year, temperature, and quality, which are the names we gave them in the AS clause. The fields have the types given to them in the AS clause, too. For this small dataset, no records are filtered out:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

The third statement uses the GROUP function to group the records relation by the year field. Let's use DUMP to see what it produces:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})

We now have two rows, or tuples, one for each year in the input data. The first field in each tuple is the field being grouped by (the year), and the second field is a bag of tuples for that year. A bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.

By grouping the data in this way, we have created a row per year, so now all that remains is to find the maximum temperature for the tuples in each bag. Before we do this, let's understand the structure of the grouped_records relation:
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {year: chararray,temperature: int,quality: int}}

This tells us that the grouping field is given the alias group by Pig, and the second field is the same structure as the filtered_records relation that was being grouped. With this information, we can try the fourth transformation:
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);

FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row. In this example, the first field is group, which is just the year. The second field is a little more complex. The filtered_records.temperature reference is to the temperature field of the filtered_records bag in the grouped_records relation. MAX is a built-in function for calculating the maximum value of fields in a bag. In this case, it calculates the maximum temperature for the fields in each filtered_records bag. Let's check the result:
grunt> DUMP max_temp;
(1949,111)
(1950,22)
So we've successfully calculated the maximum temperature for each year.

Pig Data Model
• Atom - Simple atomic value (i.e., a number or string)
• Tuple - Sequence of fields; each field can be of any type
• Bag - Collection of tuples
  - Duplicates are possible
  - Tuples in a bag can have different field lengths and field types
• Map - Collection of key-value pairs. The key is an atom; the value can be any type


Data Model
• Control over dataflow
  Ex 1 (less efficient):
  spam_urls = FILTER urls BY isSpam(url);
  culprit_urls = FILTER spam_urls BY pagerank > 0.8;
  Ex 2 (more efficient):
  highpgr_urls = FILTER urls BY pagerank > 0.8;
  spam_urls = FILTER highpgr_urls BY isSpam(url);
• Fully nested
  - More natural for procedural programmers (the target user) than normalization
  - Data is often stored on disk in a nested fashion
  - Facilitates ease of writing user-defined functions
• No schema required

User-Defined Functions (UDFs)
• Ex: spam_urls = FILTER urls BY isSpam(url);
• Can be used in many Pig Latin statements
• Useful for custom processing tasks
• Can use non-atomic values for input and output
• Currently must be written in Java
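The slides note that UDFs such as isSpam() must, in this version of Pig, be written in Java. A minimal hedged sketch of such a filter UDF follows; the class name and the spam test are illustrative placeholders.

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// A stand-in for the isSpam() filter used in the examples above.
public class IsSpam extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        String url = input.get(0).toString();
        // Placeholder rule; a real UDF would consult a model or blacklist.
        return url.contains("spam");
    }
}

After packaging the class into a jar, it would be registered and used from Pig Latin roughly as: REGISTER my-udfs.jar; spam_urls = FILTER urls BY IsSpam(url); (assuming the class is in the default package, or is referenced by its fully qualified name).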


Compilation
The Pig system does two tasks:
• Builds a Logical Plan from a Pig Latin script
  - Supports execution platform independence
• Compiles the Logical Plan to a Physical Plan and executes it
  - Converts the Logical Plan into a series of Map-Reduce jobs to be executed (in this case) by Hadoop Map-Reduce

Building a Logical Plan
• Verify that the input files and bags referred to are valid
• Create a logical plan for each bag defined
• No processing of data is performed at this stage

Building a Logical Plan Example
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city IS 'kitchener' OR city IS 'waterloo';
STORE D INTO 'local_user_count.dat';
[Diagram: the operator chain Load(user.dat), Filter, Group, Foreach.]
Building a physical plan happens only when output is specified by STORE or DUMP.

Building a Physical Plan
• Step 1: Create a map-reduce job
• Step 2: Push commands into the map and reduce functions where possible
[Diagram: the same operator chain, with Load and Filter placed in the map function and Group and Foreach carried through to the reduce function.]

Compilation and Execution of a Pig Latin Script
• When the Pig Latin interpreter sees the first line containing the LOAD statement, it confirms that it is syntactically and semantically correct and adds it to the logical plan, but it does not load the data from the file (or even check whether the file exists).
• Pig validates the GROUP and FOREACH...GENERATE statements, and adds them to the logical plan without executing them.
• The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed.
• The physical plan that Pig prepares is a series of MapReduce jobs, which in local mode Pig runs in the local JVM, and in MapReduce mode Pig runs on a Hadoop cluster.
• We can see the logical and physical plans created by Pig using the EXPLAIN command on a relation (EXPLAIN max_temp; for example).
• EXPLAIN will also show the MapReduce plan, which shows how the physical operators are grouped into MapReduce jobs.
• This is a good way to find out how many MapReduce jobs Pig will run for your query.

Syntax and Semantics of Pig Latin
Pig Latin Program Structure
• A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command.
(e.g.) grouped_records = GROUP records BY year;
ls /
• Statements are usually terminated with a semicolon.
• Pig Latin has two forms of comments. Double hyphens introduce single-line comments:
-- My program
DUMP A; -- What's in A?
• C-style comments are more flexible since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:
/*
 * Description of my program spanning
 * multiple lines.
 */
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
DUMP C;


Syntax and Semantics of Pig Latin
Keywords
Pig Latin has a list of keywords that have a special meaning in the language and cannot be used as identifiers, e.g.:
operators (LOAD, ILLUSTRATE)
commands (cat, ls)
expressions (matches, FLATTEN)
and functions (DIFF, MAX)
Pig Latin has mixed rules on case sensitivity:
• Operators and commands are not case-sensitive.
• Aliases and function names are case-sensitive.
(The max_temp.pig script shown earlier is used on this slide to illustrate these rules.)

[The handout then presents, in table form: Pig Latin relational operators, Pig Latin diagnostic operators, Pig Latin expressions, and Pig commands to interact with Hadoop filesystems.]

Pig Latin types
[The handout lists the Pig Latin types and expressions in table form.]
Pig provides the built-in functions TOTUPLE, TOBAG and TOMAP, which are used for turning expressions into tuples, bags and maps.

Schemas
A relation in Pig may have an associated schema, which gives the fields in the relation names and types. The AS clause in a LOAD statement is used to attach a schema to a relation:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}

It's possible to omit type declarations completely, too:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature, quality);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: bytearray,quality: bytearray}

Types can also be declared for only some of the fields:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: int,quality: int}

A relation can also be loaded with no schema at all:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DESCRIBE records;
Schema for records unknown.

Fields in a relation with no schema can be referenced only using positional notation: $0 refers to the first field in a relation, $1 to the second, and so on. Their types default to bytearray:
grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
grunt> DUMP projected_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE projected_records;
projected_records: {bytearray,bytearray,bytearray}

Functions
Categories of Pig Functions

Eval function
A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an example of an aggregate function.

Filter function
A special type of eval function that returns a logical Boolean result. Filter functions are used in the FILTER operator to remove unwanted rows. They can also be used in other relational operators that take Boolean conditions and, in general, in expressions using Boolean or conditional expressions. An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.

Load function
A function that specifies how to load data into a relation from external storage.

Store function
A function that specifies how to save the contents of a relation to external storage.

[The handout then lists Pig's built-in functions in table form.]


Data Processing Operators

Loading and Storing Data
We have seen how to load data from external storage for processing in Pig. Here's an example of using PigStorage to store tuples as plain-text values separated by a colon character:
grunt> STORE A INTO 'out' USING PigStorage(':');
grunt> cat out
Joe:cherry:2
Ali:apple:3
Joe:banana:2
Eve:apple:7

Filtering Data
Once we have some data loaded into a relation, the next step is often to filter it to remove the data that we are not interested in. By filtering early in the processing pipeline, we minimize the amount of data flowing through the system, which can improve efficiency. The FOREACH...GENERATE operator is used to act on every row in a relation. It can be used to remove fields or to generate new ones. In this example, we do both:
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
(Eve,8,Constant)

Grouping and Joining Data
Pig has very good built-in support for join operations. Since the large datasets that are suitable for analysis by Pig (and MapReduce in general) are not normalized, joins are used more infrequently in Pig than they are in SQL.

JOIN - Let's look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)

We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)

Pig also supports outer joins using a syntax that is similar to SQL:
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
(1,Scarf,,)
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)

GROUP
The GROUP statement groups the data in a single relation. GROUP supports grouping by more than equality of keys: you can use an expression or user-defined function as the group key. For example, consider the following relation A:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)

Let's group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(Ali,apple),(Eve,apple)})
(6,{(Joe,cherry),(Joe,banana)})

GROUP creates a relation whose first field is the grouping field, which is given the alias group. The second field is a bag containing the grouped fields with the same schema as the original relation (in this case, A).

Sorting Data
Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
There is no guarantee which order the rows will be processed in. In particular, when retrieving the contents of A using DUMP or STORE, the rows may be written in any order. If we want to impose an order on the output, we can use the ORDER operator to sort a relation by one or more fields. The default sort order compares fields of the same type using the natural ordering, and different types are given an arbitrary, but deterministic, ordering (a tuple is always "less than" a bag, for example). The following example sorts A by the first field in ascending order and by the second field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
Any further processing on a sorted relation is not guaranteed to retain its order. For example:
grunt> C = FOREACH B GENERATE *;
Even though relation C has the same contents as relation B, its tuples may be emitted in any order by a DUMP or a STORE. It is for this reason that it is usual to perform the ORDER operation just before retrieving the output.


LIMIT
The LIMIT statement is useful for limiting the number of results, as a quick and dirty way to get a sample of a relation. It can be used immediately after the ORDER statement to retrieve the first n tuples. Usually, LIMIT will select any n tuples from a relation, but when used immediately after an ORDER statement, the order is retained (in an exception to the rule that processing a relation does not retain its order):
grunt> D = LIMIT B 2;
grunt> DUMP D;
(1,2)
(2,4)
If the limit is greater than the number of tuples in the relation, all tuples are returned (so LIMIT has no effect).
Using LIMIT can improve the performance of a query because Pig tries to apply the limit as early as possible in the processing pipeline, to minimize the amount of data that needs to be processed. For this reason, we should always use LIMIT if we are not interested in the entire output.

Combining and Splitting Data
Sometimes you have several relations that you would like to combine into one. For this, the UNION statement is used. For example:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(1,2)
(2,4)
(z,x,8)
(w,y,1)

The SPLIT operator is the opposite of UNION; it partitions a relation into two or more relations.
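A short hedged sketch of SPLIT, reusing relation A from the UNION example above (the split conditions are illustrative, the first field is assumed to be numeric, and output order may vary):
grunt> SPLIT A INTO small IF $0 < 2, large IF $0 >= 2;
grunt> DUMP small;
(1,2)
grunt> DUMP large;
(2,3)
(2,4)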

