Hadoop Related Tools
Syllabus
HBase - data model and implementations - HBase clients - HBase examples - praxis, Pig - Grunt -
Pig data model - Pig Latin - developing and testing Pig Latin scripts. Hive - data types and file
formats - HiveQL data definition - HiveQL data manipulation - HiveQL queries.
Contents
5.1 Hbase
5.2 Data Model and implementations
5.3 Hbase Clients
5.4 Praxis
5.5 Pig
5.6 Hive
5.7 HiveQL Data Definition
5.8 HiveQL Data Manipulation
5.9 HiveQL Queries
5.10 Two Marks Questions with Answers
5.1 HBase
HBase is an open-source, non-relational, distributed database modeled after Google's BigTable, built on top of Hadoop.
It is column oriented and horizontally scalable.
It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop file system. It runs on top of Hadoop and HDFS,
providing BigTable-like capabilities for Hadoop.
HBase supports massively parallelized processing via MapReduce for using HBase
as both source and sink.
HBase supports an easy-to-use Java API for programmatic access. It also supports
Thrift and REST for non-Java front-ends.
HBase is a column-oriented distributed database in the Hadoop environment. It can
store massive amounts of data, from terabytes to petabytes. HBase is scalable,
distributed big data storage on top of the Hadoop ecosystem.
The HBase physical architecture consists of servers in a Master-Slave relationship.
Typically, the HBase cluster has one Master node, called HMaster, and multiple
Region Servers called HRegionServer. Fig. 5.1.1 shows HBase architecture.
Fig. 5.1.1 HBase architecture
Zookeeper is a centralized monitoring server which maintains configuration
information and provides distributed synchronization. If the client wants to
communicate with region servers, the client has to approach Zookeeper.
HMaster is the master server of HBase and it coordinates the HBase cluster.
HMaster is responsible for the administrative operations of the cluster.
HRegion servers : They perform the following functions in communication with
HMaster and Zookeeper :
1. Hosting and managing regions.
2. Splitting regions automatically.
3. Handling read and write requests.
4. Communicating with clients directly.
HRegions : For each column family, HRegions maintain a store. The main components
of HRegions are MemStore and HFile.
Data model in HBase is designed to accommodate semi-structured data that could
vary in field size, data type and columns.
HBase is a column-oriented, non-relational database. This means that data is stored
in individual columns and indexed by a unique row key. This architecture allows
for rapid retrieval of individual rows and columns and efficient scans over
individual columns within a table.
Both data and requests are distributed across all servers in an HBase cluster,
allowing users to query results on petabytes of data within milliseconds. HBase is
most effectively used to store non-relational data, accessed via the HBase API.
Features and Applications of HBase
Features of HBase :
1. HBase is linearly scalable.
2. It has automatic failure support.
3. It provides consistent reads and writes.
4. It integrates with Hadoop, both as a source and a destination.
5. It has an easy Java API for clients.
6. It provides data replication across clusters.
Where to use HBase ?
• Apache HBase is used to have random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable
acts upon Google File System; likewise Apache HBase works on top of Hadoop and HDFS.
Applications of HBase :
1. It is used whenever there is a need to write heavy applications.
2. HBase is used whenever we need to provide fast random access to available data.
3. Companies such as Facebook, Twitter, Yahoo and Adobe use HBase internally.
Difference between HDFS and HBase
1. HDFS is a distributed file system suitable for storing large files, whereas HBase is a database built on top of HDFS.
2. HDFS does not support fast individual record lookups, whereas HBase provides fast lookups for larger tables.
3. HDFS provides high latency batch processing, whereas HBase provides low latency access to single rows from billions of records (random access).
4. HDFS provides only sequential access of data, whereas HBase internally uses hash tables, provides random access and stores the data in indexed HDFS files for faster lookups.
5. HDFS is suited for high latency operations, whereas HBase is suited for low latency operations.
6. In HDFS, data are primarily accessed through MapReduce jobs, whereas HBase data is accessed through shell commands, client API in Java, REST, Avro or Thrift.
7. HDFS doesn't have the concept of random read and write operations, whereas HBase provides random read and write access to single rows from billions of records.
Difference between HBase and Relational Database
1. HBase is schema-less, whereas a relational database is based on a fixed schema.
2. HBase is a column-oriented datastore, whereas a relational database is a row-oriented datastore.
3. HBase is designed to store denormalized data, whereas a relational database is designed to store normalized data.
4. HBase contains wide and sparsely populated tables, whereas a relational database contains thin tables.
5. HBase supports automatic partitioning, whereas a relational database has no built-in support for partitioning.
6. HBase is good for semi-structured as well as structured data, whereas a relational database is good for structured data.
7. HBase is not transactional, whereas an RDBMS is transactional.
Limitations of HBase
• It takes a very long time to recover if the HMaster goes down, and a long time to activate another node if the first node goes down.
• In HBase, cross data operations and join operations are very difficult to perform.
• HBase needs a new format when we want to migrate from RDBMS external sources to HBase servers.
• It is very challenging in HBase to support the querying process.
• It takes enormous time to develop the security features needed to grant access to users.
• HBase allows only one default sort for a table and it does not support large binary files.
• HBase is expensive in terms of hardware requirements and memory block allocations.
5.2 Data Model and Implementations
The Apache HBase Data Model is designed to accommodate structured or
semi-structured data that could vary in field size, data type and columns. HBase
stores data in tables, which have rows and columns. The table schema is very
different from traditional relational database tables.
A database consists of multiple tables. Each table consists of multiple rows, sorted
by row key. Each row contains a row key and one or more column families.
Each column family is defined when the table is created. Column families can
contain multiple columns (family : column). A cell is uniquely identified by
(table,row,family : column). A cell contains an uninterpreted array of bytes and a
timestamp.
HBase data model has some logical components which are as follows :
1. Tables  2. Rows  3. Column Families/Columns  4. Versions/Timestamp  5. Cells
Tables : The HBase Tables are more like logical collection of rows stored in
Separate partitions called Regions. As shown above, every Region is then served
by exactly one Region Server.
• The syntax to create a table in HBase shell is shown below :
create '<table name>', '<column family>'
• Example : create 'CustomerContactInformation', 'CustomerName', 'ContactInfo'
* Tables are automatically partitioned horizontally by HBase into regions. Each
region comprises a subset of a table's rows. A region is denoted by the table it
belongs to. Fig. 5.2.1 shows a region with its table.
Fig. 5.2.1 Region with table
* There is one region server per node. There are many regions in a region server. At
any time, a given region is pinned to a particular region server. Tables are split
into regions and are scattered across region servers. A table must have at least one
region.
* Rows : A row is one instance of data in a table and is identified by a rowkey.
Rowkeys are unique in a Table and are always treated as a byte[ ].
© Column families : Data in a row are grouped together as Column Families. Each
Column Family has one or more Columns and these Columns in a family are stored
together in a low level storage file known as HFile. Column Families form the
basic unit of physical storage to which certain HBase features like compression are
applied.
Columns : A Column Family is made of one or more columns. A Column is
identified by a Column Qualifier that consists of the Column Family name
concatenated with the Column name using a colon - example : columnfamily :
columnname. There can be multiple Columns within a Column Family and Rows
within a table can have varied number of Columns.
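For instance, the CustomerContactInformation table created earlier could be written to and read back from the HBase shell roughly as sketched below; the column qualifier and values are hypothetical :
hbase> put 'CustomerContactInformation', 'row1', 'ContactInfo:email', 'abc@example.com'
hbase> get 'CustomerContactInformation', 'row1', {COLUMN => 'ContactInfo:email'}
hbase> scan 'CustomerContactInformation'
Here ContactInfo:email is the column qualifier - the column family name concatenated with the column name using a colon.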
• Cell : A Cell stores data and is essentially a unique combination of rowkey, Column
Family and the Column (Column Qualifier). The data stored in a Cell is called its
value and the data type is always treated as byte[ ].
• Version : The data stored in a cell is versioned and versions of data are identified
by the timestamp. The number of versions of data retained in a column family is
configurable and this value by default is 3.
• Time-to-Live : TTL is a built-in feature of HBase that ages out data based on its
timestamp. This idea comes in handy in use cases where data needs to be held
only for a certain duration of time. So, if on a major compaction the timestamp is
older than the specified TTL in the past, the record in question doesn't get put in
the HFile being generated by the major compaction; that is, the older records are
removed as a part of the normal upkeep of the table.
• If TTL is not used and an aging requirement is still needed, then a much more
I/O intensive operation would need to be done.
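As a sketch, versions and TTL are typically set when the column family is declared in the HBase shell; the table name, family name and TTL value below are hypothetical :
hbase> create 'events', {NAME => 'data', VERSIONS => 3, TTL => 86400}
With this definition, cells in the data family keep up to three versions and are aged out by major compactions once they are older than 86400 seconds (one day).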
5.3 HBase Clients
• There are a number of client options for interacting with an HBase cluster.
1. Java
• HBase is written in Java.
• Example : Creating a table and inserting data into an HBase table are shown in the
following program.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ExampleClient {
  public static void main(String[] args) throws IOException {
    Configuration config = HBaseConfiguration.create();
    // Create a table named "test" with a single column family named "data"
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor htd = new HTableDescriptor("test");
    HColumnDescriptor hcd = new HColumnDescriptor("data");
    htd.addFamily(hcd);
    admin.createTable(htd);
    byte[] tablename = htd.getName();
    // Run some operations -- a put
    HTable table = new HTable(config, tablename);
    byte[] row1 = Bytes.toBytes("row1");
    Put p1 = new Put(row1);
    byte[] databytes = Bytes.toBytes("data");
    p1.add(databytes, Bytes.toBytes("FN"), Bytes.toBytes("value1"));
    table.put(p1);
  }
}
• To create a table, we need to first create an instance of HBaseAdmin and then ask
it to create the table named test with a single column family named data.
2. MapReduce
• HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package
facilitate using HBase as a source and/or sink in MapReduce jobs. The
TableInputFormat class makes splits on region boundaries so maps are handed a
single region to work on. The TableOutputFormat will write the result of the
MapReduce job into HBase.
• Example : A MapReduce application to count the number of rows in an HBase
table :
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class RowCounter {
  /** Name of this 'program'. */
  static final String NAME = "rowcounter";

  static class RowCounterMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    /** Counter enumeration to count the actual rows. */
    public static enum Counters { ROWS }

    @Override
    public void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException {
      // A row is counted as soon as it has one non-empty cell
      for (KeyValue value : values.list()) {
        if (value.getValue().length > 0) {
          context.getCounter(Counters.ROWS).increment(1);
          break;
        }
      }
    }
  }

  public static Job createSubmittableJob(Configuration conf, String[] args)
      throws IOException {
    String tableName = args[0];
    Job job = new Job(conf, NAME + "_" + tableName);
    job.setJarByClass(RowCounter.class);
    // Columns are space delimited
    StringBuilder sb = new StringBuilder();
    final int columnoffset = 1;
    for (int i = columnoffset; i < args.length; i++) {
      if (i > columnoffset) {
        sb.append(" ");
      }
      sb.append(args[i]);
    }
    Scan scan = new Scan();
    scan.setFilter(new FirstKeyOnlyFilter());
    if (sb.length() > 0) {
      for (String columnName : sb.toString().split(" ")) {
        String[] fields = columnName.split(":");
        if (fields.length == 1) {
          scan.addFamily(Bytes.toBytes(fields[0]));
        } else {
          scan.addColumn(Bytes.toBytes(fields[0]), Bytes.toBytes(fields[1]));
        }
      }
    }
    // No reduce output is needed : the count is reported through the ROWS counter
    job.setOutputFormatClass(NullOutputFormat.class);
    TableMapReduceUtil.initTableMapperJob(tableName, scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    return job;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 1) {
      System.err.println("ERROR: Wrong number of parameters: " + args.length);
      System.err.println("Usage: RowCounter <tablename> [<column1> <column2>...]");
      System.exit(-1);
    }
    Job job = createSubmittableJob(conf, otherArgs);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
3. Avro, REST, and Thrift
+ HBase ships with Avro, REST and Thrift interfaces. These are useful when the
interacting application is written in a language other than Java. In all cases, a Java
server hosts an instance of the HBase client, brokering Avro, REST and Thrift
application requests in and out of the HBase cluster. This extra work proxying requests
and responses means these interfaces are slower than using the Java client directly.
REST : To put up a Stargate (the HBase REST server) instance, start it using the following command :
% hbase-daemon.sh start rest
This will start a server instance, by default on port 8080, background it and catch
any emissions by the server in logfiles under the HBase logs directory.
Clients can ask for the response to be formatted as JSON, Google's protobufs, or as
XML, depending on how the client HTTP Accept header is set.
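As an illustration, a client could ask a local Stargate instance for a JSON response roughly as follows; the host, port and endpoint shown are assumptions for a default local setup :
% curl -H "Accept: application/json" http://localhost:8080/version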
To stop the REST server, type :
% hbase-daemon.sh stop rest
Thrift : Similarly, a Thrift server must be started to field Thrift clients :
% hbase-daemon.sh start thrift
This starts a server instance, by default on port 9090, backgrounds it and catches any
emissions by the server in logfiles under the HBase logs directory. Thrift clients use
classes generated from the HBase Thrift interface definition.
• Avro : The Avro server is started and stopped in the same manner as we start
and stop the Thrift or REST services. The Avro server by default uses port 9090,
5.4 Praxis
• When an HBase cluster is running under load, the following issues should be considered :
• Versions : A particular HBase version would run on any Hadoop that had a
matching minor version. HBase 0.20.3 would run on Hadoop 0.20.2, but
HBase 0.19.5 would not run on Hadoop 0.20.0.
• HDFS : In MapReduce, HDFS files are opened, with their content streamed
through a map task, and then closed. In HBase, data files are opened on cluster
startup and kept open. Because of this, HBase tends to see issues not normally
encountered by MapReduce clients.
• Running out of file descriptors : Because of the number of open files on a loaded cluster, it
doesn't take long before we run into system- and Hadoop-imposed limits. Each
open file consumes at least one descriptor over on the remote datanode. The
default limit on the number of file descriptors per process is 1024; when the HBase
process exceeds it, complaints show up in the regionserver's logs.
• Running out of datanode threads : The Hadoop datanode has an upper bound of
256 on the number of threads it can run at any one time.
• Sync : We must run HBase on an HDFS that has a working sync; otherwise, there
will be loss of data. This means running HBase on Hadoop 0.21.x, which adds a
working sync/append to Hadoop 0.20.
• UI : HBase runs a web server on the master to present a view on the state of the
running cluster. By default, it listens on port 60010. The master UI displays a list
of basic attributes such as software versions, cluster load, request rates, lists of
cluster tables and participating regionservers.
• Schema design : HBase tables are like those in an RDBMS, except that cells are
versioned, rows are sorted and columns can be added on the fly by the client as
long as the column family they belong to preexists.
• Joins : There is no native database join facility in HBase, but wide tables can make
it so that there is no need for database joins to pull data from other tables. A wide
row can sometimes be made to hold all data that pertains to a particular primary key.
5.5 Pig
• Pig is an open-source, high-level data flow system : a high-level platform for
creating MapReduce programs used in Hadoop. It translates its scripts into efficient
sequences of one or more MapReduce jobs.
• Pig offers a high-level language to write data analysis programs, which we call
Pig Latin. The salient property of Pig programs is that their structure is amenable
to substantial parallelization, which in turn enables them to handle very large
data sets.
« Pig makes use of both, the Hadoop Distributed File System as well as the
MapReduce.
Features of Pig Hadoop :
1. Inbuilt operators : Apache Pig provides a very good set of operators for
performing several data operations like sort, join, filter, etc.
2. Ease of programming.
3. Automatic optimization : The tasks in Apache Pig are automatically optimized.
4. Handles all kinds of data : Apache Pig can analyze both structured and
unstructured data and store the results in HDFS.
• Fig. 5.5.1 shows Pig architecture.
• Pig has two execution modes :
Local mode : To run Pig in local mode, we need access to a single machine; all
files are installed and run using the local host and file system. Specify local mode
using the -x flag (pig -x local).
MapReduce mode : To run Pig in MapReduce mode, we need access to a Hadoop
cluster and an HDFS installation. MapReduce mode is the default mode; we don't
need to specify it, but we can, using the -x flag (pig -x mapreduce).
Fig. 5.5.1 Pig architecture
« Pig Hadoop framework has four main components :
1. Parser : When a Pig Latin script is sent to Hadoop Pig, it is first handled by the
parser. The parser is responsible for checking the syntax of the script, along
with other miscellaneous checks. The parser gives an output in the form of a
Directed Acyclic Graph (DAG) that contains Pig Latin statements, together with
other logical operators represented as nodes.
2. Optimizer : After the output from the parser is retrieved, a logical plan for the
DAG is passed to a logical optimizer. The optimizer is responsible for carrying
out the logical optimizations.
3. Compiler : The role of the compiler comes in when the output from the
optimizer is received. The compiler compiles the logical plan sent by the
optimizer; the logical plan is then converted into a series of MapReduce tasks or
jobs.
4. Execution engine : After the logical plan is converted to MapReduce jobs, these
jobs are sent to Hadoop in a properly sorted order and these jobs are executed
on Hadoop for yielding the desired result.
Pig can run on two types of environments : The local environment in a single JVM
or the distributed environment on a Hadoop cluster.
Pig has a variety of scalar data types and standard data processing options. Pig
supports Map data, a map being a set of key-value pairs.
Most pig operators take a relation as an input and give a relation as the output. It
allows normal arithmetic operations and relational operations too.
Pig's language layer currently consists of a textual language called Pig Latin. Pig
Latin is a data flow language. This means it allows users to describe how data
from one or more inputs should be read, processed and then stored to one or
more outputs in parallel.
+ These data flows can be simple linear flows, or complex workflows that include
points where multiple inputs are joined and where data is split into multiple
streams to be processed by different operators. To be mathematically precise, a Pig
Latin script describes a directed acyclic graph (DAG), where the edges are data
flows and the nodes are operators that process the data.
+ The first step in a Pig program is to LOAD the data, which we want to
manipulate from HDFS. Then run the data through a set of transformations.
Finally, DUMP the data to the screen or STORE the results in a file somewhere.
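A minimal Pig Latin sketch of this load - transform - store flow is shown below; the file name, delimiter and field names are assumptions made only for illustration :
-- Load comma-separated records from HDFS (hypothetical file and schema)
records = LOAD 'student_data.txt' USING PigStorage(',')
          AS (name:chararray, age:int, gpa:float);
-- Transformation : keep only students with gpa above 3.0
good_students = FILTER records BY gpa > 3.0;
-- Either inspect the result on screen or store it back to HDFS
DUMP good_students;
STORE good_students INTO 'good_students_out' USING PigStorage(',');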
Advantages of Pig :
1. Fast execution that works with MapReduce, Spark and Tez.
2. Its ability to process almost any amount of data, regardless of size.
3. A strong documentation process that helps new users learn Pig Latin.
4. Local and remote interoperability that lets professionals work from anywhere with
a reliable connection.
Pig disadvantages :
1. Slow start-up and clean-up of MapReduce jobs
2. Not suitable for interactive OLAP analytics
3. Complex applications may require many user defined functions.
Pig Data Model
• With Pig, when the data is loaded the data model is specified. Any data that we
load from the disk into Pig will have a specific schema and structure. The Pig data
model is rich enough to manage most of what's thrown in its way, like table-like
structures and nested hierarchical data structures.
• However, Pig data types can be divided into two groups in general terms : Scalar
types and complex types.
• Scalar types contain a single value, while complex types include other values, such
as the Tuple, Bag and Map types.
In its data model, Pig Latin has these four types :
• Atom : An atom is any single value, such as a string or a number, for example
'Hadoop'. The atomic values of Pig are scalar types that appear in most
programming languages : int, long, float, double, chararray and bytearray.
• Tuple : A tuple is a record formed by an ordered set of fields. Each field can be
of any type, for example 'Hadoop' or 6. Think of a tuple as a row in a table.
• Bag : A bag is a collection of tuples, which need not be unique. The bag's schema
is flexible; each tuple in the collection can contain an arbitrary number of fields
and each field can be of any type.
• Map : A map is a set of key-value pairs. The value can be of any type and the
key needs to be unique. The key of a map must be a chararray and the value
may be of any kind.
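A small Pig Latin sketch of how these complex types can appear in a LOAD schema is given below; the file name and field names are hypothetical :
students = LOAD 'students.txt'
           AS (name:chararray,
               marks:tuple(internal:int, external:int),
               subjects:bag{t:tuple(subject:chararray)},
               details:map[]);
DUMP students;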
Pig Latin
© The Pig Latin is a data flow language used by Apache Pig to analyze the data in
Hadoop. It is a textual language that abstracts the programming from the Java
MapReduce idiom into a notation.
• The Pig Latin statements are used to process the data. Each statement is an operator that
accepts a relation as an input and generates another relation as an output :
a) It can span multiple lines.
b) Each statement must end with a semi-colon.
¢) It may include expression and schemas.
d) By default, these statements are processed using multi-query execution.
Pig Latin statements work with relations. A relation can be defined as follows :
a) A relation is a bag (more specifically, an outer bag).
b) A bag is a collection of tuples.
c) A tuple is an ordered set of fields.
d) A field is a piece of data.
Pig Latin Datatypes :
1. Int : "int" represents a signed 32-bit integer. For example : 13
2. Long : It represents a signed 64-bit integer. For example : 13L
3. Float : This data type represents a signed 32-bit floating point. For example : 130.5F
4. Double : "double" represents a 64-bit floating point. For example : 13.5
5. Chararray : It represents a character array (string) in Unicode UTF-8 format. For
example : 'Big Data'
6. Bytearray : This data type represents a byte array.
7. Boolean : "Boolean" represents a Boolean value. For example : true/false.
Developing and Testing Pig Latin Scripts
Pig provides several tools and diagnostic operators to help us to develop
applications.
Scripts in Pig can be executed in interactive or batch mode. To use Pig in
interactive mode, we invoke it in local or MapReduce mode and then enter commands
one after the other. In batch mode, we save the commands in a .pig file and specify
the path to the file when invoking Pig.
At an overly simplified level a Pig script consists of three steps. In the first step
we load data from HDFS. In the second step we perform transformations on the
data. In the final step we store the transformed data. Transformations are the heart of
Pig scripts.
Pig has a schema concept that is used when loading data to specify what it should
expect. First specify columns and optionally their data types. Any columns in the data
but not included in the schema are truncated.
When we have fewer columns than those specified in the schema, they are filled with
nulls. To load sample data sets we first move them to HDFS and then from there we
will load them into Pig.
Pig programs can be packaged in three different ways :
1. Script : This is nothing more than a file consisting of Pig Latin
commands, identified by the .pig suffix. Ending a Pig program with the .pig
extension is a convention but not required. The commands are interpreted by
the Pig Latin compiler and then run in the order determined by the Pig
optimizer.
2. Grunt : Grunt acts as a command interpreter where we can interactively enter
Pig Latin at the Grunt command line and immediately see the response. This
method is useful for prototyping during early development stages and for
what-if scenarios.
3. Embedded : Pig Latin statements can run within Java, JavaScript and Python
programs.
Pig scripts, Grunt shell Pig commands and embedded Pig programs may be
executed in either local mode or MapReduce mode. The Grunt shell provides an
interactive shell to submit Pig commands and run Pig scripts. To start the Grunt
shell in interactive mode, we need to submit the command pig at the shell.
To tell the compiler whether a script or Grunt shell is executed locally or in
Hadoop mode, just specify it in the -x flag to the pig command. The following is
an example of how we would specify running our Pig script in local mode :
pig -x local mindStick.pig
Here's how we would run the Pig script in Hadoop mode, which is the default if
we don't specify the flag :
pig -x mapreduce mindStick.pig
By default, when we specify the pig command without any parameters, it starts
the Grunt shell in Hadoop mode. If we want to start the Grunt shell in local mode,
just add the -x local flag to the command.
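A short, hypothetical Grunt session in local mode (the file and field names are assumptions) illustrates the interactive workflow :
$ pig -x local
grunt> A = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, marks:int);
grunt> B = FILTER A BY marks > 40;
grunt> DUMP B;
grunt> quit;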
5.6 Hive
Apache Hive is an open source data warehouse software for reading, writing and
managing large data set files that are stored directly in either the Apache Hadoop
Distributed File System (HDFS) or other data storage systems such as Apache
HBase.
Data analysts often use Hive to analyze data, query large amounts of unstructured
data and generate data summaries.
Features of Hive :
1. It stores schema in a database and processes data into HDFS.
2. It is designed for OLAP.
3. It provides SQL type language for querying called HiveQL or HQL.
4. It is familiar, fast, scalable and extensible.
Hive supports a variety of storage formats : TEXTFILE for plain text, SEQUENCEFILE
for binary key-value pairs and RCFILE, which stores columns of a table in a record
columnar format.
• The Hive table structure consists of rows and columns. The rows typically correspond
to some record, transaction, or particular entity detail.
• The values of the corresponding columns represent the various attributes or
characteristics for each row.
• Hadoop and its ecosystem are used to apply some structure to unstructured data.
Therefore, if a table structure is an appropriate way to view the restructured data,
Hive may be a good tool to use.
• Following are some Hive use cases :
1. Exploratory or ad-hoc analysis of HDFS data : Data can be queried,
transformed and exported to analytical tools.
2. Extracts or data feeds to reporting systems, dashboards, or data repositories
such as HBase.
3. Combining external structured data with data already residing in HDFS.
Advantages :
1. Simple querying for anyone already familiar with SQL.
2 Its ability to connect with a variety of relational databases, including Postgres and
MySQL.
3. Simplifies working with large amounts of data.
Disadvantages :
1. Updating data is complicated
2. No real-time access to data.
3. High latency.
Program example : Write code in Java for a simple Word Count application
that counts the number of occurrences of each word in a given input set, using the
Hadoop MapReduce framework on a local standalone set-up.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum all the counts emitted for this word
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
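Assuming the class is compiled and packaged into a jar (the jar name and HDFS paths below are hypothetical), the job can be submitted as :
$ hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output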
Hive Architecture
• Fig. 5.6.1 shows Hive architecture. Its main components are the user interfaces, the
meta store, the HiveQL process engine, the execution engine and HDFS or HBASE
data storage.
Fig. 5.6.1 Hive architecture
e User Interface : Hive is a data warehouse infrastructure software that can create
interaction between user and HDFS.
* The user interfaces that Hive supports are Hive Web UI, Hive command line and
Hive HD Insight.
* Meta Store : Hive chooses respective database servers to store the schema or
Metadata of tables, databases, columns in a table, their data types and HDFS
mapping.
HiveQL Process Engine : HiveQL is similar to SQL for querying on schema info
on the Metastore. It is one of the replacements of traditional approach for
MapReduce program. Instead of writing MapReduce program in Java, we can
write a query for MapReduce job and process it.
Execution engine : The conjunction part of the HiveQL process engine and
MapReduce is the Hive execution engine. The execution engine processes the query and
generates results the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE : Hadoop distributed file system or HBASE are the data storage
techniques to store data into file system.
Working of Hive :
• Fig. 5.6.2 shows the working of Hive.
Fig. 5.6.2 Hive working
1. Execute query : The Hive interface such as command line or Web UI sends the query
to the driver to execute.
2. Get plan : The driver takes the help of the query compiler that parses the query to
check the syntax and the query plan or the requirement of the query.
3. Get metadata : The compiler sends a metadata request to the metastore.
4. Send metadata : The metastore sends metadata as a response to the compiler.
5. Send plan : The compiler checks the requirement and resends the plan to the
driver. Up to here, the parsing and compiling of the query is complete.
6. Execute plan : The driver sends the execute plan to the execution engine.
7. Execute job : Internally, the process of executing the job is a MapReduce job. The
execution engine sends the job to the JobTracker, which is in the Name node, and it
assigns this job to the TaskTracker, which is in the Data node. Here, the query executes
the MapReduce job.
7.1 Metadata Ops : Meanwhile, in execution, the execution engine can execute
metadata operations with the metastore.
8. Fetch result : The execution engine receives the results from the data nodes.
9. Send results : The execution engine sends those resultant values to the driver.
10. Send results : The driver sends the results to the Hive interfaces.
Data Types and File Formats
1. Data types :
• The Hive data types can be classified into two categories : Primary data types and
complex data types.
• Primary data types are of four types : Numeric, string, date/time and
miscellaneous types.
• Numeric data types : Integral types are TINYINT, SMALLINT, INT and BIGINT.
Floating types are FLOAT, DOUBLE and DECIMAL.
• String data types are STRING, VARCHAR and CHAR.
• Date/Time data types : Hive provides DATE and TIMESTAMP data types in
traditional UNIX time stamp format for date/time related fields. DATE
values are represented in the form YYYY-MM-DD. TIMESTAMP uses the format
yyyy-mm-dd hh:mm:ss[.f...].
• Miscellaneous types : Hive supports two more primitive data types : BOOLEAN
and BINARY. BOOLEAN stores true or false values only.
Fig. 5.6.3 Hive data types
• Complex types are ARRAY, MAP, STRUCT and UNION.
• Array in Hive is an ordered sequence of similar type elements that are
indexable using zero-based integers.
• Map in Hive is a collection of key-value pairs, where the fields are accessed
using array notation of keys (e.g., ['key']).
• STRUCT in Hive is similar to the STRUCT in C language. It is a record type
that encapsulates a set of named fields, which can be any primitive data
type.
• UNION type in Hive is similar to the UNION in C. UNION types at any
point of time can hold exactly one data type from its specified data types.
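A hedged HiveQL sketch (the table and column names are hypothetical) shows these complex types in a table definition :
CREATE TABLE employees (
  name          STRING,
  salary        FLOAT,
  subordinates  ARRAY<STRING>,
  deductions    MAP<STRING, FLOAT>,
  address       STRUCT<street:STRING, city:STRING, zip:INT>
);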
2. File formats :
• In Hive this refers to how records are stored inside the file. As we are dealing with
structured data, each record has to be its own structure. How records are encoded
in a file defines a file format. These file formats mainly vary between data
encoding, compression rate, usage of space and disk I/O.
• Hive supports the file formats : TEXTFILE, SEQUENCEFILE, RCFILE and ORCFILE.
• TEXTFILE format is a famous input/output format used in Hadoop. In Hive, if we
define a table as TEXTFILE it can load data from CSV (Comma Separated
Values), delimited by Tabs, Spaces and JSON data.
© Sequence files are flat files consisting of binary key - value pairs. When Hive
converts queries to MapReduce jobs, it decides on the appropriate key - value
pairs to be used for a given record. Sequence files are in the binary format which
can be split and the main use of these files is to club two or more smaller files
and make them into one sequence file. In Hive we can create a sequence file by
specifying STORED AS SEQUENCEFILE in the end of a CREATE TABLE
statement.
• RCFILE stands for Record Columnar File, which is another type of binary file
format that offers a high compression rate on top of the rows. RCFILE is used
when we want to perform operations on multiple rows at a time. RCFILEs are flat
files consisting of binary key/value pairs.
© Facebook uses RCFILE as its default file format for storing of data in their data
warehouse as they perform different types of analytics using Hive.
• ORCFILE : ORC stands for Optimized Row Columnar, which means it can store
data in a more optimized way than the other file formats. ORC reduces the size of the
original data up to 75 %. An ORC file contains row data in groups called
stripes along with a file footer. The ORC format improves the performance when Hive
is processing the data.
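As a sketch, the storage format is chosen when a table is created; the table and column names below are hypothetical :
-- Plain text, comma-delimited records
CREATE TABLE sales_text (id INT, item STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- Optimized Row Columnar storage
CREATE TABLE sales_orc (id INT, item STRING, amount DOUBLE)
STORED AS ORC;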
5.7 HiveQL Data Definition
• HiveQL is the Hive query language. Hive offers no support for row-level inserts,
updates and deletes. Hive doesn't support transactions. DDL statements are used
to define or change Hive databases and database objects.
• Some of the Hive DDL commands are : CREATE, SHOW, DESCRIBE, USE, DROP,
ALTER and TRUNCATE.
• Hive DDL commands and where they are used :
CREATE : Database, Table
SHOW : Databases, Tables, Table properties, Partitions, Functions, Index
DESCRIBE : Database, Table, View
USE : Database
DROP : Database, Table
ALTER : Database, Table
TRUNCATE : Table
• Hive database : In Hive, the database is considered as a catalog or namespace of
tables. It is also common to use databases to organize production tables into
logical groups. If we do not specify a database, the default database is used.
• Let's create a new database by using the following command :
hive> CREATE DATABASE Rollcall;
• Make sure the database we are creating doesn't already exist in the Hive warehouse;
if it exists, Hive throws a "Database Rollcall already exists" error.
• At any time, we can see the databases that already exist as follows :
hive> SHOW DATABASES;
default
Rollcall
hive> CREATE DATABASE student;
hive> SHOW DATABASES;
default
Rollcall
student
• Hive will create a directory for each database. Tables in that database will be
stored in subdirectories of the database directory. The exception is tables in the
default database, which doesn't have its own directory.
• Drop Database Statement :
Syntax :
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Example : hive> DROP DATABASE IF EXISTS userid;
* ALTER DATABASE : The ALTER DATABASE statement in Hive is used to
change the metadata associated with the database in Hive. Syntax for changing
Database Properties :
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES
(property_name=property_value, ...);
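For instance, a hedged example (the property name and value are hypothetical) :
hive> ALTER DATABASE Rollcall SET DBPROPERTIES ('edited-by' = 'admin');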
5.8 HiveQL Data Manipulation
• Data manipulation language is a subset of SQL statements that modify the data
stored in tables. Hive has no row-level insert, update and delete operations; the
only way to put data into a table is to use one of the "bulk" load operations.
Inserting data into tables from queries :
• The INSERT statement performs loading of data into a table from a query.
INSERT OVERWRITE TABLE students
PARTITION (branch = 'CSE', class = 'OR')
SELECT * FROM college_students se
WHERE se.branch = 'CSE' AND se.class = 'OR';
+ With OVERWRITE, any previous contents of the partition are replaced. If we drop
the keyword OVERWRITE or replace it with INTO, Hive appends the data rather
than replaces it. This feature is only available in Hive v0.8.0 or later.
* We can mix INSERT OVERWRITE clauses and INSERT INTO clauses, as well.
Dynamic partition inserts :
* Hive also supports a dynamic partition feature, where it can infer the partitions to
create based on query parameters. Hive determines the values of the partition
keys, from the last two columns in the SELECT clause.
* The static partition keys must come before the dynamic partition keys. Dynamic
partitioning is not enabled by default. When it is enabled, it works in “strict” mode
by default, where it expects at least some columns to be static. This helps protect
against a badly designed query that generates a gigantic number of partitions.
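A hedged sketch of a dynamic partition insert is given below; it reuses the hypothetical tables and columns from the earlier example, and the two SET statements show the usual properties for enabling non-strict dynamic partitioning :
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE students
PARTITION (branch, class)
SELECT se.name, se.marks, se.branch, se.class
FROM college_students se;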
• Hive Data Manipulation Language (DML) commands :
a) LOAD - The LOAD statement transfers data files into the locations that
correspond to Hive tables (a minimal example is sketched after this list).
b) SELECT - The SELECT statement in Hive functions similarly to the SELECT
statement in SQL. It is primarily for retrieving data from the database.
c) INSERT - The INSERT clause loads the data into a Hive table. Users can also
perform an insert to both the Hive table and/or partition.
d) DELETE - The DELETE clause deletes all the data in the table. Specific rows
can be targeted and deleted if the WHERE clause is specified.
e) UPDATE - The UPDATE command in Hive updates the data in the table. If the
query includes the WHERE clause, then it updates the columns of the rows that
meet the condition in the WHERE clause.
f) EXPORT - The Hive EXPORT command moves the table or partition data
together with the metadata to a designated output location in HDFS.
g) IMPORT - The Hive IMPORT statement imports the data from a particular
location to a new or currently existing table.
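A hedged LOAD sketch (the local path and table name are hypothetical) :
LOAD DATA LOCAL INPATH '/tmp/students.csv'
OVERWRITE INTO TABLE students_staging;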
5.9 HiveQL Queries
• The Hive Query Language (HiveQL) is a query language for Hive to process and
analyze structured data in a Metastore. It separates users from the complexity
of MapReduce programming.
SELECT ... FROM Clauses :
«SELECT is the projection operator in SQL. The FROM clause identifies from which
table, view or nested query we select records. For a given record, SELECT
specifies the columns to keep, as well as the outputs of function calls on one or
more columns.
+ Here's the syntax of Hive's SELECT statement.
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
* SELECT is the projection operator in HiveQL. The points are :
a) SELECT scans the table specified by the FROM clause
b) WHERE gives the condition of what to filter
©) GROUP BY gives a list of columns which specify how to aggregate the records
4) CLUSTER BY, DISTRIBUTE BY, SORT BY specify the sort order and algorithm
e) LIMIT specifies how many records to retrieve.
Computing with columns :
* When we select the columns, we can manipulate column values using either
arithmetic operators or function calls. Math, date and string functions are also
popular.
* Here's an example query that uses both operators and functions.
SELECT upper(name), sales_cost FROM products;
WHERE Clauses : A WHERE clause is used to filter the result set by using
predicate operators and logical operators. Functions can also be used to compute
the condition.
• GROUP BY clauses : A GROUP BY clause is frequently used with aggregate
functions, to group the result set by columns and apply aggregate functions over
each group. Functions can also be used to compute the grouping key.
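A hedged example combining these clauses (the table and column names are hypothetical) :
SELECT branch, COUNT(*) AS total_students, AVG(marks) AS avg_marks
FROM students
GROUP BY branch
HAVING COUNT(*) > 10;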
5.10 Two Marks Questions with Answers
Q.1 What is HBase ?
Ans. : HBase is a distributed column-oriented database built on top of the Hadoop
file system. It is an open-source project and is horizontally scalable. HBase is a data
model similar to Google's big table, designed to provide quick random access to
huge amounts of structured data.
Q.2 What is Hive ?
Ans. : Hive is a data warehouse system for Hadoop that facilitates easy data
summarization, ad-hoc queries and the analysis of large datasets stored in Hadoop.
It provides a mechanism to project structure onto this data and to query
the data using a SQL-like language called HiveQL.
Q.3 What is Hive data definition ?
Ans. : Hive data definition assigns a relational structure to the files stored on the HDFS,
so that the structured data can be queried to extract specific information, for
example columns such as MESSAGE, LINENUMBER, etc.
Q.4 Explain services provided by Zookeeper in HBase.
Ans. : Various services that Zookeeper provides include :
a) Establishing client communication with region servers.
b) Tracking server failure and network partitions.
c) Maintaining configuration information.
d) Providing ephemeral nodes, which represent different region servers.
Q.5 What is Zookeeper ?
Ans. : The Zookeeper service keeps track of all the region servers that are there in an
HBase cluster - tracking information about how many region servers are there and
which servers are holding which DataNode.
Q.6 What are the responsibilities of HMaster ?
Ans. : Responsibilities of HMaster :
a) Manages and monitors the Hadoop cluster.
b) Performs administration.
c) Controls the failover.
d) DDL operations are handled by the HMaster.
e) Whenever a client wants to change the schema and change any of the
metadata operations, HMaster is responsible for all these operations.
Q.7 Where to use HBase ?
Ans. : Hadoop HBase is used to have random real-time access to big data. It can
host large tables on top of clusters of commodity hardware. HBase is a non-relational database
which is modelled after Google's big table. It works similar to a big table to store the
files of Hadoop.
Q.8 Explain unique features of HBase.
Ans. :
• HBase is built for low latency operations.
• HBase is used extensively for random read and write operations.
• HBase stores a large amount of data in terms of tables.
• Automatic and configurable sharding of tables.
• HBase stores data in the form of key/value pairs in a columnar model.
Q.9 Explain data model in HBase.
Ans. : The data model in HBase is designed to accommodate semi-structured data
that could vary in field size, data type and columns. Additionally, the layout of the
data model makes it easier to partition the data and distribute it across the cluster.
Q.10 What is the difference between Pig Latin and Pig engine ?
Ans. : Pig Latin is a scripting language similar to Perl used to search large data sets. It
is composed of a sequence of transformations and operations that are applied to the
input data to create data.
The Pig engine is the environment in which Pig Latin programs are executed. It
translates Pig Latin operators into MapReduce jobs.
Q.11 What is Pig storage ?
Ans. : Pig has a built-in load function called Pig storage. In addition, whenever we
wish to import data from a file system into Pig, we can use Pig storage.
Q.12 What are the features of Hive ?
Ans. :
• It stores schema in a database and processed data into HDFS.
• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable and extensible.
SOLVED MODEL QUESTION PAPER
[As Per New Syllabus]
Big Data Analytics
Semester - V (AI&DS)
Vertical - I (Verticals for AIDS I) (AI&DS)
Vertical - I (Data Science) (CSE/IT/CS&BS)
Vertical - VI (Diversified Courses) (EEE)
Time : Three Hours]                                [Maximum Marks : 100
Answer ALL Questions
PART A - (10 x 2 = 20 Marks)
Q.1 What is Hadoop ? (Refer Two Marks Q.15 of Chapter - 1)
Q.2 What is data science ? (Refer Two Marks Q.1 of Chapter - 1)
Q.3 Explain Cassandra data center. (Refer Two Marks Q.10 of Chapter - 2)
Q.4 What is the difference between sharding and replication ? (Refer Two Marks Q.4 of Chapter - 2)
Q.5 Why is a block in HDFS so large ? (Refer Two Marks Q.5 of Chapter - 3)
Q.6 What is MapFile ? (Refer Two Marks Q.11 of Chapter - 3)
Q.7 Define MapReduce. (Refer Two Marks Q.1 of Chapter - 4)
Q.8 Explain First In First Out (FIFO) scheduling. (Refer Two Marks Q.7 of Chapter - 4)
Q.9 What is Pig storage ? (Refer Two Marks Q.11 of Chapter - 5)
Q.10 What is Zookeeper ? (Refer Two Marks Q.5 of Chapter - 5)
PART B - (5 x 13 = 65 Marks)
Q.11 a) i) What is unstructured data ? Compare structured and unstructured data. (Refer section 1.3) [6]
        ii) Explain applications of big data. (Refer section 1.6) [7]
OR
     b) i) What is web analytics ? Why is web analytics important ? (Refer section 1.5) [6]
        ii) Draw and explain the Hadoop ecosystem. (Refer section 1.8.1) [7]
Q.12 a) i) Briefly discuss schemaless databases. (Refer section 2.3) [6]
        ii) What is the CAP theorem ? Explain. (Refer section 2.1.3) [7]
OR
     b) i) … with replication. (Refer section 2.5.6) [6]
        ii) Discuss read and write quorums. (Refer section 2.6.3) [7]
Q.13 a) i) What is Hadoop streaming ? Explain features of Hadoop streaming. (Refer section 3.2) [6]
        ii) Explain the … mechanism of HDFS. (Refer section 3.4.5) [7]
OR
     b) i) …
        ii) Avro (Refer section 3.5.6)
        iii) Data integrity in HDFS (Refer section 3.5)
        iv) … (Refer section 3.5.2) [13]
Q.14 a) i) Discuss data flow in the MapReduce programming model. (Refer section 4.1.2) [6]
        ii) Write a short note on YARN. (Refer section 4.4) [7]
OR
     b) i) Discuss input - output formats of MapReduce. (Refer section 4.9.1) [6]
        ii) What is the capacity scheduler ? Compare the capacity and fair schedulers. (Refer sections 4.6.3 and 4.6.4) [7]
Q.15 a) i) What is HBase ? Draw the architecture of HBase. Explain the difference between HDFS and HBase. (Refer section 5.1) [13]
OR
     b) i) Write a short note on HBase clients. (Refer section 5.3) [6]
        ii) What is Pig ? Explain features of Pig. Draw the architecture of Pig. (Refer section 5.5) [7]
PART C - (1 x 15 = 15 Marks)
Q.16 a) i) What is open source technology ? Explain advantages, disadvantages and applications of open source. (Refer section 1.9) [7]
        ii) Explain failures in classic MapReduce and YARN. (Refer section 4.5) [8]
OR
     b) Explain with diagrams the various aggregate data models of NoSQL. (Refer section 2.2) [15]