0% found this document useful (0 votes)

35 views42 pages

UNIT5

The document discusses several Hadoop related tools including Hbase, Pig, and Hive. It provides details on the data model and implementations of Hbase including tables, rows, column families, and cells. It also discusses Pig's data model and Pig Latin. Hive's data types, file formats, and queries are mentioned as well.

Uploaded by

2CSE19 Dharun P

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

35 views42 pages

UNIT5

Uploaded by

2CSE19 Dharun P

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 42

UNIT -V

HADOOP RELATED TOOLS

Hbase – data model and implementations – Hbase clients – Hbase examples – praxis. Pig – Grunt – pig
data model – Pig Latin – developing and testing Pig Latin scripts. Hive – data types and file formats –
HiveQL data definition – HiveQL data manipulation – HiveQL queries.

Hbase
HBase is a distributed column-oriented database built on top of HDFS. HBase is the Hadoop
application to use when you require real-time read/write random access to very large datasets. Although
there are countless strategies and implementations for database storage and retrieval, most solutions—
especially those of the relational variety—are not built with very large scale and distribution in mind.
Many vendors offer replication and parti‐ tioning solutions to grow the database beyond the confines
of a single node, but these add-ons are generally an afterthought and are complicated to install and maintain
Whirlwind Tour of the Data Model Applications store data in labeled tables. Tables are made of
rows and columns. Table cells—the intersection of row and column coordinates—are versioned. By
default, their version is a timestamp auto-assigned by HBase at the time of cell insertion. A cell’s content is
an uninterpreted array of bytes. An example HBase table for storing photos

Table row keys are also byte arrays, so theoretically anything can serve as a row key, from strings to
binary representations of long or even serialized data structures. Table rows are sorted by row key, aka
the table’s primary key. The sort is byte-ordered. All table accesses are via the primary key

Row columns are grouped into column families. All column family members have a common prefix, so,
for example, the columns info:format and info:geo are both members of the info column family, whereas
contents:image belongs to the contents family. The column family prefix must be composed of printable
characters. The qual‐ ifying tail, the column family qualifier, can be made of any arbitrary bytes. The
column family and the qualifier are always separated by a colon character (:).

A table’s column families must be specified up front as part of the table schema defini‐ tion, but
new column family members can be added on demand. For example, a new column info:camera can be
offered by a client as part of an update, and its value per‐ sisted, as long as the column family info
already exists on the table

Regions

Tables are automatically partitioned horizontally by HBase into regions. Each region comprises a subset
of a table’s rows. A region is denoted by the table it belongs to, its first row (inclusive), and its last row
(exclusive). Initially, a table comprises a single region, but as the region grows it eventually crosses a
configurable size threshold, at which point it splits at a row boundary into two new regions of
approximately equal size. Until this first split happens, all loading will be against the single server
hosting the original region.

Locking

Row updates are atomic, no matter how many row columns constitute the row-level transaction. This
keeps the locking model simple.

Implementation

Just as HDFS and YARN are built of clients, workers, and a coordinating master—the namenode and
datanodes in HDFS and resource manager and node managers in YARN—so is HBase made up of an
HBase master node orchestrating a cluster of one or more regionserver workers (see Figure 20-2). The
HBase master is responsible for bootstrapping a virgin install, for assigning regions to registered
regionservers, and for recovering regionserver failures. The master node is lightly loaded. The
regionservers carry zero or more regions and field client read/write requests. They also manage region
splits, informing the HBase master about the new daughter regions so it can manage the offlining of
parent regions and assignment of the replacement daughters.
by default it manages a ZooKeeper instance as the authority on cluster state, although
it can be configured to use an existing ZooKeeper cluster instead. The ZooKeeper ensemble hosts
vitals such as the location of the hbase:meta catalog table and the address of the current cluster
master. Assignment of regions is mediated via ZooKeeper in case participating servers crash
midassignment.
Hosting the assignment transaction state in ZooKeeper makes it so recovery can pick up
on the assignment where the crashed server left off. At a minimum, when bootstrapping a client
connection to an HBase cluster, the client must be passed the location of the ZooKeeper
ensemble. Thereafter, the client navigates the ZooKeeper hierarchy to learn cluster attributes such
as server locations.

HBase in operation Internally, HBase keeps a special catalog table named hbase:meta,
within which it maintains the current list, state, and locations of all user-space regions afloat on
the cluster. Entries in hbase:meta are keyed by region name, where a region name is made up of
the name of the table the region belongs to, the region’s start row, its time of creation, and finally,
an MD5 hash of all of these (i.e., a hash of table name, start row, and creation timestamp). Here is
an example region name for a region in the table TestTable whose start row is xyz:
TestTable,xyz,1279729913622.1b6e176fb8d8aa88fd4ab6bc80247ece.

Commas delimit the table name, start row, and timestamp. The MD5 hash is surrounded by a
leading and trailing period.

Enter HBase, which has the following characteristics:

No real indexes

Rows are stored sequentially, as are the columns within each row. Therefore, no issues with

index bloat, and insert performance is independent of table size.

Automatic partitioning

As your tables grow, they will automatically be split into regions and distributed across all
available nodes.

Scale linearly and automatically with new nodes

Add a node, point it to the existing cluster, and run the regionserver. Regions will automatically rebalance,
and load will spread evenly
DATA MODEL AND IMPLEMENTATIONS

Apache HBase data model is distributed, multidimensional, persistent, and a sorted

amp that is index by the column key, row key, and timestamp and that is the reason
Apache HBase is also called a key-value storage system.

The following are the Data model terminology used in Apache HBase.

1. Table
Apache HBase organizes data into tables which are composed of character and easy to
use with the file system.

2. Row
Apache HBase stores its data based on rows and each row has its unique row key. The
row key is represented as a byte array.

3. Column Family
The column families are used to store the rows and it also provides the structure to
store data in Apache HBase. It is composed of characters and strings and can be used
with a file system path. Each row in the table will have the same columns family but a
row doesn't need to be stored in all of its column family.

4. Column Qualifier
A column qualifier is used to point to the data that is stored in a column family. It is
always represented as a byte.

5. Cell
The cell is the combination of the column family, row key, column qualifier, and
generally, it is called a cell's value.

6. Timestamp
The value which is stored in the cell are versioned and each version is identified by a
version number that is assigned during creation time. In case if we don't mention
timestamp while writing data then the current time is considered.

A sample table in Apache HBase should look like below.

The above table has two column families which are named Personal and Office. Both
column families are having two columns. The data is stored in a cell and rows are
sorted by row keys.

HBase Data Types

In Apache HBase, there is no such concept of data types. It’s all byte array. It's a kind
of byte-in and byte-out database, in which, when a value is inserted, it is converted into
a byte array using the Put and Result interfaces. Apache HBase uses the serialization
framework to converts user data into byte arrays.

We can store the value up to 10 to 15 MB in the Apache HBase cell. In case the value
is higher then we can store it in Hadoop HDFS and store the file path metadata
information in Apache HBase.

HBase Data Store

The following is the Conceptual and Physical view of Apache HBase.

1. Conceptual View
We can see that a table is viewed as a set of rows at the conceptual level.

The following is the conceptual view of how data is stored in HBase

2. Physical View
The Physical view table is physically stored by the column family.

The following example represents the tables that will be stored as column-family-
based tables.

Namespace
A namespace is a logical grouping of tables. It is similar to relational databases in
group related tables.

Let us see the representation of namespace.

Now let us see each component of the namespace.

1. Table
All tables are part of the namespace. If there is no namespace defined then the table
will be assigned to the default namespace.

2. RegionServer group
It is possible to have a default RegionServer group for a namespace. In that case, a
table created will be a member of RegionServer.

3. Permission
Using namespace a user can define Access Control Lists such as a read, delete, and
update permission, and by using write permission a user can create a table.

4. Quota
This component is used to define a quota that the namespace can contain for tables and
regions.

5. Predefined namespaces
There are two predefined special namespaces.

 hbase: This is a system namespace that is used to contain HBase internal tables.
 default: This namespace is for all the tables for which a namespace is not
defined.

Data Model Operations

The major operation data models are Get, Put, Scan, and Delete. Using these
operations we can read, write and delete records from a table.

Let us see each operation in detail.

1. Get
Get operation is similar to the Select statement of the relational database. It is used to
fetch the content of an HBase table.

We can execute the Get command on the HBase shell as below.

hbase(main) :001:0> get 'table name', 'row key' <filters>

2. Put
Put operation is used to read multiple rows of a table. It is different from getting in
which we need to specify a set of rows to read. Using Scan we can iterate through a
range of rows or all the rows in a table.

3. Scan
Scan operation is used to read multiple rows of a table. It is different from Get in
which we need to specify a set of rows to read. Using Scan we can iterate through a
range of rows or all the rows in a table.

4. Delete
Delete operation is used to delete a row or a set of rows from an HBase table. It can be
executed through HTable.delete().

Once the Delete command is executed, it is marked as a tombstone and when

compaction takes place, the row is finally deleted from the table.

The various types of internal delete markers as below.

 Delete It is used for a specific version of a column.

 Delete column Can be used for all column version.
 Delete family It is used for all columns of a particular ColumnFamily.

Praxis

In this section, we discuss some of the common issues users run into when running an HBase cluster
under load.

HDFS
HBase’s use of HDFS is very different from how it’s used by MapReduce. In MapReduce, generally,
HDFS files are opened with their content streamed through a map task and then closed. In HBase,
datafiles are opened on cluster startup and kept open so that we avoid paying the costs associated with
opening files on each access. Because of this, HBase tends to see issues not normally encountered by
MapReduce clients:

Running out of file descriptors

Because we keep files open, on a loaded cluster it doesn’t take long before we run into system-
and Hadoop-imposed limits. For instance, say we have a cluster that has three nodes, each running an
instance of a datanode and a regionserver, and we’re running an upload into a table that is currently at
100 regions and 10 column families. Allow that each column family has on average two flush files.
Doing the math, we can have 100 × 10 × 2, or 2,000, files open at any one time. Add to this total other
miscellaneous descriptors consumed by outstanding scanners and Java libraries. Each open file
consumes at least one descriptor over on the remote data‐ node.

The default limit on the number of file descriptors per process is 1,024. When we exceed the
filesystem ulimit, we’ll see the complaint about “Too many open files” in logs, but often we’ll first see
indeterminate behavior in HBase. The fix requires increasing the file descriptor ulimit count; 10,240 is
a common setting. Consult the HBase Reference Guide for how to increase the ulimit on your cluster.

Running out of datanode threads

Similarly, the Hadoop datanode has an upper bound on the number of threads it can run at any one
time. Hadoop 1 had a low default of 256 for this setting (dfs.da tanode.max.xcievers), which would
cause HBase to behave erratically. Hadoop 2 increased the default to 4,096, so you are much less
likely to see a problem for recent versions of HBase (which only run on Hadoop 2 and later). You can
change the setting by configuring dfs.datanode.max.transfer.threads (the new name for this property)
in hdfs-site.xml.

HBase runs a web server on the master to present a view on the state of your running cluster. By
default, it listens on port 60010. The master UI displays a list of basic attributes such as software
versions, cluster load, request rates, lists of cluster tables, and participating regionservers. Click on a
regionserver in the master UI, and you are taken to the web server running on the individual
regionserver. It lists the regions this server is carrying and basic metrics such as resources consumed
and request rates.

Metrics

Hadoop has a metrics system that can be used to emit vitals over a period to a context (this is covered
in “Metrics and JMX” on page 331). Enabling Hadoop metrics, and in particular tying them to Ganglia
or emitting them via JMX, will give you views on what is happening on your cluster, both currently
and in the recent past. HBase also adds metrics of its own—request rates, counts of vitals, resources
used. See the file hadoopmetrics2-hbase.properties under the HBase conf directory. Counters

At StumbleUpon, the first production feature deployed on HBase was keeping counters for the
stumbleupon.com frontend. Counters were previously kept in MySQL, but the rate of change was such
that drops were frequent, and the load imposed by the counter writes was such that web designers self
imposed limits on what was counted. Using the incrementColumnValue() method on HTable, counters
can be incremented many thousands of times a second.

Pig

Pig is a scripting platform that runs on Hadoop clusters designed to process and analyze large datasets.
Pig is extensible, self-optimizing, and easily programmed.

Programmers can use Pig to write data transformations without knowing Java. Pig uses both structured
and unstructured data as input to perform analytics and uses HDFS to store the results.

Components of Pig
There are two major components of the Pig:

 Pig Latin script language

 A runtime engine

Pig Latin script language

The Pig Latin script is a procedural data flow language. It contains syntax and commands that can
be applied to implement business logic. Examples of Pig Latin are LOAD and STORE.

A runtime engine

The runtime engine is a compiler that produces sequences of MapReduce programs. It uses HDFS
to store and retrieve data. It is also used to interact with the Hadoop system (HDFS and
MapReduce).

The runtime engine parses, validates, and compiles the script operations into a sequence of
MapReduce jobs.

How Pig Works and Stages of Pig Operations

Pig operations can be explained in the following three stages:

Stage 1: Load data and write Pig script

In this stage, data is loaded and Pig script is written.

A = LOAD ‘myfile’

AS (x, y, z);

B = FILTER A by x > 0;

C = GROUP B BY x;

D = FOREACH A GENERATE

x, COUNT(B);

STORE D INTO ‘output’;

Stage 2: Pig Operations

In the second stage, the Pig execution engine Parses and checks the script. If it passes the script
optimized and a logical and physical plan is generated for execution.
The job is submitted to Hadoop as a job defined as a MapReduce Task. Pig Monitors the status of
the job using Hadoop API and reports to the client.

Stage 3: Execution of the plan

In the final stage, results are dumped on the section or stored in HDFS depending on the user
command.

Let us now understand a few salient features of Pig

Salient Features of Pig

Developers and analysts like to use Pig as it offers many features. Some of the features are as
follows:

 Provision for step-by-step procedural control and the ability to operate directly over files

 Schemas that, though optional, can be assigned dynamically

 Support to User Defined Functions, or UDFs, and to various data types

Pig works in two execution modes:

Local and MapReduce.

Local mode

In the local mode, the Pig engine takes input from the Linux file system and the output is stored in
the same file system. Pig Execution local mode is explained below.

MapReduce mode
In MapReduce mode, the Pig engine directly interacts and executes in HDFS and MapReduce as
shown in the diagram given below.

Let us now look into interactive modes of Pig.

Pig Interactive Modes

The two modes in which a Pig Latin program can be written are Interactive and Batch.

Interactive mode

Interactive mode means coding and executing the script, line by line, as shown in the image given
below.

Batch mode

In Batch mode, all scripts are coded in a file with the extension .pig and the file is directly
executed as shown in the diagram given below.
Since we have already learned about Hive and Impala which works on SQL, let’s now see how
Pig is different from SQL.

Pig vs. SQL

Given below are some differences between Pig and Sql.

Difference Pig SQL

Pig is a scripting language used to SQL is a query language used to interact with
Definition
interact with HDFS. databases residing in the database engine.

Pig offers a step-by-step execution

Query Style SQL offers the single block execution style.
style.

Pig does a lazy evaluation, which

means that data is processed only
Evaluation SQL offers immediate evaluation of a query.
when the STORE or DUMP
command is encountered.

In SQL, you need to run the “join” command

Pipeline
Pipeline Splits are supported in Pig. twice for the result to be materialized as an
Splits
intermediate result.
PIG DATA MODEL

Types

Pig’s data types can be divided into two categories: scalar types, which contain a single value,
and complex types, which contain other types.

Scalar Types

Pig’s scalar types are simple types that appear in most programming languages. With the exception
of bytearray, they are all represented in Pig interfaces by java.lang classes, making them easy to
work with in UDFs:

int

An integer. Ints are represented in interfaces by java.lang.Integer. They store a four-byte signed
integer. Constant integers are expressed as integer numbers, for example, 42.

long

A long integer. Longs are represented in interfaces by java.lang.Long. They store an eight-byte
signed integer. Constant longs are expressed as integer numbers with an L appended, for
example, 5000000000L.

float

A floating-point number. Floats are represented in interfaces by java.lang.Float and use four bytes to
store their value. You can find the range of values representable by Java’s Float type
at http://java.sun.com/docs/books/jls/third_edition/html/typesValues.html#4.2.3. Note that because
this is a floating-point number, in some calculations it will lose precision. For calculations that
require no loss of precision, you should use an int or long instead. Constant floats are expressed as a
floating-point number with an f appended. Floating-point numbers can be expressed in simple
format, 3.14f, or in exponent format, 6.022e23f.

double

A double-precision floating-point number. Doubles are represented in interfaces

by java.lang.Double and use eight bytes to store their value. You can find the range of values
representable by Java’s Double type
at http://java.sun.com/docs/books/jls/third_edition/html/typesValues.html#4.2.3. Note that because
this is a floating-point number, in some calculations it will lose precision. For calculations that
require no loss of precision, you should use an int or long instead. Constant doubles are expressed as
a floating-point number in either simple format, 2.71828, or in exponent format, 6.626e-34.

chararray

A string or character array. Chararrays are represented in interfaces by java.lang.String. Constant

chararrays are expressed as string literals with single quotes, for example, 'fred'. In addition to
standard alphanumeric and symbolic characters, you can express certain characters in chararrays by
using backslash codes, such as \t for Tab and \n for Return. Unicode characters can be expressed
as \u followed by their four-digit hexadecimal Unicode value. For example, the value for Ctrl-A is
expressed as \u0001.

bytearray

A blob or array of bytes. Bytearrays are represented in interfaces by a Java class DataByteArray that
wraps a Java byte[]. There is no way to specify a constant bytearray.

Complex Types

Pig has three complex data types: maps, tuples, and bags. All of these types can contain data of any
type, including other complex types. So it is possible to have a map where the value field is a bag,
which contains a tuple where one of the fields is a map.

Map

A map in Pig is a chararray to data element mapping, where that element can be any Pig type,
including a complex type. The chararray is called a key and is used as an index to find the element,
referred to as the value.

Because Pig does not know the type of the value, it will assume it is a bytearray. However, the actual
value might be something different. If you know what the actual type is (or what you want it to be),
you can cast it; see Casts. If you do not cast the value, Pig will make a best guess based on how you
use the value in your script. If the value is of a type other than bytearray, Pig will figure that out at
runtime and handle it. See Schemas for more information on how Pig handles unknown types.

By default there is no requirement that all values in a map must be of the same type. It is legitimate
to have a map with two keys name and age, where the value for name is a chararray and the value
for age is an int. Beginning in Pig 0.9, a map can declare its values to all be of the same type. This is
useful if you know all values in the map will be of the same type, as it allows you to avoid the
casting, and Pig can avoid the runtime type-massaging referenced in the previous paragraph.
Map constants are formed using brackets to delimit the map, a hash between keys and values, and a
comma between key-value pairs. For example, ['name'#'bob', 'age'#55] will create a map with two
keys, “name” and “age”. The first value is a chararray, and the second is an integer.

Tuple

A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with
each field containing one data element. These elements can be of any type—they do not all need to
be the same type. A tuple is analogous to a row in SQL, with the fields being SQL columns. Because
tuples are ordered, it is possible to refer to the fields by position; see Expressions in foreach for
details. A tuple can, but is not required to, have a schema associated with it that describes each
field’s type and provides a name for each field. This allows Pig to check that the data in the tuple is
what the user expects, and it allows the user to reference the fields of the tuple by name.

Tuple constants use parentheses to indicate the tuple and commas to delimit fields in the tuple. For
example, ('bob', 55) describes a tuple constant with two fields.

Bag

A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference
tuples in a bag by position. Like tuples, a bag can, but is not required to, have a schema associated
with it. In the case of a bag, the schema describes all tuples within the bag.

Bag constants are constructed using braces, with tuples in the bag separated by commas. For
example, {('bob', 55), ('sally', 52), ('john', 25)} constructs a bag with three tuples, each with two
fields.

Pig users often notice that Pig does not provide a list or set type that can store items of any type. It is
possible to mimic a set type using the bag, by wrapping the desired type in a tuple of one field. For
instance, if you want to store a set of integers, you can create a bag with a tuple with one field, which
is an int. This is a bit cumbersome, but it works.

Bag is the one type in Pig that is not required to fit into memory. As you will see later, because bags
are used to store collections when grouping, bags can become quite large. Pig has the ability to spill
bags to disk when necessary, keeping only partial sections of the bag in memory. The size of the bag
is limited to the amount of local disk available for spilling the bag.

MEMORY REQUIREMENTS OF PIG DATA TYPES

In the previous sections I often referenced the size of the value stored for each type (four
bytes for integer, eight bytes for long, etc.). This tells you how large (or small) a value those types
can hold. However, this does not tell you how much memory is actually used by objects of those
types. Because Pig uses Java objects to represent these values internally, there is an additional
overhead. This overhead depends on your JVM, but it is usually eight bytes per object. It is even
worse for chararrays because Java’s String uses two bytes per character rather than one.

So, if you are trying to figure out how much memory you need in Pig to hold all of your data
(e.g., if you are going to do a join that needs to hold a hash table in memory), do not count the bytes
on disk and assume that is how much memory you need. The multiplication factor between disk and
memory is dependent on your data, whether your data is compressed on disk, your disk storage
format, etc. As a rule of thumb, it takes about four times as much memory as it does disk to represent
the uncompressed data.

PIG LATIN

This section gives an informal description of the syntax and semantics of the Pig Latin
programming language.3 It is not meant to offer a complete reference to the language, 4 but there should
be enough here for you to get a good understanding of Pig Latin’s constructs.

Structure

A Pig Latin program consists of a collection of statements. A statement can be thought of as an

operation or a command.5 For example, a GROUP operation is a type of statement:

grouped_records = GROUP records BY year;

The command to list the files in a Hadoop filesystem is another example of a statement:

ls /

Statements are usually terminated with a semicolon, as in the example of the GROUP statement.
In fact, this is an example of a statement that must be terminated with a semicolon; it is a syntax
error to omit it. The ls command, on the other hand, does not have to be terminated with a
semicolon. As a general guideline, statements or commands for interactive use in Grunt do not
need the terminating semicolon. This group includes the interactive Hadoop commands, as well as
the diagnostic operators such as DESCRIBE. It’s never an error to add a terminating semicolon,
so if in doubt, it’s simplest to add one.

Statements that have to be terminated with a semicolon can be split across multiple lines for
readability:

records = LOAD 'input/ncdc/micro-tab/sample.txt'

AS (year:chararray, temperature:int, quality:int);

Pig Latin has two forms of comments. Double hyphens are used for single-line comments.
Everything from the first hyphen to the end of the line is ignored by the Pig Latin interpreter:

-- My program

DUMP A; -- What's in A?

C-style comments are more flexible since they delimit the beginning and end of the comment
block with /* and */ markers. They can span lines or be embedded in a single line:*
* Description of my program spanning
* multiple lines.
*/
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
DUMP C;

Expressions
An expression is something that is evaluated to yield a value. Expressions can be used in Pig
as a part of a statement containing a relational operator. Pig has a rich variety of expressions,
many of which will be familiar from other programming languages
Types
Pig has a boolean type and six numeric types: int, long, float, double, biginteger, and
bigdecimal, which are identical to their Java counterparts. There is also a bytearray type, like
Java’s byte array type for representing a blob of binary data, and chararray, which, like
java.lang.String, represents textual data in UTF-16 format (although it can be loaded or stored in
UTF-8 format).
Schemas
A relation in Pig may have an associated schema, which gives the fields in the relation names and
types. We’ve seen how an AS clause in a LOAD statement is used to attach a schema to a
relation:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}

DEVELOPING AND TESTING PIG LATIN SCRIPTS

Development Tools
Pig provides several tools and diagnostic operators to help you develop your applications. In
this section we will explore these and also look at some tools others have written to make it easier
to develop Pig with standard editors and integrated development environments (IDEs).
Syntax Highlighting and Checking
Syntax highlighting often helps users write code correctly, at least syntactically, the first time
around. Syntax highlighting packages exist for several popular editors. The packages listed
in Table 7-1 were created and added at various times, so how their highlighting conforms with
current Pig Latin syntax varies.

describe

describe shows you the schema of a relation in your script. This can be very helpful as you are
developing your scripts. It is especially useful as you are learning Pig Latin and understanding
how various operators change the data. describe can be applied to any relation in your script, and
you can have multiple describes in a script:
--describe.pig

divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,

date:chararray, dividends:float);

trimmed = foreach divs generate symbol, dividends;

grpd = group trimmed by symbol;

avgdiv = foreach grpd generate group, AVG(trimmed.dividends);

describe trimmed;

describe grpd;

describe avgdiv;

trimmed: {symbol: chararray,dividends: float}

grpd: {group: chararray,trimmed: {(symbol: chararray,dividends: float)}}

avgdiv: {group: chararray,double}

describe uses Pig’s standard schema syntax. For information on this syntax, see Schemas. So, in
this example, the relation trimmed has two fields: symbol, which is a chararray, and dividends,
which is a float. grpd also has two fields, group (the name Pig always assigns to the group by key)
and a bag trimmed, which matches the name of the relation that Pig grouped to produce the bag.
Tuples in trimmed have two fields: symbol and dividends. Finally, in avgdiv there are two fields,
group and a double, which is the result of the AVG function and is unnamed.

explain

One of Pig’s goals is to allow you to think in terms of data flow instead of MapReduce. But
sometimes you need to peek into the barn and see how Pig is compiling your script into
MapReduce jobs. Pig provides explain for this. explain is particularly helpful when you are trying
to optimize your scripts or debug errors. It was written so that Pig developers could examine how
Pig handled various scripts, thus its output is not the most user-friendly. But with some effort,
explain can help you write better Pig Latin.

There are two ways to use explain. You can explain any alias in your Pig Latin script, which will
show the execution plan Pig would use if you stored that relation. You can also take an existing
Pig Latin script and apply explain to the whole script in Grunt. This has a couple of advantages.
One, you do not have to edit your script to add the explain line. Two, it will work with scripts that
do not have a single store, showing how Pig will execute the entire script:

--explain.pig

divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);

grpd = group divs by symbol;

avgdiv = foreach grpd generate group, AVG(divs.dividends);

store avgdiv into 'average_dividend';

bin/pig -x local -e 'explain -script explain.pig'

This will produce a printout of several graphs in text format; we will examine this output
momentarily. When using explain on a script in Grunt, you can also have it print out the plan in
graphical format.

The flow of this chart is bottom to top so that the Load operator is at the very
bottom. The lines between operators show the flow. Each of the four operators
created by the script (Load, CoGroup, ForEach, and Store) can be seen. Each of
these operators also has a schema, described in standard schema syntax.
The CoGroup and ForEach operators also have expressions attached to them (the
lines dropping down from those operators). In the CoGroup operator, the
projection indicates which field is the grouping key (in this case, field 1).
The ForEach operator has a projection expression that projects field 0 (the
group field) and a UDF expression, which indicates that the UDF being used
is org.apache.pig.builtin.AVG. Notice how each of the Project operators has
an Input field, indicating from which operator they are drawing their
input. Figure 7-2 shows how this plan looks when the -dot option is used
instead.
After optimizing the logical plan, Pig produces a physical plan. This plan
describes the physical operators Pig will use to execute the script, without
reference to how they will be executed in MapReduce. The physical plan for
our plan in Figure 7-1 is shown in Figure 7-3.

This looks like the logical plan, but with a few differences. The load and store functions that will
be used have been resolved (in this case to org.apache.pig.builtin.PigStorage, the default load and
store function), and the actual paths that will be used have been resolved. This example was run in
local mode, so the paths are local files. If it had been run on a cluster, it would have showed a
path like hdfs://nn.machine.domain/filepath.

HIVE

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large
datasets residing in distributed storage using SQL. The structure can be projected onto data
already in storage."
In other words, Hive is an open-source system that processes structured data in Hadoop, residing
on top of the latter for summarizing Big Data, as well as facilitating analysis and queries.

Architecture of Hive

Hive chiefly consists of three core parts:

 Hive Clients: Hive offers a variety of drivers designed for communication with different
applications. For example, Hive provides Thrift clients for Thrift-based applications.
These clients and drivers then communicate with the Hive server, which falls under Hive
services.

 Hive Services: Hive services perform client interactions with Hive. For example, if a
client wants to perform a query, it must talk with Hive services.

 Hive Storage and Computing: Hive services such as file system, job client, and meta store
then communicates with Hive storage and stores things like metadata table information
and query results.

Hive's Features

These are Hive's chief characteristics:

 Hive is designed for querying and managing only structured data stored in tables
 Hive is scalable, fast, and uses familiar concepts

 Schema gets stored in a database, while processed data goes into a Hadoop Distributed File System
(HDFS)

 Tables and databases get created first; then data gets loaded into the proper tables

 Hive supports four file formats: ORC, SEQUENCEFILE, RCFILE (Record Columnar File), and
TEXTFILE

 Hive uses an SQL-inspired language, sparing the user from dealing with the complexity of MapReduce
programming. It makes learning more accessible by utilizing familiar concepts found in relational
databases, such as columns, tables, rows, and schema, etc.

 The most significant difference between the Hive Query Language (HQL) and SQL is that Hive
executes queries on Hadoop's infrastructure instead of on a traditional database

Limitations of Hive

Of course, no resource is perfect, and Hive has some limitations. They are:

 Hive doesn’t support OLTP. Hive supports Online Analytical Processing (OLAP), but not Online
Transaction Processing (OLTP).

 It doesn’t support subqueries.

 It has a high latency.

 Hive tables don’t support delete or update operations.

Hive Modes

Depending on the size of Hadoop data nodes, Hive can operate in two different modes:

 Local mode

 Map-reduce mode

User Local mode when:

 Hadoop is installed under the pseudo mode, possessing only one data node

 The data size is smaller and limited to a single local machine

 Users expect faster processing because the local machine contains smaller datasets.

Use Map Reduce mode when:

 Hadoop has multiple data nodes, and the data is distributed across these different nodes
 Users must deal with more massive data sets

MapReduce is Hive's default mode.

Hive and Hadoop on AWS

Amazon Elastic Map Reduce (EMR) is a managed service that lets you use big data processing
frameworks such as Spark, Presto, Hbase, and, yes, Hadoop to analyze and process large data
sets. Hive, in turn, runs on top of Hadoop clusters, and can be used to query data residing in
Amazon EMR clusters, employing an SQL language.

Relational Database Hive

Maintains a database Maintains a data warehouse

Fixed schema Varied schema

Sparse tables Dense tables

Supports automation
Doesn’t support partitioning
partition

Uses HQL (Hive Query

Uses SQL (Structured Query Language)
Language)
Hive Different File Formats

Different file formats and compression codecs work better for different data sets in Apache Hive.

Following are the Apache Hive different file formats:

 Text File
 Sequence File
 RC File
 AVRO File
 ORC File
 Parquet File
Hive Text File Format

Hive Text file format is a default storage format. You can use the text format to interchange the
data with other client application. The text file format is very common most of the applications.
Data is stored in lines, with each line being a record. Each lines are terminated by a newline
character (\n).
The text format is simple plane file format. You can use the compression (BZIP2) on the text file
to reduce the storage spaces.
Create a TEXT file by add storage option as ‘STORED AS TEXTFILE’ at the end of a Hive
CREATE TABLE command.

Hive Text File Format Examples

Below is the Hive CREATE TABLE command with storage format specification:

Create table textfile_table

(column_specs)

stored as textfile;

Hive Sequence File Format

Sequence files are Hadoop flat files which stores values in binary key-value pairs. The
sequence files are in binary format and these files are able to split. The main advantages of using
sequence file is to merge two or more files into one file.
Create a sequence file by add storage option as ‘STORED AS SEQUENCEFILE’ at the end of a
Hive CREATE TABLE command.

Hive Sequence File Format Example

Below is the Hive CREATE TABLE command with storage format specification:
Create table sequencefile_table

(column_specs)

stored as sequencefile;

Hive RC File Format

RCFile is row columnar file format. This is another form of Hive file format which offers
high row level compression rates. If you have requirement to perform multiple rows at a time then
you can use RCFile format.
The RCFile are very much similar to the sequence file format. This file format also stores the data
as key-value pairs.

Create RCFile by specifying ‘STORED AS RCFILE’ option at the end of a CREATE TABLE
Command:

Hive RC File Format Example

Below is the Hive CREATE TABLE command with storage format specification:

Create table RCfile_table

(column_specs)

stored as rcfile;

Hive AVRO File Format

AVRO is open source project that provides data serialization and data exchange services
for Hadoop. You can exchange data between Hadoop ecosystem and program written in any
programming languages. Avro is one of the popular file format in Big Data Hadoop based
applications.
Create AVRO file by specifying ‘STORED AS AVRO’ option at the end of a CREATE TABLE
Command.

Hive AVRO File Format Example

Below is the Hive CREATE TABLE command with storage format specification:

Create table avro_table

(column_specs)

stored as avro;
Hive ORC File Format

The ORC file stands for Optimized Row Columnar file format. The ORC file format
provides a highly efficient way to store data in Hive table. This file system was actually designed
to overcome limitations of the other Hive file formats. The Use of ORC files improves
performance when Hive is reading, writing, and processing data from large tables.
Create ORC file by specifying ‘STORED AS ORC’ option at the end of a CREATE TABLE
Command.

Hive ORC File Format Examples

Below is the Hive CREATE TABLE command with storage format specification:

Create table orc_table

(column_specs)

stored as orc;

Hive Parquet File Format

Parquet is a column-oriented binary file format. The parquet is highly efficient for the
types of large-scale queries. Parquet is especially good for queries scanning particular columns
within a particular table. The Parquet table uses compression Snappy, gzip; currently Snappy by
default.
Create Parquet file by specifying ‘STORED AS PARQUET’ option at the end of a CREATE
TABLE Command.

Hive Parquet File Format Example

Below is the Hive CREATE TABLE command with storage format specification:

Create table parquet_table

(column_specs)

stored as parquet;

HIVE DATA TYPES

Hive data types are categorized into numeric types, string types, misc types, and complex types. A
list of Hive data types is given below.

a. Numeric Types

TINYINT (1-byte signed integer, from -128 to 127)

SMALLINT (2-byte signed integer, from -32,768 to 32,767)

INT (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)

BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)

FLOAT (4-byte single precision floating point number)

DOUBLE (8-byte double precision floating point number)

DECIMAL (Hive 0.13.0 introduced user definable precision and scale)

b. Date/Time Types

TIMESTAMP

DATE

c. String Types

STRING

VARCHAR

CHAR

d. Misc Types

BOOLEAN

BINARY

e. Complex Types

arrays:

It is a collection of similar types of values that are indexable using zero-based integers.

maps:

It contains the key-value tuples where the fields are accessed using array notation.

structs:

a complex data type in Hive that can store a set of fields of different data types.
2. Tables

In Hive, we can create a table by using conventions similar to SQL. It supports a wide range of
flexibility where the data files for tables are stored. It provides two types of table: -

- Internal table

- External table

Internal Table

The internal tables are also called managed tables as the lifecycle of their data is controlled
by the Hive. By default, these tables are stored in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). The internal tables are not flexible
enough to share with other tools like Pig. If we try to drop the internal table, Hive deletes both
table schema and data.

External Table

The external table allows us to create and access a table and data externally. The external
keyword is used to specify the external table, whereas the location keyword is used to determine
the location of loaded data.

As the table is external, the data is not present in the Hive directory. Therefore, if we try to drop
the table, the metadata of the table will be deleted, but the data still exists.

3. Partition

The partitioning in Hive means dividing the table into some parts based on the values of a
particular column like date, course, city or country. The advantage of partitioning is that since the
data is stored in slices, the query response time becomes faster.

The partitioning in Hive can be executed in two ways -

- Static partitioning

- Dynamic partitioning

i. Hive Static Partitioning

- In static or manual partitioning, it is required to pass the values of partitioned columns manually
while loading the data into the table. Hence, the data file doesn’t contain the partitioned columns.
- If you want to use the Static partition in the hive you should set property set hive.mapred.mode
= strict This property set by default in hive-site.xml

- Static partition is in Strict Mode.

ii. Hive Dynamic Partitioning

Dynamic partitions provide us with flexibility and create partitions automatically depending on
the data that we are inserting into the table.

To use DP we first need to enables :

- hive> set hive.exec.dynamic.partition=true;

- hive> set hive.exec.dynamic.partition.mode=nonstrict;

If you want to partition a number of columns but you don’t know how many columns then also
dynamic partition is suitable.

4. View

Basically, Apache Hive View is similar to Hive tables, which are generated on the basis of
requirements.

Use1: Filter Data

Use2: Reduce Complexity

5. Bucket

· Bucketing in the hive is the concept of breaking data down into ranges, which are known as
buckets, to give extra structure to the data so it may be used for more efficient queries. The range
for a bucket is determined by the hash value of one or more columns in the dataset

· We use the concept of bucketing when the partitioning is more difficult.

· Bucketing tables also can result in more efficient use of overall resources; memory utilization is
low when the joins are done at the bucket level, instead of doing a full broadcast join of one of the
tables. The greater the number of buckets, the less memory is needed — but too many buckets can
create unneeded parallelism. It may take some experimenting at first, but eventually, you will
figure out the ideal bucket count for highly efficient scans of the datasets.

·The concept of bucketing is based on the hashing technique.

Here, modules of the current column value and the number of required buckets is calculated (let
say, F(x) % 3).

HIVEQL: DATA DEFINITION & HIVEQL DATA MANIPULATION QUERIES

HiveQL is the Hive query language. Like all SQL dialects in widespread use, it doesn’t
fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to
MySQL’s dialect, but with significant differences. Hive offers no support for row-level inserts,
updates, and deletes. Hive doesn’t support transactions. Hive adds extensions to provide better
performance in the context of Hadoop and to integrate with custom extensions and even external
programs.

Databases in Hive

The Hive concept of a database is essentially just a catalog or namespace of tables. However, they
are very useful for larger clusters with multiple teams and users, as a way of avoiding table name
collisions. It’s also common to use databases to organize production tables into logical groups.

If you don’t specify a database, the default database is used.

The simplest syntax for creating a database is shown in the following example:

hive> CREATE DATABASE financials;

Hive will throw an error if financials already exists. You can suppress these warnings with this
variation:

hive> CREATE DATABASE IF NOT EXISTS financials;

hive> SHOW DATABASES;

default
financials

hive> CREATE DATABASE human_resources;

hive> SHOW DATABASES;

default
financials
human_resources
If you have a lot of databases, you can restrict the ones listed using a regular expression, a
concept we’ll explain in LIKE and RLIKE, if it is new to you. The following example lists only
those databases that start with the letter h and end with any other characters (the .* part):

hive> SHOW DATABASES LIKE 'h.*';

human_resources
hive> ...

You can override this default location for the new directory as shown in this example:

hive> CREATE DATABASE financials

> LOCATION '/my/preferred/directory';

You can add a descriptive comment to the database, which will be shown by the DESCRIBE
DATABASE <database> command.

hive> CREATE DATABASE financials

> COMMENT 'Holds all financial tables';

hive> DESCRIBE DATABASE financials;

financials Holds all financial tables
hdfs://master-server/user/hive/warehouse/financials.db

Finally, you can drop a database:

hive> DROP DATABASE IF EXISTS financials;

The IF EXISTS is optional and suppresses warnings if financials doesn’t exist.

By default, Hive won’t permit you to drop a database if it contains tables. You can either
drop the tables first or append the CASCADE keyword to the command, which will cause the
Hive to drop the tables in the database first:
hive> DROP DATABASE IF EXISTS financials CASCADE;

Alter Database

hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba');

Creating Tables

The CREATE TABLE statement follows SQL conventions, but Hive’s version offers
significant extensions to support a wide range of flexibility where the data files for tables are
stored, the formats used, etc

CREATE TABLE IF NOT EXISTS mydb.employees (

name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
LOCATION '/user/hive/warehouse/mydb.db/employees'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...);

Managed Tables

The tables we have created so far are called managed tables or sometimes called internal
tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen, Hive stores
the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.

External Tables

Suppose we are analyzing data from the stock markets. Periodically, we ingest the data for
NASDAQ and the NYSE from a source like Infochimps (http://infochimps.com/datasets) and we
want to study this data with many tools.
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Partitioned, Managed Tables

The general notion of partitioning data is an old one. It can take many forms, but often it’s
used for distributing load horizontally, moving data physically closer to its most frequent users,
and other purposes.

Hive has the notion of partitioned tables. We’ll see that they have important performance
benefits, and they can help organize data in a logical fashion, such as hierarchically.

We’ll discuss partitioned managed tables first. Let’s return to our employees table and imagine
that we work for a very large multinational corporation. Our HR people often run queries
with WHERE clauses that restrict the results to a particular country or to a particular first-level
subdivision (e.g., state in the United States or province in Canada). (First-level subdivision is an
actual term, used here, for example: http://www.commondatahub.com/state_source.jsp.) We’ll
just use the word state for simplicity. We have redundant state information in the address field. It is
distinct from the state partition. We could remove the state element from address. There is no
ambiguity in queries, since we have to use address.state to project the value inside the address. So, let’s
partition the data first by country and then by state:

CREATE TABLE employees (

name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

Partitioning tables changes how Hive structures the data storage. If we create this table in
the mydb database, there will still be an employees directory for the table:

hdfs://master_server/user/hive/warehouse/mydb.db/employees

However, Hive will now create subdirectories reflecting the partitioning structure.
For example:

...

.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
...
.../employees/country=US/state=AL
.../employees/country=US/state=AK
...

Once created, the partition keys (country and state, in this case) behave like regular columns. There

is one known exception, due to a bug (see Aggregate functions). In fact, users of the table don’t
need to care if these “columns” are partitions or not, except when they want to optimize

query performance.
For example, the following query selects all employees in the state of Illinois in the United States:
SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';
Note that because the country and state values are encoded in directory names, there is no reason to
have this data in the data files themselves. In fact, the data just gets in the way in the files, since you have
to account for it in the table schema, and this data wastes space.

UNIT 5 Notes
No ratings yet
UNIT 5 Notes
47 pages
Unit 5 BDA
No ratings yet
Unit 5 BDA
34 pages
Unit 5 Bda
No ratings yet
Unit 5 Bda
42 pages
HBase
No ratings yet
HBase
14 pages
HBase - Tutorial
No ratings yet
HBase - Tutorial
14 pages
Lesson 6 NoSQL Databases HBase
100% (1)
Lesson 6 NoSQL Databases HBase
47 pages
Chapter 12 HBase
No ratings yet
Chapter 12 HBase
108 pages
Pbds Unit-5
No ratings yet
Pbds Unit-5
60 pages
HBase
No ratings yet
HBase
39 pages
Bda - Unit 5
No ratings yet
Bda - Unit 5
30 pages
BDA Unit 5
No ratings yet
BDA Unit 5
33 pages
Unit V Hadoop Related Tools
No ratings yet
Unit V Hadoop Related Tools
54 pages
BDA Unit-4 Part-2 HBase, Hive, Pig
No ratings yet
BDA Unit-4 Part-2 HBase, Hive, Pig
74 pages
Bda Unit 5
No ratings yet
Bda Unit 5
16 pages
Hbase
100% (1)
Hbase
30 pages
Unit 5 Lecture No-3 (Hbase)
No ratings yet
Unit 5 Lecture No-3 (Hbase)
35 pages
Big Data Analytics Unit-5
No ratings yet
Big Data Analytics Unit-5
28 pages
H Base Tutorial
No ratings yet
H Base Tutorial
38 pages
BDM Unit 5
No ratings yet
BDM Unit 5
60 pages
Unit - IV - Notes
No ratings yet
Unit - IV - Notes
23 pages
DSS - U4 - HBASE Rev 1.0
No ratings yet
DSS - U4 - HBASE Rev 1.0
20 pages
Unit 4
No ratings yet
Unit 4
15 pages
Unit 5 Lecture No-3 (Hbase)
No ratings yet
Unit 5 Lecture No-3 (Hbase)
35 pages
Unit 5 Hbase - Hive - Pig
No ratings yet
Unit 5 Hbase - Hive - Pig
93 pages
Unit-5 Notes
No ratings yet
Unit-5 Notes
61 pages
HBASE
No ratings yet
HBASE
18 pages
Hadoop HBASE
No ratings yet
Hadoop HBASE
71 pages
Cs525: Special Topics in DBS: Large-Scale Data Management
No ratings yet
Cs525: Special Topics in DBS: Large-Scale Data Management
35 pages
Unit 5 Big Data
No ratings yet
Unit 5 Big Data
34 pages
Hbase - Quick Guide Hbase - Overview
No ratings yet
Hbase - Quick Guide Hbase - Overview
53 pages
Unit V
No ratings yet
Unit V
6 pages
HBase
No ratings yet
HBase
38 pages
BDT Unit - V
No ratings yet
BDT Unit - V
15 pages
HBase
No ratings yet
HBase
27 pages
Hadoop Week 6
No ratings yet
Hadoop Week 6
38 pages
Unit 5 Notes
100% (3)
Unit 5 Notes
66 pages
HBase (Unit 4)
No ratings yet
HBase (Unit 4)
37 pages
HBase
No ratings yet
HBase
30 pages
Lec 18
No ratings yet
Lec 18
18 pages
10 HBase
No ratings yet
10 HBase
13 pages
HBase
No ratings yet
HBase
6 pages
Big Data Unit 5
No ratings yet
Big Data Unit 5
18 pages
Unit - 5 Part - 1
No ratings yet
Unit - 5 Part - 1
8 pages
Unit 5 Hbase
No ratings yet
Unit 5 Hbase
15 pages
Cse 17CS82 M2 S4 PPT
No ratings yet
Cse 17CS82 M2 S4 PPT
19 pages
Apache HBase
No ratings yet
Apache HBase
12 pages
HBASE
No ratings yet
HBASE
11 pages
HBASE
No ratings yet
HBASE
35 pages
Lec 18
No ratings yet
Lec 18
21 pages
Hbase
No ratings yet
Hbase
3 pages
Big Data UNIT 5 Own
No ratings yet
Big Data UNIT 5 Own
18 pages
Big Data 22MSM40206
No ratings yet
Big Data 22MSM40206
9 pages
Assignment 10
No ratings yet
Assignment 10
9 pages
Hbase Big Table: Oriented vs. Column-Oriented Data Stores. As Shown Below, in A Row
No ratings yet
Hbase Big Table: Oriented vs. Column-Oriented Data Stores. As Shown Below, in A Row
6 pages
Assignment Day 10: Task 1
No ratings yet
Assignment Day 10: Task 1
8 pages
Columnar Database
No ratings yet
Columnar Database
18 pages
Large-Scale Data Management: Hbase
No ratings yet
Large-Scale Data Management: Hbase
36 pages
Hbase - in Detail: Pushpinder Singh Paxcel Technologies
No ratings yet
Hbase - in Detail: Pushpinder Singh Paxcel Technologies
32 pages
HBase
No ratings yet
HBase
31 pages