Hadoop Pig
HBase
● HBase is an open-source, distributed, sorted map datastore built on top of Hadoop. It is column-oriented and horizontally scalable.
● It is modeled on Google's Bigtable. It has a set of tables that keep data in key-value format.
● HBase is well suited to sparse data sets, which are very common in big data use cases.
● HBase provides APIs enabling development in practically any programming language.
● It is the part of the Hadoop ecosystem that provides random, real-time read/write access to data stored in the Hadoop Distributed File System.
Features of HBase
HBase Architecture:
● HMaster –
HMaster is the implementation of the Master Server in HBase. It assigns regions to region servers and handles DDL operations such as creating and deleting tables. It monitors all Region Server instances present in the cluster. In a distributed environment, the master runs several background threads. HMaster is also responsible for tasks such as load balancing and failover.
● Region Server –
HBase tables are divided horizontally by row-key range into regions. Regions are the basic building blocks of an HBase cluster: they hold the distributed portions of a table's data and are composed of column families. A Region Server runs on an HDFS DataNode in the Hadoop cluster, and it is responsible for handling, managing, and executing reads and writes on its set of regions. The default size of a region is 256 MB.
● Zookeeper –
ZooKeeper acts as a coordinator in HBase. It provides services such as maintaining configuration information, naming, distributed synchronization, and server-failure notification. Clients use ZooKeeper to locate the region servers they need to talk to (see the sketch below).
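As a minimal illustrative sketch of the client/ZooKeeper relationship (not from the source), the Java snippet below points an HBase client at a ZooKeeper ensemble; the hostnames and the table name are placeholders, and the older HBase client API (HBaseConfiguration/HTable) is assumed. The client discovers region servers through ZooKeeper rather than being given their addresses directly.

// A minimal sketch, assuming the older HBase client API (HBaseConfiguration/HTable);
// the ZooKeeper hostnames and the table name are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class HBaseConnectSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml / hbase-default.xml
    conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // placeholder ZooKeeper ensemble
    // The client asks ZooKeeper where the relevant region servers are,
    // then talks to those region servers directly for reads and writes.
    HTable table = new HTable(conf, "some_table");
    System.out.println("Located region servers via ZooKeeper and opened the table.");
    table.close();
  }
}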
HDFS:
● HDFS (Hadoop Distributed File System), as the name implies, provides a distributed environment for storage. It is a file system designed to run on commodity hardware.
● It stores each file in multiple blocks, and to maintain fault tolerance the blocks are replicated across the Hadoop cluster.
● HDFS provides a high degree of fault tolerance and runs on cheap commodity hardware. The cluster can be scaled simply by adding nodes, so both storage and processing capacity grow using inexpensive hardware.
● Each block is replicated to three nodes by default, so if any node goes down there is no loss of data and a proper backup/recovery mechanism is in place.
● HDFS interfaces with the HBase components and stores their large amounts of data in a distributed manner.
● The following key terms describe the HBase table schema:
1) Table: a collection of rows.
2) Row: a collection of column families.
3) Column Family: a collection of columns.
4) Column: a collection of key-value pairs.
5) Namespace: a logical grouping of tables.
6) Cell: a {row, column, version} tuple exactly specifies a cell in HBase.
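To make the {row, column, version} addressing concrete, here is a small illustrative sketch assuming the older HBase Java client API; the table name "users" and the column names are made up for illustration.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CellCoordinatesSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "users"); // hypothetical table
    // The cell written below is {row "user1", column "info:email", version = explicit timestamp}.
    Put put = new Put(Bytes.toBytes("user1"));
    put.add(Bytes.toBytes("info"),              // column family
            Bytes.toBytes("email"),             // column qualifier
            System.currentTimeMillis(),         // version (timestamp)
            Bytes.toBytes("user1@example.com"));
    table.put(put);
    table.close();
  }
}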
Column-oriented and row-oriented storage differ in their storage mechanism. Traditional relational models store data in a row-based format, i.e. as rows of data, whereas column-oriented stores keep tables in terms of columns and column families.
Row-oriented storage: data is stored and retrieved one row at a time, so unnecessary data may be read even when only part of a row is required.
Column-oriented storage: data is stored and retrieved column by column, so only the relevant columns need to be read.
HBase Implementations
HBase Read and Write Data Explained
The read and write path between the client and the HFiles proceeds as follows.
Step 1) A client that wants to write data first communicates with the region server, which in turn routes the request to the appropriate region.
Step 2) The region directs the write to the MemStore associated with the column family.
Step 3) Data is first stored in the MemStore, where it is kept sorted by row key, and is later flushed to an HFile. The MemStore lives in the region server's main memory, while HFiles are written to HDFS.
Step 4) A client that wants to read data contacts the regions.
Step 5) The client's request is first served from the MemStore, which holds the most recent in-memory modifications to the store.
Step 6) If the data is not in the MemStore, the HFiles are consulted, and the data is fetched and returned to the client.
The hierarchy of objects in an HBase region, from top to bottom, is: Table → Region → Store (one per column family per region) → MemStore and StoreFiles (HFiles) → Blocks within a StoreFile.
HBase vs. HDFS:
● HBase: both storage and processing can be performed on it.
● HDFS: it serves only as a storage area.
HBase clients:
There are a number of client options for interacting with an HBase cluster.
Java: HBase, like Hadoop, is written in Java. Example 20-1, "Basic table administration and access", shows the Java version of the basic shell operations; an illustrative sketch of it follows the bullets below.
● The class has a main() method and uses the HBaseConfiguration class to create a Configuration object that reads the HBase configuration from the hbase-site.xml and hbase-default.xml files.
● The Configuration object is used to create instances of HBaseAdmin and HTable, which are used for administering the HBase cluster and accessing a specific table, respectively.
● The code creates a table named "test" with a single column family named "data" and asserts that the table was created. It then inserts data into the table using Put objects, retrieves and prints the first row using a Get object, and scans over the table using a Scan object.
● Finally, the code disables and deletes the table. The code makes use of HBase's Bytes utility class to convert identifiers and values to the byte arrays that HBase requires.
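The following is an illustrative reconstruction of that example, assuming the older HBase client API (HBaseConfiguration, HBaseAdmin, HTable) that the description above refers to; newer HBase releases replace these classes with Connection, Admin, and Table.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ExampleClient {
  public static void main(String[] args) throws IOException {
    // Read hbase-site.xml and hbase-default.xml from the classpath.
    Configuration config = HBaseConfiguration.create();

    // Administer the cluster: create a "test" table with one column family, "data".
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("test"));
    htd.addFamily(new HColumnDescriptor("data"));
    admin.createTable(htd);
    if (!admin.tableExists("test")) {           // assert the table was created
      throw new IllegalStateException("table 'test' was not created");
    }

    // Access the table: insert three rows using Put objects.
    HTable table = new HTable(config, "test");
    for (int i = 1; i <= 3; i++) {
      Put put = new Put(Bytes.toBytes("row" + i));
      put.add(Bytes.toBytes("data"), Bytes.toBytes("col1"), Bytes.toBytes("value" + i));
      table.put(put);
    }

    // Retrieve and print the first row using a Get object.
    Get get = new Get(Bytes.toBytes("row1"));
    Result firstRow = table.get(get);
    System.out.println("Get: " + firstRow);

    // Scan over the whole table using a Scan object.
    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        System.out.println("Scan: " + result);
      }
    } finally {
      scanner.close();
    }

    // Finally, disable and delete the table.
    table.close();
    admin.disableTable("test");
    admin.deleteTable("test");
    admin.close();
  }
}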
Praxis:
● Praxis is a project in HBase that aims to provide a common data ingestion
framework.
● Its main role is to import data from many different data sources (such as
relational databases, file systems, web pages, etc.) into HBase for analysis and
processing.
● With Praxis, users can define the mapping between the data source and the
target HBase table with simple configuration, and specify data transformation and
filtering rules.
● Praxis also provides a set of tools and APIs for data import, making it easy to
import unstructured or structured data into HBase. Best practices for using Praxis
include:
1. Before using Praxis, make sure you have a clear understanding of the
structure and contents of the data source and how to map it to HBase
tables.
2. When writing configuration files, try to keep the configuration concise and
easy to understand. You can manipulate data in a data source by using
regular expressions, filters, and so on.
3. In the process of data import, you can use the components and APIs
provided by Praxis to realize incremental import or full import of data to
meet different needs.
4. Pay attention to the format and encoding of the data source to ensure that
the data can be correctly converted and imported into HBase.
5. When importing large-scale data, consider using Praxis' parallel import
function to speed up the import and reasonably configure the number of
concurrent tasks and thread pool size for import.
Pig
● Hadoop Pig is basically a high-level programming language that is helpful for the analysis of huge datasets.
● Pig was developed by Yahoo! and is generally used with Hadoop to perform a lot of data administration operations.
The main reason why programmers have started using Hadoop Pig is that it converts their scripts into a series of MapReduce tasks, making their job easier. The architecture of Hadoop Pig consists of the following components:
1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the parser. The parser is responsible for checking the syntax of the script, along with other miscellaneous checks. The parser outputs a Directed Acyclic Graph (DAG) in which the Pig Latin statements and logical operators are represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for
DAG is passed to a logical optimizer. The optimizer is responsible for
carrying out the logical optimizations.
3. Compiler: The role of the compiler comes in when the output from the
optimizer is received. The compiler compiles the logical plan sent by the
optimizer. The logical plan is then converted into a series of MapReduce
tasks or jobs.
4. Execution Engine: After the logical plan is converted to MapReduce jobs,
these jobs are sent to Hadoop in a properly sorted order, and these jobs are
executed on Hadoop for yielding the desired result.
Features of Pig:
1. In-built operators: Apache Pig provides a very good set of operators for performing several data operations like sort, join, filter, etc.
2. Ease of programming: Since Pig Latin has similarities with SQL, it is very
easy to write a Pig script.
3. Automatic optimization: The tasks in Apache Pig are automatically
optimized. This makes the programmers concentrate only on the semantics
of the language.
4. Handles all kinds of data: Apache Pig can analyze both structured and
unstructured data and store the results in HDFS.
Grunt
● Grunt is Pig's interactive shell. It provides line-editing facilities similar to those found in GNU Readline, which is used in the bash shell and many other command-line applications.
● It offers features such as command history, line recall, and a completion
mechanism.
● For instance, the Ctrl-E key combination moves the cursor to the end of the line,
and Ctrl-P or Ctrl-N (or the up or down cursor keys) can be used to recall lines in
the history buffer.
● The Tab key triggers Grunt's completion mechanism, which attempts to complete
Pig Latin keywords and functions.
● Customizing the completion tokens is also possible by creating a file named
autocomplete.
● When the Grunt session is finished, it can be exited with the quit command or the
equivalent shortcut \q.
Pig Latin:
● Pig Latin is a high-level scripting language used in Apache Pig, which is a
platform for analyzing large datasets that runs on Apache Hadoop.
● It provides a high level of abstraction over MapReduce and allows users to express their data analysis programs in a textual language called Pig Latin.
● This language is designed to make MapReduce programming high level, similar
to that of SQL for relational database management systems.
● It can be extended using user-defined functions (UDFs) written in Java, Python,
JavaScript, Ruby, or Groovy and can execute Hadoop jobs in MapReduce,
Apache Tez, or Apache Spark.
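As an illustrative sketch (not from the source), the Java program below uses Pig's embedded PigServer API to run a small Pig Latin word-count script; the input path 'input.txt' and output path 'wordcount_out' are placeholders, and local execution mode is assumed.

import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class WordCountPig {
  public static void main(String[] args) throws IOException {
    // Run Pig Latin in local mode; ExecType.MAPREDUCE would target a Hadoop cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Each registerQuery() call adds one Pig Latin statement to the logical plan.
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;");
    // store() triggers compilation of the plan into MapReduce jobs and runs them.
    pig.store("counts", "wordcount_out");
  }
}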
Hive
● Apache Hive is a data warehouse software project that is built on top of the
Hadoop ecosystem.
● It provides an SQL-like interface to query and analyze large datasets stored in
Hadoop’s distributed file system (HDFS) or other compatible storage systems.
● Hive uses a language called HiveQL, which is similar to SQL, to allow users to
express data queries, transformations, and analyses in a familiar syntax.
● HiveQL statements are compiled into MapReduce jobs, which are then executed
on the Hadoop cluster to process the data.
● Hive includes many features that make it a useful tool for big data analysis,
including support for partitioning, indexing, and user-defined functions (UDFs).
● It also provides a number of optimization techniques to improve query
performance, such as predicate pushdown, column pruning, and query
parallelization.
● Hive can be used for a variety of data processing tasks, such as data
warehousing, ETL (extract, transform, load) pipelines, and ad-hoc data analysis.
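As a hedged illustration (not part of the source), the sketch below submits a HiveQL query to a HiveServer2 instance over JDBC; the host, port, database, and the "records" table with its year/temperature columns are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; host/port/database below are placeholders.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {
      // A hypothetical ad-hoc aggregation; Hive compiles the query into MapReduce jobs.
      ResultSet rs = stmt.executeQuery(
          "SELECT year, MAX(temperature) FROM records GROUP BY year");
      while (rs.next()) {
        System.out.println(rs.getInt(1) + "\t" + rs.getInt(2));
      }
    }
  }
}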
Architecture of Hive:
● Hive Services: Hive services perform client interactions with Hive. For
example, if a client wants to perform a query, it must talk with Hive services.
● Hive Storage and Computing: Hive services such as the file system, job client, and metastore then communicate with Hive storage, storing things like table metadata and query results.
● The Metastore: The metastore is the central repository of Hive metadata. The
metastore is divided into two pieces: a service and the backing store for the
data. By default, the metastore service runs in the same JVM as the Hive
service and contains an embedded Derby database instance backed by the
local disk. This is called the embedded metastore configuration. Using an
embedded metastore is a simple way to get started with Hive; however, only
one embedded Derby database can access the database files on disk at any
one time, which means you can have only one Hive session open at a time
that accesses the same metastore. Trying to start a second session produces
an error when it attempts to open a connection to the metastore.
Features of Hive
Limitations of Hive
Hive vs. Pig:
● Hive: works on the server side of the HDFS cluster.
● Pig: works on the client side of the HDFS cluster.
Primitive types:
● BOOLEAN type for storing true and false values.
● There are four signed integral types: TINYINT, SMALLINT, INT, and BIGINT,
which are equivalent to Java’s byte, short, int, and long primitive types,
respectively (they are 1-byte, 2-byte, 4-byte, and 8-byte signed integers).
● Hive’s floating-point types, FLOAT and DOUBLE, correspond to Java’s float and
double, which are 32-bit and 64-bit floating-point numbers.
● The DECIMAL data type is used to represent arbitrary-precision decimals.
DECIMAL values are stored as unscaled integers.
● There are three Hive data types for storing text: STRING, VARCHAR, and CHAR. STRING is a variable-length character string with no declared maximum length. (The theoretical maximum size of a STRING that may be stored is 2 GB.) VARCHAR takes a declared maximum length, and CHAR is fixed-length, padded with trailing spaces where necessary.
Complex types:
● Hive has four complex types: ARRAY, MAP, STRUCT, and UNION. ARRAY and MAP are like their namesakes in Java: an ARRAY is an ordered collection of fields of the same type, and a MAP is a set of key-value pairs.
● STRUCT is a record type that encapsulates a set of named fields.
● A UNION specifies a choice of data types; values must match exactly one of
these types.
● Complex types permit an arbitrary level of nesting. Complex type declarations
must specify the type of the fields in the collection, using an angled bracket
notation.
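To illustrate the angle-bracket notation, here is a hedged sketch (the table and field names are made up) that issues a CREATE TABLE with complex types over the same JDBC approach shown earlier; note that Hive's DDL keyword for the UNION type is UNIONTYPE.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ComplexTypesSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {
      // Angle brackets declare the element/field types of each complex column.
      stmt.execute("CREATE TABLE complex_demo ("
          + " c1 ARRAY<INT>,"
          + " c2 MAP<STRING, INT>,"
          + " c3 STRUCT<a:INT, b:STRING, c:DOUBLE>,"
          + " c4 UNIONTYPE<STRING, INT>"
          + ")");
    }
  }
}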
Hive file formats:
● Text File
● Sequence File
● RC File
● AVRO File
● ORC File
● Parquet File
● Hive Text file format is a default storage format. You can use the text format to
interchange the data with other client applications. The text file format is very
common in most of the applications. Data is stored in lines, with each line being a
record. Each line is terminated by a newline character (\n).
● The text format is a simple plain file format. You can use compression (e.g. BZIP2) on a text file to reduce the storage space.
● Create a TEXT file by adding the storage option as ‘STORED AS TEXTFILE’ at
the end of a Hive CREATE TABLE command.
● Example:
Create table textfile_table
(column_specs)
stored as textfile;
● Sequence files are Hadoop flat files that store values as binary key-value pairs. Sequence files are in binary format and are splittable. One main advantage of using sequence files is that two or more smaller files can be merged into one file.
● Create a sequence file by adding the storage option as ‘STORED AS
SEQUENCEFILE’ at the end of a Hive CREATE TABLE command.
● Example:
Create table sequencefile_table
(column_specs)
stored as sequencefile;
● RCFile stands for Record Columnar File. It is another Hive file format, and it offers high row-level compression rates. If you have a requirement to process multiple rows at a time, you can use the RCFile format.
● The RCFile is very similar to the sequence file format. This file format also stores the data as key-value pairs.
● Create RCFile by specifying ‘STORED AS RCFILE’ option at the end of a
CREATE TABLE Command
● Example:
Create table RCfile_table
(column_specs)
stored as rcfile;
● AVRO is an open source project that provides data serialization and data
exchange services for Hadoop. You can exchange data between the Hadoop
ecosystem and programs written in any programming language. Avro is one of the popular file formats in Hadoop-based big data applications.
● Create AVRO file by specifying ‘STORED AS AVRO’ option at the end of a
CREATE TABLE Command.
● Example:
Create table avro_table
(column_specs)
stored as avro;
● The ORC file stands for Optimized Row Columnar file format. The ORC file format provides a highly efficient way to store data in a Hive table. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data from large tables.
● Create an ORC file by specifying the ‘STORED AS ORC’ option at the end of a CREATE TABLE command.
● Example:
Create table orc_table
(column_specs)
stored as orc;
● Parquet is a column-oriented binary file format. Parquet is highly efficient for large-scale queries and is especially good for queries that scan particular columns within a table. Parquet tables use Snappy or gzip compression; currently Snappy is the default.
● Create a Parquet file by specifying ‘STORED AS PARQUET’ option at the end of
a CREATE TABLE Command.
● Example:
Create table parquet_table
(column_specs)
stored as parquet;
HiveQL:
● HiveQL is the Hive query language. Like all SQL dialects in widespread use, it doesn’t fully conform to any particular revision of the ANSI SQL standard.
● It is perhaps closest to MySQL’s dialect, but with significant differences. Hive
offers no support for row-level inserts, updates, and deletes.
● Hive doesn’t support transactions. Hive adds extensions to provide better
performance in the context of Hadoop and to integrate with custom extensions
and even external programs.
Databases in Hive:
● Hive DML (Data Manipulation Language) commands are used to insert, update, retrieve, and delete data in a Hive table once the table and database schema have been defined using Hive DDL commands.
● The various Hive DML commands are:
1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT
7. IMPORT
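As a hedged sketch (the table name and HDFS path are hypothetical), the following continues the JDBC approach shown earlier and exercises two of these DML commands, LOAD and SELECT.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDmlSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {
      // LOAD: move a file already in HDFS into the (hypothetical) table's storage location.
      stmt.execute("LOAD DATA INPATH '/user/hive/staging/records.txt' INTO TABLE records");
      // SELECT: read the data back.
      ResultSet rs = stmt.executeQuery("SELECT * FROM records LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}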