BDP U4
Pig represents Big Data as data flows. Pig is a high-level platform, or tool, used to process large datasets. It provides a high level of abstraction over MapReduce and a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.
To process data stored in HDFS, programmers write scripts using the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into specific Map and Reduce tasks, but these are not visible to the programmers, which is what provides the high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
Note: The Pig Engine has two types of execution environment: a local execution environment in a single JVM (used when the dataset is small) and a distributed execution environment on a Hadoop cluster.
Need for Pig: One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming task. Apache Pig reduces development time by using a multi-query approach, which also shortens the code. Pig is also beneficial for programmers who do not come from a Java background: 200 lines of Java code can often be written in only 10 lines of Pig Latin, and programmers who already know SQL need less effort to learn it.
Evolution of Pig: Apache Pig was developed by Yahoo's researchers in 2006. At that time, the main idea behind Pig was to execute MapReduce jobs on extremely large datasets more easily. In 2007 it moved to the Apache Software Foundation (ASF), which made it an open-source project. The first version (0.1) of Pig came out in 2008; the latest version, 0.17, was released in 2017.
Features of Pig:
Apache Pig provides a rich set of operators, such as filter, join, and sort, for performing several operations.
It is easy to learn, read, and write; for SQL programmers especially, Apache Pig is a boon.
Apache Pig is extensible, so you can write your own user-defined functions and processing logic.
Pig can handle the analysis of both structured and unstructured data.
Difference between Pig and MapReduce in terms of typical use: MapReduce is used for collecting large amounts of data, in the form of search logs and web crawls, while Pig is used where analytical insights are needed using sampling.
Types of Data Models in Apache Pig: Pig consists of four types of data models, as follows:
Atom: an atomic data value, stored as a string; it can be used both as a number and as a string.
Tuple: an ordered set of fields.
Bag: a collection of tuples.
Map: a set of key-value pairs.
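As a quick illustration (the values are made up), these data models are written in Pig Latin as follows:
atom:  'raju'  or  30
tuple: ('raju', 30)
bag:   { ('raju', 30), ('mohan', 45) }
map:   [ 'name'#'raju', 'age'#30 ]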
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS)
using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a
MapReduce job is invoked in the back-end to perform a particular operation on the data that exists
in the HDFS.
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump
operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in
a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java, and using them in our
script.
You can invoke the Grunt shell in the desired mode (local/MapReduce) using the -x option as shown below.
Local mode command: $ ./pig -x local
MapReduce mode command: $ ./pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin
statements in it.
You can write an entire Pig Latin script in a file and execute it by passing the script file to the pig command (together with the -x option). Let us suppose we have a Pig script in a file named sample_script.pig as shown below.
Sample_script.pig
PigStorage(',') as (id:int,name:chararray,city:chararray);
Dump student;
Now, you can execute the script in the above file as shown below.
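For reference, a complete script of this form and a batch-mode invocation might look like the following; the student.txt path and the choice of execution mode are illustrative assumptions.
-- sample_script.pig (input path is illustrative)
student = LOAD 'student.txt' USING PigStorage(',') as (id:int, name:chararray, city:chararray);
Dump student;

$ pig -x mapreduce sample_script.pig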
With the range of technologies in the big data world, there is often confusion about which one to choose. Big data requires handling huge datasets efficiently, and options for managing and querying that data are also needed. When it comes to managing databases, SQL (Structured Query Language) is the old friend, well tried and tested by everyone for data analysis. But the complicated world of Hadoop needs higher-level data analysis tools.
Though old SQL still is the favorite of many and is popularly used in numerous organizations, Apache
Hive and Pig have become the buzz terms in the big data world today. These tools provide easy
alternatives to carry out the complex programming of MapReduce helping data developers and
analysts.
The organizations looking for open source querying and programming to tame Big data have
adopted Hive and Pig widely. At the same time, it is vital to pick and choose the right platform and
tool for managing your data well. Hence it is essential to understand the differences between Hive vs
Pig vs SQL and choose the best suitable option for the project.
Apache Pig
Apache Pig is another platform with a high-level language for expressing programs that analyse huge datasets. It is an open-source project that provides a simple language, Pig Latin, for manipulating and querying data.
It is quite easy to learn and use Pig if you are aware of SQL. It provides nested data types (tuples, maps, bags, etc.) and supports data operations such as joins, filters, and ordering. Tech giants like Google, Yahoo, and Microsoft use Pig for the analysis of enormous datasets arising out of search logs, web crawls, and click streams.
SQL
Structured Query Language is the traditional database management tool used by programmers for
decades. It is a declarative language that manages the data stored in relational database systems.
SQL is a much better option than Excel, as it is a fast tool for data processing and analysis.
All three technologies Hive, Pig, and SQL are quite popular in the industry for data analysis and
management, but the bigger question is knowing the appropriate use of these tools. We need to understand which platform suits our needs better and when to use what. Let us look at the scenarios in which each of these three tools is appropriate, in the context of Hive vs Pig vs SQL.
Facebook widely uses Apache Hive for analytical purposes. Furthermore, they usually promote the Hive language due to its extensive feature list and similarities with SQL. Here are some of the scenarios in which Apache Hive is ideal to use:
For extensibility: Apache Hive contains a range of user APIs that help in building the custom
behavior for the query engine.
For someone familiar with SQL concepts: If you are familiar with SQL, Hive will be very easy
to use as you will see many similarities between the two. Hive uses the clauses like select,
where, order by, group by, etc. similar to SQL.
To work on Structured Data: In case of structured data, Hive is widely adopted everywhere.
To analyze historical data: Apache Hive is a great tool for analysis and querying of the data
which is historical and collected over a period.
Apache Hive is thus a Big Data technology widely used for Big Data analytics.
Apache Pig, developed by Yahoo Research in 2006, is famous for its extensibility and scope for optimization. The language uses a multi-query approach that reduces the time spent scanning data. It usually runs on the client side of Hadoop clusters. It is also quite easy to use if you are familiar with the SQL ecosystem. You can use Apache Pig in the following scenarios:
For fast processing: Apache Pig is faster than Hive because it uses a multi-query approach.
Apache Pig is famous worldwide for its speed.
When you don't want to work with a schema: with Apache Pig, there is no need to create a schema for data-loading work.
For SQL-like functions: it has many functions related to SQL, along with additional ones such as COGROUP.
SQL is a general-purpose database management language used around the globe. It has been updated to meet user expectations for decades. It is declarative and hence focuses explicitly on 'what' is needed. It is popularly used for transactional as well as analytical queries. When the requirements are not too demanding, SQL works as an excellent tool. Here are a few scenarios:
For better performance: SQL is famous for its ability to pull data quickly and frequently. It supports OLTP (Online Transaction Processing) applications and performs well for them, whereas Hive is slow when it comes to online transactional needs.
When the datasets are small: SQL works well with small datasets and performs much better
for smaller amounts of data. It also has many ways for the optimisation of data.
For frequent data manipulation: If your requirement needs frequent modification in records
or you need to update a large number of records frequently, SQL can perform these
activities well. SQL also provides an entirely interactive experience to the user.
Does the comparison of Hive vs Pig vs SQL decide a winner?
We have seen that there are significant differences among the three: Hive, Pig, and SQL. Each of them performs specific functions and meets unique business requirements. All three also require proper infrastructure and skills for their efficient use when working on datasets.
Nature of language: Pig uses a procedural language called Pig Latin; Hive uses a declarative language called HiveQL; SQL is itself a declarative language.
Definition: Pig is an open-source, high-level data flow language with a multi-query approach; Hive is an open-source tool built with an analytical focus, used for analytical queries; SQL is a general-purpose database language for analytical and transactional queries.
Suitable for: Pig suits complex as well as nested data structures; Hive is ideal for batch processing, i.e. OLAP (Online Analytical Processing); SQL is ideal for more straightforward business demands requiring fast data analysis.
Operational for: Pig handles semi-structured and structured data; Hive is used only for structured data; SQL is a domain-specific language for a relational database management system.
Compatibility: Pig works on top of MapReduce; Hive works on top of MapReduce; SQL is not compatible with MapReduce programming.
Use of schema: Pig has no concept of a schema to store data; Hive supports a schema for inserting data in tables; SQL requires strict use of schemas when storing data.
On one side, Apache Pig relies on scripts and requires special knowledge, while Apache Hive is the natural answer for developers working on databases. Furthermore, Apache Hive has better access options and features than Apache Pig. However, Apache Pig works faster than Apache Hive.
On the other hand, SQL being an old tool with powerful abilities is still an answer to our many needs.
Looking at the differences, we can see that they can meet specific needs of our projects differently.
Both Apache Hive and Apache Pig are used popularly in the management and analysis of big data,
but SQL serves as the traditional database management option for smaller datasets. Though SQL is old, the advanced tools still cannot replace it. There is a slight tendency among big businesses looking for object-oriented programming to adopt Apache Hive and Apache Pig over SQL; however, smaller projects will still need SQL.
Bottom Line
Despite their extensive and advanced features, Pig and Hive are still growing and developing to meet challenging requirements. Hence, when we compare Hive vs Pig vs SQL, there is a clear indication that Hive and Pig are winning the Big Data game, but SQL is still here to stay.
Note − In some portions of this chapter, commands such as Load and Store are used. Refer to the respective chapters for detailed information on them.
Shell Commands
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Prior to that, we can invoke
any shell commands using sh and fs.
sh Command
Using the sh command, we can invoke any shell command from the Grunt shell. However, using sh we cannot execute commands that are a part of the shell environment itself (e.g., cd).
Syntax
grunt> sh shell_command parameters
Example
We can invoke the ls command of Linux shell from the Grunt shell using the sh option as shown
below. In this example, it lists out the files in the /pig/bin/ directory.
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Using the fs command, we can invoke any FsShell commands from the Grunt shell.
Syntax
grunt> fs <file system command> parameters
Example
We can invoke the ls command of HDFS from the Grunt shell using fs command. In the following
example, it lists the files in the HDFS root directory.
grunt> fs -ls
Found 3 items
In the same way, we can invoke all the other file system shell commands from the Grunt shell using
the fs command.
Utility Commands
The Grunt shell provides a set of utility commands. These include utility commands such as clear,
help, history, quit, and set; and commands such as exec, kill, and run to control Pig from the Grunt
shell. Given below is the description of the utility commands provided by the Grunt shell.
clear Command
Syntax
You can clear the screen of the grunt shell using the clear command as shown below.
grunt> clear
help Command
Usage
You can get a list of Pig commands using the help command as shown below.
grunt> help
Commands: <pig latin statement>; - See the PigLatin manual for details:
http://hadoop.apache.org/pig
http://hadoop.apache.org/common/docs/current/hdfs_shell.html
[-param <param_name>=<param_value>]
-out - Store the output into directory rather than print to stdout.
-brief - Don't expand nested plans (presenting a smaller graph for overview).
dump <alias> - Compute the alias and writes the results to stdout.
kill <job_id> - Kill the hadoop job specified by the hadoop job id.
set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high.
history Command
This command displays a list of statements executed/used so far since the Grunt shell was invoked.
Usage
Assume we have executed three statements since opening the Grunt shell.
grunt> history
set Command
Usage
Using this command, you can set values for the following keys.
default_parallel: you can set the number of reducers for a job by passing any whole number as a value to this key.
debug: you can turn the debugging feature on or off by passing on/off to this key.
job.name: you can set the job name for the required job by passing a string value to this key.
job.priority: you can set the job priority by passing one of the following values to this key: very_low, low, normal, high, very_high.
stream.skippath: for streaming, you can set the path from which the data is not to be transferred, by passing the desired path in the form of a string to this key.
quit Command
You can quit from the Grunt shell using this command.
Usage
grunt> quit
Let us now take a look at the commands using which you can control Apache Pig from the Grunt
shell.
exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
Example
Student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi
Sample_script.pig
as (id:int,name:chararray,city:chararray);
Dump student;
Now, let us execute the above script from the Grunt shell using the exec command as shown below.
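Assuming Student.txt and sample_script.pig have been placed in an HDFS directory such as /pig_data/ (the path is illustrative), the invocation is:
grunt> exec /pig_data/sample_script.pig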
Output
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
kill Command
You can kill a job from the Grunt shell using this command.
Syntax
Example
Suppose there is a running Pig job having id Id_0055, you can kill it from the Grunt shell using
the kill command, as shown below.
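A sketch of the invocation, using the job id mentioned above:
grunt> kill Id_0055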
run Command
You can run a Pig script from the Grunt shell using the run command
Syntax
Example
Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the
following content.
Student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi
And, assume we have a script file named sample_script.pig in the local filesystem with the following
content.
Sample_script.pig
PigStorage(',') as (id:int,name:chararray,city:chararray);
Now, let us run the above script from the Grunt shell using the run command as shown below.
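Assuming the script sits in the current local directory (an assumption for illustration), the invocation is:
grunt> run sample_script.pig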
You can see the output of the script using the Dump operator as shown below.
grunt> Dump;
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
Pig Latin
Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.
Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as input and produces another relation as output.
Pig Latin supports simple data types such as int, long, float, double, chararray, and bytearray, as well as the complex types tuple, bag, and map.
For writing UDFs, complete support is provided in Java, and limited support is provided in the remaining languages. Using Java, you can write UDFs involving all parts of the processing, such as data load/store, column transformation, and aggregation. Since Apache Pig itself is written in Java, UDFs written in Java work more efficiently than those written in other languages.
In Apache Pig, we also have a Java repository of UDFs named Piggybank. Using Piggybank, we can access Java UDFs written by other users and contribute our own.
While writing UDFs using Java, we can create and use the following three types of functions −
Filter Functions − Filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value.
Eval Functions − Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result.
Algebraic Functions − Algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.
To write a UDF using Java, we have to include the jar file pig-0.15.0.jar in the build. In this section, we discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven on your system.
Copy the following content in the pom.xml. This file contains the Maven dependencies for
Apache Pig and Hadoop-core jar files.
<project xmlns = "http://maven.apache.org/POM/4.0.0"
   xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>Pig_Udf</groupId>
<artifactId>Pig_Udf</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
<sourceDirectory>src</sourceDirectory>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.3</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.pig</groupId>
<artifactId>pig</artifactId>
<version>0.15.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.2</version>
</dependency>
</dependencies>
</project>
Save the file and refresh it. In the Maven Dependencies section, you can find the
downloaded jar files.
Create a new class file with the name Sample_Eval and copy the following content into it.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that converts the first field of each input tuple to upper case.
public class Sample_Eval extends EvalFunc<String> {
   public String exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0)
         return null;
      String str = (String) input.get(0);
      return str.toUpperCase();
   }
}
While writing UDFs, it is mandatory to inherit the EvalFunc class and provide an implementation of the exec() function; the code required for the UDF is written within this function. In the above example, we have written the code to convert the contents of the given column to upper case.
After compiling the class without errors, right-click on the Sample_Eval.java file. From the menu that appears, select Export.
In the Export window, click on JAR file.
Proceed further by clicking Next> button. You will get another window where you need to
enter the path in the local file system, where you need to store the jar file.
Finally click the Finish button. In the specified folder, a Jar file sample_udf.jar is created. This
jar file contains the UDF written in Java.
Using the UDF
After writing the UDF and generating the Jar file, follow the steps given below −
After writing the UDF (in Java), we have to register the jar file that contains the UDF using the Register operator. By registering the jar file, users tell Apache Pig the location of the UDF.
Syntax
REGISTER path;
Example
Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$cd PIG_HOME/bin
$./pig –x local
REGISTER '/$PIG_HOME/sample_udf.jar'
Syntax
Example
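A sketch of defining an alias for the UDF; the alias and the function name are assumptions based on the class created above:
grunt> DEFINE sample_eval Sample_Eval();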
After defining the alias you can use the UDF same as the built-in functions. Suppose there is a file
named emp_data in the HDFS /Pig_Data/ directory with the following content.
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
007,Robert,22,newyork
008,Syam,23,Kolkata
009,Mary,25,Tokyo
010,Saran,25,London
011,Stacy,25,Bhuwaneshwar
012,Kelly,22,Chennai
And assume we have loaded this file into Pig as shown below.
Let us now convert the names of the employees in to upper case using the UDF sample_eval.
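A sketch of the load and the conversion, using the HDFS path mentioned above (the field names are assumptions based on the sample data):
grunt> emp_data = LOAD '/Pig_Data/emp_data' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
grunt> upper_case = FOREACH emp_data GENERATE sample_eval(name);
grunt> Dump upper_case;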
(ROBIN)
(BOB)
(MAYA)
(SARA)
(DAVID)
(MAGGY)
(ROBERT)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)
The Apache Pig operators form a high-level procedural language for querying large data sets using Hadoop and the MapReduce platform.
A Pig Latin statement is an operator that takes a relation as input and produces another relation as
output.
They allow you to transform the data by sorting, grouping, joining, projecting, and filtering.
Relational Operators :
Relational operators are the main tools Pig Latin provides to operate on the data.
LOAD: The LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.
FOREACH: This operator generates data transformations based on columns of data. It is used to add or remove fields from a relation.
JOIN: The JOIN operator is used to perform an inner, equi-join of two or more relations based on common field values.
ORDER BY: ORDER BY is used to sort a relation on one or more fields in either ascending or descending order, using the ASC and DESC keywords.
GROUP: The GROUP operator groups together the tuples that have the same group key (key field).
COGROUP: COGROUP is the same as the GROUP operator. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved. A short script combining several of these operators is sketched below.
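A minimal sketch, assuming a comma-separated student file with the fields shown (the path and field names are illustrative):
students = LOAD 'student.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
filtered = FILTER students BY id > 0;
grouped  = GROUP filtered BY city;
counts   = FOREACH grouped GENERATE group AS city, COUNT(filtered) AS total;
ordered  = ORDER counts BY total DESC;
joined   = JOIN students BY city, counts BY city;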
Diagnostic Operator :
The LOAD statement simply loads the data into the specified relation in Apache Pig. To verify the execution of the LOAD statement, you have to use the diagnostic operators.
DUMP: The DUMP operator is used to run Pig Latin statements and display the results on the screen.
DESCRIBE: Use the DESCRIBE operator to review the schema of a particular relation. The DESCRIBE
operator is best used for debugging a script.
ILLUSTRATE: The ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin statements; it is very useful when debugging a script.
EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and MapReduce execution plans of a relation. Typical usage of the diagnostic operators is sketched below.
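For a relation such as the students relation loaded above (the name is an assumption), the diagnostic operators are invoked as follows:
grunt> DUMP students;
grunt> DESCRIBE students;
grunt> ILLUSTRATE students;
grunt> EXPLAIN students;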
What is HIVE
Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted to MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach of writing complex MapReduce
programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
User Defined Functions (UDF).
Features of Hive
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark
jobs.
o It allows different storage types such as plain text, RCFile, and HBase.
o It supports user-defined functions (UDFs) through which users can plug in their own functionality.
Limitations of Hive
Hive is not designed for online transaction processing (OLTP); it is best suited for batch-oriented analytical queries, its queries typically have high latency, and record-level updates are restricted compared to a traditional RDBMS (see the comparison later in this unit).
HIVE Shell
$HIVE_HOME/bin/hive is a shell utility which can be used to run Hive queries in either interactive or
batch mode. HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline, which is a JDBC
client based on SQLLine.
Hive Command Line Options
To get help, run "hive -H" or "hive --help". Usage (as of Hive 0.9.0):
usage: hive
Option Explanation
-d, --define <key=value>: variable substitution to apply to Hive commands, e.g. -d A=B or --define A=B
--hivevar <key=value>: variable substitution to apply to Hive commands, e.g. --hivevar A=B
Examples
Example of running a Hive script file non-interactively, from the local file system or from HDFS:
$HIVE_HOME/bin/hive -f /home/my/hive-script.sql
$HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql
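Example of dumping data out from a query into a file using silent mode; the query and output file are illustrative, and -S suppresses the log messages:
$HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt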
Hive CLI is a legacy tool which had two main use cases. The first is that it served as a thick client for
SQL on Hadoop and the second is that it served as a command line tool for Hive Server (the original
Hive server, now often referred to as “HiveServer1”). Hive Server has been deprecated and removed
from the Hive code base as of Hive 1.0.0 and replaced with HiveServer2, so the second use case no
longer applies. For the first use case, Beeline provides or is supposed to provide equal functionality,
yet is implemented differently from Hive CLI.
Command Description
reset: Resets the configuration to the default values (as of Hive 0.10).
set: Prints a list of configuration variables that are overridden by the user or Hive.
add FILE[S] <filepath> <filepath>*, add JAR[S] <filepath> <filepath>*, add ARCHIVE[S] <filepath> <filepath>*: Adds one or more files, jars, or archives to the list of resources in the distributed cache.
add FILE[S] <ivyurl> <ivyurl>*, add JAR[S] <ivyurl> <ivyurl>*, add ARCHIVE[S] <ivyurl> <ivyurl>*: As of Hive 1.2.0, adds one or more files, jars, or archives to the list of resources in the distributed cache using an Ivy URL of the form ivy://group:module:version?query_string.
list FILE[S], list JAR[S], list ARCHIVE[S]: Lists the resources already added to the distributed cache.
list FILE[S] <filepath>*, list JAR[S] <filepath>*, list ARCHIVE[S] <filepath>*: Checks whether the given resources are already added to the distributed cache or not.
delete FILE[S] <ivyurl> <ivyurl>*, delete JAR[S] <ivyurl> <ivyurl>*, delete ARCHIVE[S] <ivyurl> <ivyurl>*: As of Hive 1.2.0, removes the resource(s) which were added using the <ivyurl> from the distributed cache.
dfs <dfs command>: Executes a dfs command from the Hive shell.
<query string>: Executes a Hive query and prints results to standard output.
Ideally, Hive CLI should be deprecated as the Hive community has long recommended using the
Beeline plus HiveServer2 configuration; however, because of the wide use of Hive CLI, we instead
are replacing Hive CLI’s implementation with a new Hive CLI on top of Beeline plus embedded
HiveServer2 so that the Hive community only needs to maintain a single code path. In this way, the
new Hive CLI is just an alias to Beeline at both the shell script level and the high code level. The goal
is that no or minimal changes are required from existing user scripts using Hive CLI.
The CLI when invoked without the -i option will attempt to load $HIVE_HOME/bin/.hiverc and
$HOME/.hiverc as initialization files.
When $HIVE_HOME/bin/hive is run with the -e or -f option, it executes SQL commands in batch
mode.
When $HIVE_HOME/bin/hive is run without either the -e or -f option, it enters interactive shell
mode. Use “;” (semicolon) to terminate commands. Comments in scripts can be specified using the
“–” prefix.
Example
hive> set;
hive> !ls;
HiveServer2 supports a new command shell Beeline that works with HiveServer2. It’s a JDBC client
that is based on the SQLLine CLI. The Beeline shell works in both embedded mode as well as remote
mode. In the embedded mode, it runs an embedded Hive (similar to Hive CLI) whereas remote mode
is for connecting to a separate HiveServer2 process over Thrift. Starting in Hive 0.14, when Beeline is
used with HiveServer2, it also prints the log messages from HiveServer2 for queries it executes to
STDERR.
Option Description
-u <database URL>: The JDBC URL to connect to. Usage: beeline -u db_URL
-e <query>: Query that should be executed. Double or single quotes enclose the query string. This option can be specified multiple times. Usage: beeline -e "query_string"
--hiveconf property=value: Use the given value for the given configuration property. Properties that are listed in hive.conf.restricted.list cannot be reset with hiveconf. Usage: beeline --hiveconf prop1=value1
--hivevar name=value: Hive variable name and value. This is a Hive-specific setting in which variables can be set at the session level and referenced in Hive commands or queries. Usage: beeline --hivevar var1=value1
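Putting these options together, a connection to a local HiveServer2 instance might look like the following; the host, port (10000 is the usual default), and query are assumptions:
$ beeline -u jdbc:hive2://localhost:10000 -e "show databases;"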
Hive Services
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive
queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a
web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata about each column and its type, the serializers and deserializers which are used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of MapReduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or JDBC to
HiveServer. The HiveServer instance reads/writes data to HMS. By default, redundant HMS instances operate in active/active mode. The metastore data resides in a backend RDBMS dedicated to HMS, and you must configure all HMS instances to use the same backend database. A separate RDBMS supports the security service, Ranger for example. All connections are routed to a single RDBMS service at any given time.
HMS talks to the NameNode over thrift and functions as a client to HDFS.
HMS connects directly to Ranger and the NameNode (HDFS), and so does
HiveServer, but this is not shown in the diagram for simplicity. One or more HMS
instances on the backend can talk to other services, such as Ranger.
Hive vs. a traditional database:
Hive is very easily scalable at low cost; a traditional database is not as scalable, and scaling up is costly.
Hive is based on the Hadoop notion of write once, read many times; in a traditional database we can read and write many times.
Record-level updates are not possible in Hive; in a traditional database, record-level updates, insertions and deletes, transactions and indexes are possible.
Hive Query Language (HiveQL) is a query language in Apache Hive for processing and analyzing
structured data. It separates users from the complexity of Map Reduce programming. It reuses
common concepts from relational databases, such as tables, rows, columns, and schema, to ease
learning. Hive provides a CLI for Hive query writing using Hive Query Language (HiveQL).
Most interactions tend to take place over a command line interface (CLI). Generally, HiveQL syntax is
similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats
which are: TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
Hive uses the Derby database for single-user metadata storage; for multi-user or shared metadata, Hive uses MySQL.
Hive provides Built-in operators for Data operations to be implemented on the tables present inside
Hive warehouse.
These operators are used for mathematical and logical operations on operands, and they return a specific value according to the logic applied.
Relational Operators
Arithmetic Operators
Logical Operators
Relational operators include equals, not equals, less than, greater than, and so on. The following table gives details of the relational operators and their usage in HiveQL:
Operator: Description (Operands)
X = Y: TRUE if expression X is equivalent to expression Y, otherwise FALSE. (All primitive types)
X != Y: TRUE if expression X is not equivalent to expression Y, otherwise FALSE. (All primitive types)
X < Y: TRUE if expression X is less than expression Y, otherwise FALSE. (All primitive types)
X <= Y: TRUE if expression X is less than or equal to expression Y, otherwise FALSE. (All primitive types)
X > Y: TRUE if expression X is greater than expression Y, otherwise FALSE. (All primitive types)
X >= Y: TRUE if expression X is greater than or equal to expression Y, otherwise FALSE. (All primitive types)
X IS NULL: TRUE if expression X evaluates to NULL, otherwise FALSE. (All types)
X IS NOT NULL: FALSE if expression X evaluates to NULL, otherwise TRUE. (All types)
X LIKE Y: TRUE if string pattern X matches Y, otherwise FALSE. (Strings only)
X RLIKE Y: NULL if X or Y is NULL; TRUE if any substring of X matches the Java regular expression Y; otherwise FALSE. (Strings only)
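As a quick illustration on the employees_guru table used later in this unit (the column names are as defined there; the salary threshold is an assumption), relational operators typically appear in WHERE clauses:
SELECT Name, Salary FROM employees_guru WHERE Salary >= 25000 AND Department != 'ADMIN';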
Sample example: 2 + 3 gives the result 5. In this example, '+' is the operator, and 2 and 3 are the operands; the return value is 5.
The following Table will give us details about Arithmetic operators in Hive Query Language:
X+Y It will return the output of adding X and Y value. It takes all number types
X–Y It will return the output of subtracting Y from X value. It takes all number types
X*Y It will return the output of multiplying X and Y values. It takes all number types
X/Y It will return the output of dividing Y from X. It takes all number types
X%Y It will return the remainder resulting from dividing X by Y. It takes all number types
X&Y It will return the output of bitwise AND of X and Y. It takes all number types
X|Y It will return the output of bitwise OR of X and Y. It takes all number types
X^Y It will return the output of bitwise XOR of X and Y. It takes all number types
~X It will return the output of bitwise NOT of X. It takes all number types
These operators are used for logical operations such as AND, OR, and NOT between operands. The following table gives details of the logical operators in HiveQL:
X AND Y TRUE if both X and Y are TRUE, otherwise FALSE. Boolean types only
X && Y Same as X AND Y but here we using && symbol Boolean types only
X OR Y TRUE if either X or Y or both are TRUE, otherwise FALSE. Boolean types only
The following table describes the complex type operators, which provide a mechanism to access elements in complex types.
A[n]: A is an array and n is an int; it returns the nth element of the array A, where the first element has index 0.
The following table describes the complex type constructors. They construct instances of complex data types such as Array, Map, and Struct in Hive. In this section, we look at the operations performed with the complex type constructors.
array(val1, val2, ...): creates an array with the given elements, such as val1 and val2.
create_union(tag, val1, val2, ...): creates a union type with the value that the tag parameter points to.
map(key1, value1, key2, value2, ...): creates a map with the key/value pairs given in the operands.
named_struct(name1, val1, name2, val2, ...): creates a struct with the field names and values given in the operands.
struct(val1, val2, val3, ...): creates a struct with the given field values; the struct field names will be col1, col2, and so on.
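A small illustration of these constructors in a query; the employees_guru table from later in this unit is reused here purely as a source of rows:
SELECT array(1, 2, 3), map('dept', Department), named_struct('id', Id, 'name', Name)
FROM employees_guru LIMIT 1;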
Summary
Hive Query Language (HiveQL) is a query language in Apache Hive for processing and
analyzing structured data.
Hive provides Built-in operators for Data operations to be implemented on the tables
present inside Hive warehouse.
Arithmetic Operators
Logical Operators
Managed Tables
In a managed table, both the table data and the table schema are managed by Hive. The data will be
located in a folder named after the table within the Hive data warehouse, which is essentially just a
file location in HDFS.
The location is user-configurable when Hive is installed (the default location is /user/hive/warehouse). By managed or controlled we mean that if you drop (delete) a managed table, then Hive will delete both the schema (the description of the table) and the data files associated with the table. For example, a managed table named stocks can be created as follows (only a few of its columns are shown):

CREATE TABLE IF NOT EXISTS stocks (
   symbol STRING,
   price_open FLOAT,
   price_high FLOAT,
   price_low FLOAT,
   price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
As with managed tables, you can also copy the schema (but not the data) of an existing table and point the copy at its own data location; for instance (the new table name is illustrative):

CREATE TABLE IF NOT EXISTS mydb.employees_copy
LIKE mydb.employees
LOCATION '/path/to/data';
External Tables
An external table is one where only the table schema is controlled by Hive. In most cases, the user
will set up the folder location within HDFS and copy the data file(s) there. This location is included as
part of the table definition statement. When an external table is deleted, Hive will only delete the
schema associated with the table. The data files are not affected.
Syntax to Create External Table
For example, an external stocks table pointing at an existing HDFS directory (only a few of its columns are shown):

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
   symbol STRING,
   price_open FLOAT,
   price_high FLOAT,
   price_low FLOAT,
   price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
Managed tables: Hive assumes that it owns the data. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. Hive stores the data in its warehouse directory.
External tables: Hive assumes that it does not manage the data. Dropping the table does not delete the data, although the metadata for the table will be deleted. Hive stores the data in the LOCATION specified during creation of the table (generally not in the warehouse directory).
You can tell whether or not a table is managed or external from the output of DESCRIBE EXTENDED tablename. Near the end of the Detailed Table Information output, you will see tableType:MANAGED_TABLE for managed tables and tableType:EXTERNAL_TABLE for external tables.
Note: If you omit the EXTERNAL keyword and the original table is external, the new table will also be
external. If you omit EXTERNAL and the original table is managed, the new table will also be
managed. However, if you include the EXTERNAL keyword and the original table is managed, the
new table will be external. Even in this scenario, the LOCATION clause will still be optional.
Hive Queries: Order By, Group By, Distribute By, Cluster By Examples
Hive provides SQL type querying language for the ETL purpose on top of Hadoop file system.
Hive Query language (HiveQL) provides SQL type environment in Hive to work with tables,
databases, queries.
We can have different types of clauses associated with Hive to perform different types of data manipulation and querying, as well as ETL functionality such as extracting, transforming, and loading data into tables. For better connectivity with different nodes outside the environment, Hive also provides JDBC connectivity.
Before getting into the main topic of this tutorial, we will first create a table to use as a reference in the examples that follow. Here we are going to create a table "employees_guru" with 6 columns.
1. We create the table "employees_guru" with 6 columns: Id, Name, Age, Address, Salary, and Department, which belong to the employees of the organization "guru".
2. In this step we load data into the employees_guru table; the data to be loaded is placed in the Employees.txt file. Both steps are sketched below.
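A sketch of these two steps; the field delimiter and the local path to Employees.txt are assumptions:
CREATE TABLE employees_guru (Id INT, Name STRING, Age INT, Address STRING, Salary FLOAT, Department STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/user/Employees.txt' INTO TABLE employees_guru;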
Order by query:
The ORDER BY syntax in HiveQL is similar to the syntax of ORDER BY in SQL language.
ORDER BY is the clause we use with the SELECT statement in Hive queries to sort data. The ORDER BY clause uses columns of Hive tables for sorting particular column values. For whatever column name we specify in the ORDER BY clause, the query selects and displays the results in ascending or descending order of that column's values.
If the ORDER BY field is a string, the results are displayed in lexicographical order. At the back end, all of the data has to be passed to a single reducer.
From the query and its output, we can observe the following:
1. The query is performed on the "employees_guru" table with the ORDER BY clause, with Department as the ORDER BY column name.
2. In the output, the results are displayed ordered by the Department column values, such as ADMIN, Finance, and so on.
Query:
SELECT * FROM employees_guru ORDER BY Department;
Group by query:
The GROUP BY clause uses columns of Hive tables for grouping particular column values. For whatever column name we specify in the GROUP BY clause, the query selects and displays the results grouped by that column's values.
For example, the query below displays the total count of employees present in each department; here we have "Department" as the GROUP BY column.
1. The query is performed on the "employees_guru" table with the GROUP BY clause, with Department as the GROUP BY column name.
2. The output shows each department name and the employee count in that department. All the employees belonging to a specific department are grouped together, so the result is the department name with the total number of employees present in it.
Query:
SELECT Department, count(*) FROM employees_guru GROUP BY Department;
Sort by:
The SORT BY clause operates on column names of Hive tables to sort the output. We can mention DESC for sorting in descending order and ASC for ascending order.
SORT BY sorts the rows before feeding them to the reducer, and the sort always depends on the column types: if the column type is numeric it sorts in numeric order; if it is a string it sorts in lexicographical order.
In the example below:
1. The query is performed on the table "employees_guru" with the SORT BY clause, with "id" as the SORT BY column name, and the keyword DESC.
Query:
SELECT * FROM employees_guru SORT BY Id DESC;
Cluster By:
Cluster By used as an alternative for both Distribute BY and Sort BY clauses in Hive-QL.
The CLUSTER BY clause is used on tables present in Hive. Hive uses the columns in CLUSTER BY to distribute the rows among reducers, and the CLUSTER BY columns go to multiple reducers. It ensures the sorting order of the values present across those multiple reducers.
For example, suppose the CLUSTER BY clause is applied to the Id column of the employees_guru table. The output of this query is distributed to multiple reducers at the back end, but at the front end it acts as an alternative to both SORT BY and DISTRIBUTE BY.
This is the back-end process that takes place, in terms of the MapReduce framework, when we perform a query with SORT BY, GROUP BY, and CLUSTER BY. So if we want to store results into multiple reducers, we go with CLUSTER BY.
1. The query performs the CLUSTER BY clause on the Id field value, so the results are sorted on Id within each reducer, as sketched below.
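A representative query (the projected columns are chosen for illustration):
SELECT Id, Name FROM employees_guru CLUSTER BY Id;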
Distribute By:
The DISTRIBUTE BY clause is used on tables present in Hive. Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers; all rows with the same DISTRIBUTE BY column value go to the same reducer.
Query:
SELECT Id, Name FROM employees_guru DISTRIBUTE BY Id;
In Hive, users can define their own functions to meet certain client requirements. These are known as UDFs (user-defined functions) in Hive, and they are written in Java for specific modules.
Some UDFs are specifically designed for reusability of code in application frameworks. The developer writes these functions in Java and integrates the UDFs with Hive.
During query execution, the developer can directly use the code, and the UDFs return outputs according to the user-defined tasks. This provides high performance in terms of coding and execution.
For example, for string stemming we don’t have any predefined function in Hive. For this, we can
write stem UDF in Java. Wherever we require Stem functionality, we can directly call this Stem UDF
in Hive.
Here stem functionality means deriving words from its root words. It is like stemming algorithm
reduces the words “wishing”, “wished”, and “wishes” to the root word “wish.” For performing this
type of functionality, we can write UDF in Java and integrate it with Hive.
UDFs can be written depending on the use case; they can accept and produce different numbers of input and output values.
The general type of UDF accepts a single input value and produces a single output value. If such a UDF is used in a query, it is called once for each row in the result data set.
Alternatively, a UDF can accept a group of values as input and return a single output value. A minimal Java sketch of the first kind follows.
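A minimal sketch of such a UDF in Java, using the classic org.apache.hadoop.hive.ql.exec.UDF API; the class name and the crude suffix-stripping logic are illustrative and not a real stemming algorithm:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class SimpleStem extends UDF {
   // Null-safe; strips a few common English suffixes as a rough stand-in for stemming.
   public Text evaluate(Text input) {
      if (input == null) return null;
      String word = input.toString().toLowerCase();
      for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
         if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
            word = word.substring(0, word.length() - suffix.length());
            break;
         }
      }
      return new Text(word);
   }
}

Such a class would typically be packaged into a jar, registered with ADD JAR, exposed with CREATE TEMPORARY FUNCTION stem AS 'SimpleStem', and then called like a built-in function, e.g. SELECT stem(Name) FROM employees_guru.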
HBase - Overview
Since 1970, RDBMSs have been the solution for data storage and maintenance problems. After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop.
Hadoop uses a distributed file system for storing big data, and MapReduce to process it. Hadoop excels at storing and processing huge volumes of data in various formats, whether structured, semi-structured, or unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a sequential manner.
That means one has to search the entire dataset even for the simplest of jobs.
A huge dataset when processed results in another huge data set, which should also be processed
sequentially. At this point, a new solution is needed to access any point of data in a single unit of
time (random access).
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
HBase is a data model, similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase, which sits on top of the Hadoop File System and provides read and write access.
HDFS vs HBase:
HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.
HDFS provides high-latency batch processing; HBase provides low-latency access to single rows from billions of records (random access).
HDFS provides only sequential access to data; HBase internally uses hash tables and provides random access, and it stores its data in indexed HDFS files for faster lookups.
HBase is a column-oriented database, and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs. A table has multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk, and each cell value of the table has a timestamp. In short, in HBase:
a table is a collection of rows;
a row is a collection of column families;
a column family is a collection of columns;
a column is a collection of key-value pairs.
Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data; in short, they have column families.
Row-oriented databases are suitable for Online Transaction Processing (OLTP) and are designed for a small number of rows and columns; column-oriented databases are suitable for Online Analytical Processing (OLAP) and are designed for huge tables.
HBase vs RDBMS:
HBase is schema-less; it does not have the concept of a fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
Features of HBase
Apache HBase is used for random, real-time read/write access to Big Data.
Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable works on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase
HBase is used whenever we need to provide fast random access to available data.
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase History
Year Event
Oct 2007 The first usable HBase along with Hadoop 0.15.0 was released.
This chapter describes the Java client API for HBase, which is used to perform CRUD operations on HBase tables. HBase is written in Java and has a Java native API; therefore it provides programmatic access to Data Manipulation Language (DML).
Class HTable
HTable is an HBase internal class that represents an HBase table. It is an implementation of table
that is used to communicate with a single HBase table. This class belongs to
the org.apache.hadoop.hbase.client class.
Constructors
HTable(): using this constructor (or its variants that take a configuration and a table name), you can create an object to access an HBase table.
Methods
void close(): releases all the resources held by this HTable instance.
boolean exists(Get get): using this method, you can test the existence of columns in the table, as specified by Get.
Result get(Get get): retrieves certain cells from a given row.
org.apache.hadoop.conf.Configuration getConfiguration(): returns the Configuration object used by this instance.
TableName getName(): returns the table name.
HTableDescriptor getTableDescriptor(): returns the table descriptor for this table.
byte[] getTableName(): returns the name of this table.
void put(Put put): using this method, you can insert data into the table.
Class Put
This class is used to perform Put operations for a single row. It belongs to
the org.apache.hadoop.hbase.client package.
Constructors
Put(byte[] row): using this constructor, you can create a Put operation for the specified row.
Put(byte[] rowArray, int rowOffset, int rowLength): using this constructor, you can make a copy of the passed-in row key to keep local.
Put(byte[] rowArray, int rowOffset, int rowLength, long ts): using this constructor, you can make a copy of the passed-in row key to keep local, together with a timestamp.
Put(byte[] row, long ts): using this constructor, we can create a Put operation for the specified row, using a given timestamp.
Methods
Put add(byte[] family, byte[] qualifier, byte[] value): adds the specified column and value to this Put operation.
Put add(byte[] family, byte[] qualifier, long ts, byte[] value): adds the specified column and value, with the specified timestamp as its version, to this Put operation.
Class Get
This class is used to perform Get operations on a single row. This class belongs to
the org.apache.hadoop.hbase.client package.
Constructors
Get(byte[] row): using this constructor, you can create a Get operation for the specified row.
Get(Get get): copies an existing Get instance.
Methods
Get addColumn(byte[] family, byte[] qualifier): retrieves the column from the specific family with the specified qualifier.
Get addFamily(byte[] family): retrieves all columns from the specified family.
Class Delete
This class is used to perform Delete operations on a single row. To delete an entire row, instantiate a
Delete object with the row to delete. This class belongs to
the org.apache.hadoop.hbase.client package.
Constructor
Delete(byte[] row): creates a Delete operation for the specified row.
Methods
Delete deleteColumns(byte[] family, byte[] qualifier, long timestamp): deletes all versions of the specified column with a timestamp less than or equal to the specified timestamp.
Delete deleteFamily(byte[] family, long timestamp): deletes all columns of the specified family with a timestamp less than or equal to the specified timestamp.
Class Result
This class is used to get a single row result of a Get or a Scan query.
Constructors
Result(): using this constructor, you can create an empty Result with no KeyValue payload; it returns null if you call rawCells().
Methods
byte[] getValue(byte[] family, byte[] qualifier): this method is used to get the latest version of the specified column.
byte[] getRow(): this method is used to retrieve the row key that corresponds to the row from which this Result was created.
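A minimal sketch tying these classes together, reading one cell back from a table; the table name, row key, column family, and qualifier are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadOneCell {
   public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "sampleTable");   // table name is an assumption
      Get get = new Get(Bytes.toBytes("row1"));         // row key is an assumption
      get.addColumn(Bytes.toBytes("region"), Bytes.toBytes("country"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("region"), Bytes.toBytes("country"));
      System.out.println("country = " + Bytes.toString(value));
      table.close();
   }
}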
HBase Example
Use Case
We have to import data present in the file into an HBase table by creating it through Java API.
1,India,Bihar,Champaran,2009,April,P1,1,5
2,India,Bihar,Patna,2009,May,P1,2,10
3,India,Bihar,Bhagalpur,2010,June,P2,3,15
4,United States,California,Fresno,2009,April,P2,2,5
5,United States,California,Long Beach,2010,July,P2,4,10
6,United States,California,San Francisco,2011,August,P1,6,20
This data has to be loaded into a new HBase table created through the Java API. The following column families have to be created:
"sample", "region", "time", "product", "sale", "profit".
The column family region has three column qualifiers: country, state, city.
Jar Files
Make sure that the following jars are present while writing the code, as they are required by HBase.
a. commons-logging-1.0.4
b. commons-logging-api-1.0.4
c. hadoop-core-0.20.2-cdh3u2
d. hbase-0.90.4-cdh3u2
e. log4j-1.2.15
f. zookeeper-3.3.3-cdh3u0
Program Code
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class readFromFile {
   public static void main(String[] args) throws IOException {
      if (args.length == 1) {
         Configuration conf = HBaseConfiguration.create(new Configuration());
         HBaseAdmin hba = new HBaseAdmin(conf);
         if (!hba.tableExists(args[0])) {
            // Create the table with the six column families used below
            HTableDescriptor ht = new HTableDescriptor(args[0]);
            ht.addFamily(new HColumnDescriptor("sample"));
            ht.addFamily(new HColumnDescriptor("region"));
            ht.addFamily(new HColumnDescriptor("time"));
            ht.addFamily(new HColumnDescriptor("product"));
            ht.addFamily(new HColumnDescriptor("sale"));
            ht.addFamily(new HColumnDescriptor("profit"));
            hba.createTable(ht);
            System.out.println("New Table Created");

            HTable table = new HTable(conf, args[0]);

            // Read the comma-separated input file line by line
            File f = new File("/home/training/Desktop/data");
            BufferedReader br = new BufferedReader(new FileReader(f));
            String line = br.readLine();
            int i = 1;
            String rowname = "row";
            while (line != null && line.length() != 0) {
               System.out.println("Ok till here");
               StringTokenizer tokens = new StringTokenizer(line, ",");
               rowname = "row" + i;
               // One Put per input line; each field goes to its column family/qualifier
               Put p = new Put(Bytes.toBytes(rowname));
               p.add(Bytes.toBytes("sample"), Bytes.toBytes("sampleNo."),
                     Bytes.toBytes(Integer.parseInt(tokens.nextToken())));
               p.add(Bytes.toBytes("region"), Bytes.toBytes("country"), Bytes.toBytes(tokens.nextToken()));
               p.add(Bytes.toBytes("region"), Bytes.toBytes("state"), Bytes.toBytes(tokens.nextToken()));
               p.add(Bytes.toBytes("region"), Bytes.toBytes("city"), Bytes.toBytes(tokens.nextToken()));
               p.add(Bytes.toBytes("time"), Bytes.toBytes("year"),
                     Bytes.toBytes(Integer.parseInt(tokens.nextToken())));
               p.add(Bytes.toBytes("time"), Bytes.toBytes("month"), Bytes.toBytes(tokens.nextToken()));
               p.add(Bytes.toBytes("product"), Bytes.toBytes("productNo."), Bytes.toBytes(tokens.nextToken()));
               p.add(Bytes.toBytes("sale"), Bytes.toBytes("quantity"),
                     Bytes.toBytes(Integer.parseInt(tokens.nextToken())));
               p.add(Bytes.toBytes("profit"), Bytes.toBytes("earnings"), Bytes.toBytes(tokens.nextToken()));
               i++;
               table.put(p);
               line = br.readLine();
            }
            br.close();
            table.close();
         } else {
            System.out.println("Table Already exists. Please enter another table name");
         }
      } else {
         System.out.println("Please Enter the table name through command line");
      }
   }
}
Relational Database Management System (RDBMS): RDBMS is the basis for SQL and for all modern database systems such as MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access. A relational database
management system (RDBMS) is a database management system (DBMS) that is based on the
relational model as introduced by E. F. Codd. An RDBMS is a type of DBMS with a row-based table
structure that connects related data elements and includes functions that maintain the security,
accuracy, integrity, and consistency of the data. The most basic RDBMS functions are create, read,
update and delete operations. An RDBMS follows the ACID properties.
Applications:
Tracking and managing day-to-day activities and transactions such as production, stocking,
income and expenses, and purchases.
HBase: HBase is a column-oriented database management system that runs on top of the Hadoop Distributed File System (HDFS). It is well suited for sparse data sets, which are common in many big data use cases. It is an open-source, distributed database developed by the Apache Software Foundation, modeled after Google's Bigtable, and is primarily written in Java. It can store massive amounts of data, from terabytes to petabytes. It is built for low-latency operations and is used extensively for read and write operations. It stores a large amount of data in the form of tables.
Application:
HBase is used internally by companies including Facebook, Twitter, Yahoo, and Adobe.
Scalability: An RDBMS allows scaling up; rather than adding new servers, the current server is upgraded to a more capable one whenever more memory, processing power, or disk space is needed. HBase allows scale-out; when extra memory and disk space are required, new servers are added to the cluster rather than upgrading the existing ones.
Sparse data: An RDBMS cannot handle sparse data; HBase can handle sparse data.
Volume of data: In an RDBMS, the amount of data is determined by the server's configuration. In HBase, the amount of data depends on the number of machines deployed rather than on a single machine.
Normalization: In an RDBMS, you can normalize the data. The data in HBase is not normalized, which means there is no logical relationship or connection between distinct tables of data.