BDA Unit 5 Notes
Hive Architecture:
The following components describe how a query submitted to Hive flows through the system.
Hive Client:
Hive allows writing applications in various languages, including Java, Python, and C++.
It supports different types of clients, such as:
Thrift Server: a cross-language service provider platform that serves requests from all programming languages that support Thrift.
Hive Services:
The following are the services provided by Hive:
Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and
commands.
Hive Web User Interface: The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
Hive MetaStore: It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
Hive Server: It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
Hive Driver: It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC
driver. It transfers the queries to the compiler.
Hive Compiler: The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs (see the EXPLAIN example after this list).
Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
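For example, the plan that the compiler and optimizer produce for a statement can be inspected with the EXPLAIN keyword (the table name here is only a placeholder):
hive> EXPLAIN SELECT city, COUNT(*) FROM employees GROUP BY city;
The output lists the stages the query compiles to and their dependencies, including the map and reduce operators that will run on the cluster.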
Installing Hive:
Hive runs on your workstation and converts your SQL query into a series of jobs for execution on
a Hadoop cluster. Hive organizes data into tables, which provide a means for attaching structure to
data stored in HDFS. Metadata, such as table schemas, is stored in a database called the metastore.
When starting out with Hive, it is convenient to run the metastore on your local machine. In this
configuration, which is the default, the Hive table definitions that you create will be local to your
machine, so you can’t share them with other users.
Installation of Hive is straightforward. As a prerequisite, check whether Java is installed using the command $ java -version, and check whether Hadoop is installed using the command $ hadoop version.
Download the Apache Hive tar file, and unpack the tarball in a suitable place on your
workstation:
% tar xzf apache-hive-x.y.z-bin.tar.gz
It’s handy to put Hive on your path to make it easy to launch:
% export HIVE_HOME=~/sw/apache-hive-x.y.z-bin
% export PATH=$PATH:$HIVE_HOME/bin
Now type hive to launch the Hive shell:
% hive
hive>
The Hive Shell:
The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL.
HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL, so if you are
familiar with MySQL, you should feel at home using Hive.
When starting Hive for the first time, we can check that it is working by listing its tables. The
command must be terminated with a semicolon to tell Hive to execute it:
hive> SHOW TABLES;
OK
Time taken: 0.473 seconds
Like SQL, HiveQL is generally case-insensitive (except for string comparisons). You can also run
the Hive shell in non-interactive mode. The -f option runs the commands in the specified file,
which is script.q in this example:
% hive -f script.q
For short scripts, you can use the -e option to specify the commands inline, in which case the
final semicolon is not required:
% hive -e 'SELECT * FROM dummy'
OK
X
Time taken: 1.22 seconds, Fetched: 1 row(s)
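For illustration, a script file such as script.q simply contains HiveQL statements, each terminated by a semicolon; the dummy table from the example above stands in for a real table:
-- script.q (illustrative contents)
SHOW TABLES;
SELECT * FROM dummy;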
Running Hive:
This section looks at how to set up Hive to run against a Hadoop cluster and a shared metastore.
Configuring Hive:
Hive is configured using an XML configuration file like Hadoop’s. The file is called hive-site.xml
and is located in Hive’s conf directory. This file is where you can set properties that you want to
set every time you run Hive.
The same directory contains hive-default.xml, which documents the properties that Hive exposes
and their default values.
We can override the directory in which Hive looks for hive-site.xml by passing the --config option to the hive command:
% hive --config /Users/tom/dev/hive-conf
We can specify the file system and resource manager using the usual Hadoop properties, fs.defaultFS and yarn.resourcemanager.address. If they are not set, they default to the local file system and the local (in-process) job runner, just as they do in Hadoop.
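A minimal hive-site.xml that points Hive at a cluster might look like the following sketch (the host names are placeholders for your own namenode and resource manager):
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host/</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager-host:8032</value>
  </property>
</configuration>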
If you plan to have more than one Hive user sharing a Hadoop cluster, you need to make the
directories that Hive uses writable by all users.
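For example, assuming the default warehouse directory (/user/hive/warehouse) and the /tmp scratch directory, the following commands create them and make them writable by all users:
% hadoop fs -mkdir /tmp
% hadoop fs -chmod a+w /tmp
% hadoop fs -mkdir -p /user/hive/warehouse
% hadoop fs -chmod a+w /user/hive/warehouse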
Spark Architecture:
On the master node, we have the driver program, which drives our application. The code we write behaves as the driver program, or, if we are using the interactive shell, the shell acts as the driver program.
Inside the driver program, the first thing we do is create a SparkContext. Think of the Spark context as a gateway to all Spark functionality; it is similar to a database connection.
Any command we execute against a database goes through the database connection. Likewise, anything we do on Spark goes through the Spark context.
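A minimal sketch of a driver program using the Spark Java API is shown below; the class and application names are illustrative only.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MyDriver {
    public static void main(String[] args) {
        // The SparkConf describes the application; the SparkContext built from it
        // is the gateway through which every Spark operation passes.
        SparkConf conf = new SparkConf().setAppName("MyDriver");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // ... create RDDs and run jobs through sc ...

        sc.stop();   // release the context when the driver finishes
    }
}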
Now, this Spark context works with the cluster manager to manage various jobs. The driver program and the Spark context take care of job execution within the cluster.
A job is split into multiple tasks, which are distributed over the worker nodes. Any time an RDD is created in the Spark context, it can be distributed across various nodes and cached there.
Worker nodes are the slave nodes whose job is to execute the tasks. The tasks are executed on the partitioned RDDs in the worker nodes, and the results are returned to the Spark context.
The Spark context takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform operations, collect the results, and return them to the main Spark context.
If we increase the number of workers, we can divide jobs into more partitions and execute them in parallel over multiple systems, which is a lot faster.
With an increase in the number of workers, the available memory also increases, so we can cache more data in memory and execute jobs faster.
The workflow of the Spark architecture can be summarized in the following steps:
STEP 1: The client submits the Spark user application code. When the application code is submitted, the driver implicitly converts the user code that contains transformations and actions into a logical directed acyclic graph (DAG). At this stage, it also performs optimizations such as pipelining transformations.
STEP 2: After that, the driver converts the logical DAG into a physical execution plan with many stages. After converting into a physical execution plan, it creates physical execution units called tasks under each stage. The tasks are then bundled and sent to the cluster.
STEP 3: Now the driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on the worker nodes on behalf of the driver. At this point, the driver sends tasks to the executors based on data placement. When the executors start, they register themselves with the driver, so the driver has a complete view of the executors that are executing the tasks.
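For example, step 1 usually happens through the spark-submit tool; the class, master URL, and jar name below are placeholders:
% spark-submit --class com.example.MyDriver --master yarn --deploy-mode cluster myapp.jar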
Resilient Distributed Dataset (RDD): An RDD is a layer of abstracted data over a distributed collection. It is immutable in nature and follows lazy transformations. The data in an RDD is split into chunks based on a key.
RDDs are highly resilient, i.e., they are able to recover quickly from any issues as the same
data chunks are replicated across multiple executor nodes.
Thus, even if one executor node fails, another will still process the data. This allows you to
perform your functional calculations against your dataset very quickly by harnessing the
power of multiple nodes.
Moreover, once you create an RDD, it becomes immutable: an object whose state cannot be modified after it is created, but which can be transformed into new RDDs.
In a distributed environment, each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
Due to this, we can perform transformations or actions on the complete data in parallel. Also, you don’t have to worry about the distribution, because Spark takes care of that.
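The following sketch illustrates these ideas with the Spark Java API; the local master setting and the sample numbers are for illustration only.
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RDDExample").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create an RDD split into 4 logical partitions; each partition can be
        // computed on a different node of the cluster.
        JavaRDD<Integer> numbers =
                sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);

        // Transformations are lazy: this map does not run anything yet.
        JavaRDD<Integer> squares = numbers.map(x -> x * x);

        // The action triggers a job; tasks run in parallel over the partitions
        // and the results come back to the driver's Spark context.
        List<Integer> result = squares.collect();
        System.out.println(result);

        sc.stop();
    }
}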
HBase:
What is HBase?
HBase is an open-source, column-oriented distributed database system that runs on top of HDFS (Hadoop Distributed File System). It is modeled after Google's Bigtable and is primarily written in Java.
HBase can store massive amounts of data, from terabytes to petabytes. The tables present in HBase consist of billions of rows with millions of columns. HBase is built for low-latency operations and has some specific features compared to traditional relational models.
Why do we need HBase in big data?
Apache HBase is needed for real-time Big Data applications. A table for a popular web application may consist of billions of rows. If we want to look up a particular row in such a huge amount of data, HBase is the ideal choice because its query fetch time is low. Most online analytics applications use HBase.
Traditional relational data models fail to meet the performance requirements of very big
databases. These performance and processing limitations can be overcome by Apache HBase.
HBase Architecture:
Building an Online Query Application:
HDFS and MapReduce are powerful tools for processing batch operations over large datasets, but they do not provide ways to read or write individual records efficiently. We can overcome these drawbacks using HBase.
To implement the online query application, we will use the HBase Java API directly. Here it
becomes clear how important your choice of schema and storage format is.
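A minimal sketch of such a point lookup, assuming the older HTable-based client API that these notes refer to (newer HBase releases use Connection/Table instead) and the emp table created in the shell examples below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetRowExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "emp");

        // Fetch a single row by its row key -- a low-latency point lookup.
        Get get = new Get(Bytes.toBytes("1"));
        Result result = table.get(get);

        byte[] name = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(name));

        table.close();
    }
}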
Creating a Table using HBase Shell:
We can create a table using the create command; we must specify the table name and the column family name. The syntax to create a table in the HBase shell is shown below.
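create '<table name>', '<column family>'
For example, the emp table used throughout this section (its column families can be read off the put commands below) would be created with:
hbase(main):001:0> create 'emp', 'personal data', 'professional data'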
Verification: We can verify whether the table is created using the list command as shown
below. Here we can observe the created emp table.
hbase(main):002:0> list
TABLE
emp
2 row(s) in 0.0340 seconds
Inserting Data using HBase Shell:
To create data in an HBase table, the following commands and methods are used:
1.put command,
2.add() method of Put class, and
3.put() method of HTable class.
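Items 2 and 3 refer to the Java client API. A minimal sketch, again assuming the older HTable-based API named above (newer releases use Table and Put.addColumn instead):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutRowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "emp");

        // Build the row: row key "1", column "personal data:name" = "raju".
        Put put = new Put(Bytes.toBytes("1"));
        put.add(Bytes.toBytes("personal data"), Bytes.toBytes("name"), Bytes.toBytes("raju"));

        // put() of the HTable class writes the row to the table.
        table.put(put);
        table.close();
    }
}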
Using the put command (item 1 above), we can insert rows into a table from the HBase shell. Its syntax is as follows:
put '<table name>','<row key>','<column family:column name>','<value>'
Inserting the First Row:
Let us insert the first row values into the emp table as shown below.
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','professional data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):008:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds
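Verification: we can verify the inserted data by scanning the table; the scan command prints every cell of the emp table along with its row key, column name, timestamp, and value:
hbase(main):009:0> scan 'emp'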