LECTURE NOTES

BIG DATA ANALYTICS


(BDA)
III B.TECH–CSE-II SEM

SYLLABUS:

Hive: Installing Hive, Running Hive, Comparison with traditional Databases, HiveQL, Tables, Querying Data.
Spark: Installing Spark, Resilient Distributed Datasets, Shared Variables, Anatomy of
a Spark Job Run.
HBase: HBasics, Installation, clients, Building an Online Query Application.



Hive:
What is HIVE?
 Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was originally developed by Facebook.
 Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
 Using Hive, we can avoid the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).
Why do we need Hive in Big Data?
 Hive allows users to read, write, and manage petabytes of data using SQL. Hive is built on top of Apache Hadoop, an open-source framework used to efficiently store and process large datasets.
 As a result, Hive is closely integrated with Hadoop and is designed to work quickly on petabytes of data.
 Hive was created to run queries on the huge volumes of data that Facebook stored in HDFS. Today,
Hive is a successful Apache project used by many organizations as a general-purpose, scalable data
processing platform.
 Of course, SQL isn't ideal for every big data problem; for example, it is not a good fit for building complex machine-learning algorithms.
Hive Architecture:

 The following architecture explains the flow of a query submitted to Hive.

Hive Client:

 Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:
 Thrift Server: a cross-language service provider platform that serves requests from all programming languages that support Thrift.



 JDBC Driver: It is used to establish a connection between Hive and Java applications. The JDBC driver is implemented by the class org.apache.hadoop.hive.jdbc.HiveDriver.
 ODBC Driver: It allows applications that support the ODBC protocol to connect to Hive.

Hive Services:
 The following are the services provided by Hive:

 Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and
commands.
 Hive Web User Interface: The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
 Hive MetaStore: It is a central repository that stores the structure information of the various tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
 Hive Server: It is also referred to as the Apache Thrift Server. It accepts requests from different clients and passes them on to the Hive driver.
 Hive Driver: It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC
driver. It transfers the queries to the compiler.
 Hive Compiler: The purpose of the compiler is to parse the query and perform semantic analysis on
the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
 Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of MapReduce tasks and HDFS tasks. Finally, the execution engine executes these tasks in the order of their dependencies.
Installing Hive:
 Hive runs on your workstation and converts your SQL query into a series of jobs for execution on
a Hadoop cluster. Hive organizes data into tables, which provide a means for attaching structure to
data stored in HDFS. Metadata, such as table schemas, is stored in a database called the metastore.
 When starting out with Hive, it is convenient to run the metastore on your local machine. In this
configuration, which is the default, the Hive table definitions that you create will be local to your
machine, so you can’t share them with other users.
 Installation of Hive is straightforward. As a prerequisite, check whether Java is installed using the command java -version, and check whether Hadoop is installed using the command hadoop version.

Steps to install Apache Hive:

 Download the Apache Hive tar file, and unpack the tarball in a suitable place on your
workstation:
% tar xzf apache-hive-x.y.z-bin.tar.gz
 It’s handy to put Hive on your path to make it easy to launch:
% export HIVE_HOME=~/sw/apache-hive-x.y.z-bin
% export PATH=$PATH:$HIVE_HOME/bin
 Now type hive to launch the Hive shell:
% hive
hive>
The Hive Shell:
 The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL.
HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL, so if you are
familiar with MySQL, you should feel at home using Hive.
 When starting Hive for the first time, we can check that it is working by listing its tables. The
command must be terminated with a semicolon to tell Hive to execute it:
hive> SHOW TABLES;
OK
Time taken: 0.473 seconds
 Like SQL, HiveQL is generally case insensitive (except for string comparisons). You can also run
the Hive shell in non-interactive mode. The -f option runs the commands in the specified file,
which is script.q in this example:
% hive -f script.q
 For short scripts, you can use the -e option to specify the commands inline, in which case the
final semicolon is not required:
% hive -e 'SELECT * FROM dummy'
OK
X
Time taken: 1.22 seconds, Fetched: 1 row(s)
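 The dummy table used above is not created by Hive automatically; a minimal sketch of one way to create it (the file path /tmp/dummy.txt and the single value X are illustrative) is:
% echo 'X' > /tmp/dummy.txt
% hive -e "CREATE TABLE dummy (value STRING); LOAD DATA LOCAL INPATH '/tmp/dummy.txt' OVERWRITE INTO TABLE dummy"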
Running Hive:
 This section describes how to set up Hive to run against a Hadoop cluster and a shared metastore.
Configuring Hive:
 Hive is configured using an XML configuration file like Hadoop’s. The file is called hive-site.xml
and is located in Hive’s conf directory. This file is where you can set properties that you want to
set every time you run Hive.
 The same directory contains hive-default.xml, which documents the properties that Hive exposes
and their default values.
 We can override the directory in which Hive looks for hive-site.xml by passing the --config option to the hive command:
% hive --config /Users/tom/dev/hive-conf
 We can specify the filesystem and resource manager using the usual Hadoop properties, fs.defaultFS and yarn.resourcemanager.address. If they are not set, they default to the local filesystem and the local (in-process) job runner, just as they do in Hadoop.
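 For example, a minimal hive-site.xml for a (hypothetical) pseudo-distributed cluster running on localhost might look like this; the host name and port are illustrative:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
</configuration>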
 If you plan to have more than one Hive user sharing a Hadoop cluster, you need to make the
directories that Hive uses writable by all users.



 The following commands will create the directories and set their permissions appropriately:
% hadoop fs -mkdir /tmp
% hadoop fs -chmod a+w /tmp
% hadoop fs -mkdir -p /user/hive/warehouse
% hadoop fs -chmod a+w /user/hive/warehouse
 If all users are in the same group, then permissions g+w are sufficient on the warehouse
directory.
 You can change settings from within a session, too, using the SET command. This is useful for
changing Hive settings for a particular query. By itself, SET will list all the properties set by Hive.
 Use SET -v to list all the properties in the system, including Hadoop defaults. There is a
precedence hierarchy to setting properties. In the following list, lower numbers take precedence
over higher numbers:
1. The Hive SET command
2. The command-line -hiveconf option
3. hive-site.xml and the Hadoop site files (core-site.xml, hdfs-site.xml, mapred-site.xml,
and yarn-site.xml)
4. The Hive defaults and the Hadoop default files (core-default.xml, hdfs-default.xml,
mapred-default.xml, and yarn-default.xml)
Execution engines:
 Hive was originally written to use MapReduce as its execution engine, and that is still the default.
It is now also possible to run Hive using Apache Tez as its execution engine, and work is
underway to support Spark. Both Tez and Spark are general directed acyclic graph (DAG)
engines that offer more flexibility and higher performance than MapReduce.
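 For example, assuming Tez is installed and configured on the cluster, the execution engine can be switched for a session by setting the hive.execution.engine property:
hive> SET hive.execution.engine=tez;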

Comparison with Traditional Databases:


 The following are the key differences between Hive and an RDBMS (traditional relational database).
 Hive resembles a traditional database by supporting a SQL interface, but it is not a full database. Hive is better described as a data warehouse than as a database.
 Hive enforces the schema at read time, whereas an RDBMS enforces the schema at write time. In an RDBMS, a table's schema is enforced at data load time; if the data being loaded doesn't conform to the schema, it is rejected. This design is called schema on write.
 Hive, by contrast, doesn't verify the data when it is loaded, but rather when a query is issued. This is called schema on read. Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database's internal format; the load operation is just a file copy or move.
 Schema on write makes query-time performance faster, since the database can index columns and compress the data, but it takes longer to load data into the database.
 Hive is based on the notion of write once, read many times, whereas an RDBMS is designed for reading and writing many times.



 In an RDBMS, record-level updates, insertions and deletes, transactions, and indexes are possible. These are not allowed in Hive, because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table.
 In an RDBMS, the maximum data size is typically in the tens of terabytes, whereas Hive can handle hundreds of petabytes with ease.
 As Hadoop is a batch-oriented system, Hive doesn't support OLTP (Online Transaction Processing). It is closer to OLAP (Online Analytical Processing), but not ideal, since there is significant latency between issuing a query and receiving a reply, due to the overhead of MapReduce jobs and the size of the datasets Hadoop was designed to serve.
 An RDBMS is best suited for dynamic data analysis where fast responses are expected, whereas Hive is suited for data warehouse applications, where relatively static data is analyzed, fast response times are not required, and the data is not changing rapidly.
 To overcome the limitations of Hive, HBase is being integrated with Hive to support record-level operations and OLAP.
 Hive scales out easily and at low cost, whereas an RDBMS does not scale as well and is very costly to scale up.
HiveQL:
 Hive’s SQL dialect, called HiveQL, is a mixture of SQL-92, MySQL, and Oracle’s SQL dialect.
 A high-level comparison of SQL and HiveQL



Data Types:
 Hive supports both primitive and complex data types. Primitives include numeric, Boolean, string,
and timestamp types. The complex data types include arrays, maps, and structs.
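 As an illustration (the table and column names here are made up), a single table can mix primitive and complex types:
CREATE TABLE complex (
  c1 ARRAY<INT>,
  c2 MAP<STRING, INT>,
  c3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);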

Operators and Functions:


 The usual set of SQL operators is provided by Hive: relational operators (such as x = 'a' for testing equality, x IS NULL for testing nullity, and x LIKE 'a%' for pattern matching), arithmetic operators (such as x + 1 for addition), and logical operators (such as x OR y for logical OR).
 Hive comes with a large number of built-in functions, which include mathematical and statistical
functions, string functions, date functions (for operating on string representations of dates),
conditional functions, aggregate functions, and functions for working with XML and JSON.
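 For example, a few of the built-in functions can be tried against the dummy table created earlier (the literal values are arbitrary):
hive> SELECT round(3.14), concat('big', 'data'), year('2024-01-15') FROM dummy;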
Tables:
 A Hive table is logically made up of the data being stored and the associated metadata describing the
layout of the data in the table. The data typically resides in HDFS, although it may reside in any
Hadoop filesystem, including the local file system or S3.
 Hive stores the metadata in a relational database and not in HDFS.
Managed Tables and External Tables:
 When you create a table in Hive, by default Hive will manage the data, which means that Hive moves
the data into its warehouse directory. Alternatively, you may create an external table, which tells Hive
to refer to the data that is at an existing location outside the warehouse directory.
 The difference between the two table types is seen in the LOAD and DROP semantics. Let’s consider a
managed table first. When you load data into a managed table, it is moved into Hive’s warehouse
directory.
 For example, this:
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
 This will move the file hdfs://user/tom/data.txt into Hive's warehouse directory for the managed_table table, which is hdfs://user/hive/warehouse/managed_table.
 The load operation is very fast because it is just a move or rename within a file system. However, bear
in mind that Hive does not check that the files in the table directory conform to the schema declared
for the table, even for managed tables.
 If the table is later dropped, using DROP TABLE managed_table; then the table, including its metadata and its data, is deleted.
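 By contrast, a minimal sketch of an external table (the location /user/tom/external_table is illustrative) looks like this; dropping it removes only the metadata and leaves the files in place:
CREATE EXTERNAL TABLE external_table (dummy STRING)
  LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;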
Storage Formats:
 There are two dimensions that govern table storage in Hive: the row format and the file format. The
row format dictates how rows, and the fields in a particular row, are stored. In Hive, the row format
is defined by a SerDe (Serializer-Deserializer).
 When acting as a deserializer, which is the case when querying a table, it will deserialize a row of
data from the bytes in the file to objects used internally by Hive to operate on that row of data.
 When used as a serializer, which is the case when performing an INSERT or CTAS, it will serialize
Hive’s internal representation of a row of data into the bytes that are written to the output file.
 The simplest format is a plain-text file, but there are row-oriented and column-oriented binary
formats available too.
 The default storage format is delimited text.
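 For example, a table stored as delimited text can spell out its delimiters explicitly; the sketch below simply states Hive's defaults (Control-A field separator, newline row terminator), and the table and column names are made up:
CREATE TABLE delimited_example (id INT, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;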
CREATE TABLE...AS SELECT:
 It’s very convenient to store the output of a Hive query in a new table, perhaps because it is too large
to be dumped to the console or because there are further processing steps to carry out on the result.
 The new table’s column definitions are derived from the columns retrieved by the SELECT clause. In
the following query, the target table has two columns named col1 and col2 whose types are the same
as the ones in the source table:
CREATE TABLE target
AS
SELECT col1, col2
FROM source;
 A CTAS operation is atomic, so if the SELECT query fails for some reason, the table is not created.
Altering Tables:
 Because Hive uses the schema-on-read approach, it’s flexible in permitting a table’s definition to
change after the table has been created. You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;
 In addition to updating the table metadata, ALTER TABLE moves the underlying table directory so that
it reflects the new name. Hive allows you to change the definition for columns, add new columns, or
even replace all existing columns in a table with a new set.
 For example, consider adding a new column:
ALTER TABLE target ADD COLUMNS (col3 STRING);
 The new column col3 is added after the existing (nonpartition) columns.



Dropping Tables:
 The DROP TABLE statement deletes the data and metadata for a table. In the case of external tables,
only the metadata is deleted; the data is left untouched. If you want to delete all the data in a table but
keep the table definition, use TRUNCATE TABLE.
 For example: TRUNCATE TABLE my_table;
 This doesn’t work for external tables; instead, use dfs -rmr (from the Hive shell) to remove the
external table directory directly.
 If you want to create a new, empty table with the same schema as another table, then use the LIKE
keyword:
CREATE TABLE new_table LIKE existing_table;
Querying Data:
 This section shows how to use various forms of the SELECT statement to retrieve data from Hive.
Sorting and Aggregating:
 Sorting data in Hive can be achieved with a standard ORDER BY clause, but ORDER BY produces a totally sorted result using a single reducer, which is inefficient for large datasets. For a locally sorted (per-reducer) result, use SORT BY, optionally combined with DISTRIBUTE BY to control which reducer each row goes to, as in the following example:
hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC, temperature DESC;
1949 111
1949 78
1950 22
1950 0
1950 -11
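 For aggregation, HiveQL supports the standard GROUP BY clause; a minimal sketch over the same records2 table (assuming the year and temperature columns shown above) is:
hive> SELECT year, MAX(temperature)
    > FROM records2
    > GROUP BY year;
1949    111
1950    22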
Joins:
 The simplest kind of join is the inner join, where each match in the input tables results in a row in
the output. Consider two small demonstration tables, sales (which lists the names of people and the IDs
of the items they bought) and things (which lists the item IDs and their names):
hive> SELECT * FROM sales;
Joe     2
Hank    4
Ali     0
Eve     3
Hank    2
hive> SELECT * FROM things;
2       Tie
4       Coat
3       Hat
1       Scarf
 We can perform an inner join on the two tables as follows:
hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 4 4 Coat
Eve 3 3 Hat
Hank 2 2 Tie
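 Other join types follow the same pattern; for example, a LEFT OUTER JOIN keeps rows from the left table even when there is no match (Ali's item ID 0 does not appear in things, so the things columns are NULL):
hive> SELECT sales.*, things.*
    > FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Joe     2       2       Tie
Hank    4       4       Coat
Ali     0       NULL    NULL
Eve     3       3       Hat
Hank    2       2       Tie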
Spark:
What is Spark?
 Apache Spark is a cluster computing framework for large-scale data processing. Spark does not
use MapReduce as an execution engine; instead, it uses its own distributed runtime for
executing work on a cluster.
 Spark is closely integrated with Hadoop: it can run on YARN and works with Hadoop file
formats and storage backends like HDFS.
Why do we need Spark in big data?
 Simply put, Spark is a fast and general engine for large-scale data processing. Fast means that it is faster than previous approaches to working with big data, such as classical MapReduce.
 The secret to being faster is that Spark runs in memory (RAM), which makes the processing much faster than working from disk.
Architecture of the Spark:
 Spark's architecture is well layered and integrated with other libraries, making it easier to use. It is a master/slave architecture and has two main daemons: the master daemon and the worker daemon.

Spark Architecture

 In the master node, we have the driver program, which drives our application. The code we write behaves as the driver program, or, if we are using the interactive shell, the shell acts as the driver program.
 Inside the driver program, the first thing we do is create a SparkContext. Think of the Spark context as a gateway to all Spark functionality; it is similar to a database connection.
 Any command we execute in our database goes through the database connection. Likewise,
anything we do on Spark goes through Spark context.
 Now, the Spark context works with the cluster manager to manage various jobs. The driver program and the Spark context take care of job execution within the cluster.
 A job is split into multiple tasks, which are distributed over the worker nodes. Anytime an RDD is created in the Spark context, it can be distributed across various nodes and cached there.
 Worker nodes are the slave nodes whose job is to execute the tasks. The tasks are executed on the partitioned RDDs in the worker nodes, and the results are returned to the Spark context.
 The Spark context takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform operations, collect the results, and return them to the main Spark context.
 If we increase the number of workers, we can divide jobs into more partitions and execute them in parallel over multiple systems, which is much faster.
 With an increase in the number of workers, the available memory also increases, so more data can be cached and jobs execute faster.
 To understand the workflow of the Spark architecture, have a look at the steps below:

 STEP 1: The client submits the Spark user application code. When an application is submitted, the driver implicitly converts the user code, which contains transformations and actions, into a logical directed acyclic graph (DAG). At this stage, it also performs optimizations such as pipelining transformations.
 STEP 2: After that, it converts the logical graph (DAG) into a physical execution plan with many stages. After converting to a physical execution plan, it creates physical execution units called tasks under each stage. The tasks are then bundled and sent to the cluster.
 STEP 3: Now the driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on worker nodes on behalf of the driver. At this point, the driver sends the tasks to the executors based on data placement. When executors start, they register themselves with the driver, so the driver has a complete view of the executors that are executing the tasks.



 STEP 4: During the course of task execution, the driver program monitors the set of executors that are running. The driver node also schedules future tasks based on data placement.
 This architecture is further integrated with various extensions and libraries. Apache Spark
Architecture is based on two main abstractions:
1. Resilient Distributed Dataset (RDD)
2. Directed Acyclic Graph (DAG)
Resilient Distributed Dataset (RDD):
 RDDs are the building blocks of any Spark application. RDD stands for:
Resilient: fault tolerant and capable of rebuilding data on failure
Distributed: data is distributed among multiple nodes in a cluster
Dataset: a collection of partitioned data with values

 It is a layer of abstracted data over the distributed collection. It is immutable in nature and
follows lazy transformations. The data in an RDD is split into chunks based on a key.
 RDDs are highly resilient, i.e., they are able to recover quickly from any issues as the same
data chunks are replicated across multiple executor nodes.
 Thus, even if one executor node fails, another will still process the data. This allows you to
perform your functional calculations against your dataset very quickly by harnessing the
power of multiple nodes.
 Moreover, once you create an RDD, it becomes immutable: an object whose state cannot be modified after it is created, but which can be transformed into new RDDs.
 Talking about the distributed environment, each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster.
 Due to this, we can perform transformations or actions on the complete data parallelly. Also,
you don’t have to worry about the distribution, because Spark takes care of that.



Workflow of RDD
 There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or by referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, etc. With RDDs, you can perform two types of operations:
1. Transformations: They are the operations that are applied to create a new RDD.
2. Actions: They are applied on an RDD to instruct Apache Spark to apply computation and
pass the result back to the driver.
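 A minimal sketch in the Spark shell (Scala) illustrating both ways of creating an RDD and the two kinds of operations; the HDFS file path is illustrative and sc is the SparkContext provided by the shell:
scala> val fromCollection = sc.parallelize(Array(1, 2, 3, 4, 5)) // parallelize an existing collection
scala> val fromFile = sc.textFile("hdfs:///user/tom/data.txt")   // reference a dataset in external storage
scala> val doubled = fromCollection.map(_ * 2)                   // transformation: builds a new RDD lazily
scala> doubled.collect()                                         // action: runs the job, returns Array(2, 4, 6, 8, 10)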
Features of Apache Spark:
 1. Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. It is also able to achieve this speed through controlled partitioning.
 2. Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities.
 3. Deployment: It can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager.
 4. Real-Time: It offers real-time computation and low latency because of in-memory computation.
 5. Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written in any of these four languages. It also provides a shell in Scala and Python.
Installing Spark:
 Download a stable release of the Spark binary distribution from the downloads page and
unpack the tarball in a suitable location:
% tar xzf spark-x.y.z-bin-distro.tgz
 It’s convenient to put the Spark binaries on your path as follows:
% export SPARK_HOME=~/sw/spark-x.y.z-bin-distro
% export PATH=$PATH:$SPARK_HOME/bin
 We’re now ready to run an example in Spark.
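 For instance, the interactive shell gives a quick way to try it out (the input file name is illustrative):
% spark-shell
scala> val lines = sc.textFile("input.txt")
scala> lines.count()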



Shared Variables:
 In Spark, when a function is passed to a transformation operation, it is executed on a remote cluster node. It works on separate copies of all the variables used in the function.
 These variables are copied to each machine, and no updates to the variables on the remote machines are propagated back to the driver program.
Broadcast variable:
 Broadcast variables let a read-only variable be cached on each machine rather than shipping a copy of it with tasks. Spark uses efficient broadcast algorithms to distribute broadcast variables in order to reduce communication cost.
 The execution of Spark actions passes through several stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data required by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before each task is run.
 To create a broadcast variable (say, v), call SparkContext.broadcast(v). Let's understand this with an example.
scala> val v = sc.broadcast(Array(1, 2, 3))
scala> v.value
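// v.value returns the broadcast value, Array(1, 2, 3), on any node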
Accumulator:
 An accumulator is a variable that is used to perform associative and commutative operations such as counters or sums. Spark provides built-in support for accumulators of numeric types, and we can add support for new types.
 To create a numeric accumulator, call SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate values of type Long or Double.
scala> val a=sc.longAccumulator("Accumulator")
scala> sc.parallelize(Array(2,5)).foreach(x=>a.add(x))
scala> a.value
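// a.value returns 7 here (2 + 5, added by the foreach above)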
Anatomy of a Spark Job Run:
 What happens when we run a Spark job? At the highest level, there are two independent
entities: the driver, which hosts the application (SparkContext) and schedules tasks for a job;
and the executors, which are exclusive to the application, run for the duration of the
application, and execute the application’s tasks.
 Usually the driver runs as a client that is not managed by the cluster manager and the
executors run on machines in the cluster.
1. Job Submission:
 A Spark job is submitted automatically when an action (such as count()) is performed on an
RDD. Internally, this causes runJob() to be called on the SparkContext (step 1), which passes
the call on to the scheduler that runs as a part of the driver (step 2).
 The scheduler is made up of two parts: a DAG scheduler that breaks down the job into a DAG
of stages, and a task scheduler that is responsible for submitting the tasks from each stage to
the cluster.

How Spark runs a job


2. DAG Construction:
 To understand how a job is broken up into stages, we need to look at the type of tasks that can
run in a stage. There are two types: shuffle map tasks and result tasks. The name of the task
type indicates what Spark does with the task’s output:
Shuffle map tasks:
 As the name suggests, shuffle map tasks are like the map-side part of the shuffle in
MapReduce.
 Each shuffle map task runs a computation on one RDD partition and, based on a partitioning
function, writes its output to a new set of partitions, which are then fetched in a later stage.
Shuffle map tasks run in all stages except the final stage.
Result tasks:
 Result tasks run in the final stage that returns the result to the user’s program. Each result task
runs a computation on its RDD partition, then sends the result back to the driver, and the
driver assembles the results from each partition into a final result.
 Once the DAG scheduler has constructed the complete DAG of stages, it submits each stage’s
set of tasks to the task scheduler (step 3).
3. Task Scheduling:
 When the task scheduler is sent a set of tasks, it uses its list of executors that are running for
the application and constructs a mapping of tasks to executors that takes placement
preferences into account.
 Next, the task scheduler assigns tasks to executors that have free cores, and it continues to
assign more tasks as executors finish running tasks, until the task set is complete.



 Each task is allocated one core by default, although this can be changed by setting
spark.task.cpus.
 Assigned tasks are launched through a scheduler backend (step 4), which sends a remote
launch task message (step 5) to the executor backend to tell the executor to run the task (step
6).
4. Task Execution:
 An executor runs a task as follows (step 7). First, it makes sure that the JAR and file
dependencies for the task are up to date.
 The executor keeps a local cache of all the dependencies that previous tasks have used, so that
it only downloads them when they have changed.
 Second, it deserializes the task code from the serialized bytes that were sent as a part of the
launch task message.
 Third, the task code is executed. Note that tasks are run in the same JVM as the executor, so
there is no process overhead for task launch.
 Tasks can return a result to the driver. The result is serialized and sent to the executor
backend, and then back to the driver as a status update message.
 A shuffle map task returns information that allows the next stage to retrieve the output
partitions, while a result task returns the value of the result for the partition it ran on, which
the driver assembles into a final result to return to the user’s program.

HBase:
What is HBase?
 HBase is an open-source, column-oriented, distributed database that runs on top of HDFS (the Hadoop Distributed File System). It is modeled after Google's Bigtable and is written primarily in Java.
 HBase can store massive amounts of data, from terabytes to petabytes. Tables in HBase can consist of billions of rows with millions of columns. HBase is built for low-latency operations and has specific features that distinguish it from traditional relational models.
Why do we need HBase in big data?
 Apache HBase is needed for real-time big data applications. A table for a popular web application may consist of billions of rows. If we want to look up a particular row in such a huge amount of data, HBase is the ideal choice because query fetch time is low. Most online analytics applications use HBase.
 Traditional relational data models fail to meet the performance requirements of very big
databases. These performance and processing limitations can be overcome by Apache HBase.



Apache HBase Features:
1. HBase is built for low latency operations.
2. HBase is used extensively for random read and write operations.
3. HBase stores a large amount of data in terms of tables.
4. Provides linear and modular scalability over a cluster environment.
5. Strictly consistent reads and writes.
6. Automatic and configurable sharding of tables.
7. Easy-to-use Java API for client access.

HBase Architecture:

 HBase architecture consists of following components.


1. HMaster
2. HRegionServer
3. HRegions
4. ZooKeeper
5. HDFS
HMaster:
 HMaster in HBase is the implementation of a Master server in HBase architecture. It acts as a
monitoring agent to monitor all Region Server instances present in the cluster and acts as an
interface for all the metadata changes. In a distributed cluster environment, Master runs on
NameNode. Master runs several background threads.
 The following are important roles performed by HMaster in HBase.
1. Plays a vital role in terms of performance and maintaining nodes in the cluster.
2. HMaster provides admin functions and distributes services to the different region servers.
3. HMaster assigns regions to region servers.
4. HMaster has the features like controlling load balancing and failover to handle the load
over nodes present in the cluster.
5. When a client wants to change a schema or any metadata, HMaster takes responsibility for these operations.



HBase Region Servers:
 When an HBase Region Server receives write and read requests from a client, it assigns the request to the specific region where the actual column family resides. The client can contact HRegion servers directly; mandatory HMaster permission is not needed for the client to communicate with HRegion servers. The client requires HMaster's help only when operations related to metadata and schema changes are required.
 An HRegion server can serve multiple regions and performs the following functions.
1. Hosting and managing regions
2. Splitting regions automatically
3. Handling read and writes requests
4. Communicating with the client directly
HBase Regions:
 HRegions are the basic building blocks of an HBase cluster; they hold a portion of a table's data and are made up of column families. A region contains multiple stores, one for each column family, and each store consists mainly of two components: the MemStore and HFiles.
ZooKeeper:
 HBase ZooKeeper is a centralized monitoring server that maintains configuration information and provides distributed synchronization, that is, coordination services between the nodes of the distributed applications running across the cluster. If a client wants to communicate with region servers, it has to approach ZooKeeper first.
 Services provided by ZooKeeper
1. Maintains Configuration information
2. Provides distributed synchronization
3. Client Communication establishment with region servers
4. Provides ephemeral nodes, which represent different region servers
5. Tracks server failures and network partitions
HDFS:
 HDFS is the Hadoop Distributed File System; as the name implies, it provides a distributed environment for storage, and it is a file system designed to run on commodity hardware. It stores each file in multiple blocks, and to maintain fault tolerance, the blocks are replicated across the Hadoop cluster.



Installation:
 Download HBase Software from an Apache Download Mirror and unpack it on your local file
system.
% tar xzf hbase-x.y.z.tar.gz
 As with Hadoop, first we need to tell HBase where Java is located on our system. If we have
the JAVA_HOME environment variable set to point to a suitable Java installation, then that
will be used, and we don’t have to configure anything further.
 Otherwise, we can set the Java installation that HBase uses by editing HBase’s conf/hbase-
env.sh file and specifying the JAVA_HOME variable.
 For convenience, add the HBase binary directory to your command-line path.
% export HBASE_HOME=~/sw/hbase-x.y.z
% export PATH=$PATH:$HBASE_HOME/bin
 To start a standalone instance of HBase that uses a temporary directory on the local file system
for persistence, use this:
% start-hbase.sh
 In standalone mode, the HBase master, the regionserver, and a ZooKeeper instance are all run
in the same JVM.
 To administer your HBase instance, launch the HBase shell as follows:
% hbase shell
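 Once the shell is running, the status and version commands give a quick health check (the exact output varies by installation):
hbase(main):001:0> status
hbase(main):002:0> version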
Clients:
 There are a number of client options for interacting with an HBase cluster. The following are the clients we can use to access the HBase cluster:
1. The HBase shell
2. Kundera – the object mapper
3. The REST client
4. The Thrift client
 1. HBase Shell: The easiest way to access HBase is using the command-line interface called the
HBase shell.
 The HBase shell is based on the Java Virtual Machine-based implementation of Ruby (JRuby)
and can be used to connect to local or remote servers for interaction.
 It also provides both client and administrative operations. The HBase shell, the default HBase
tool that comes with the HBase installation, can be launched as follows:
$HBASE_HOME/bin/hbase shell
 2. Kundera – the object mapper: In order to start using HBase in your Java application with
minimal learning, we can use one of the popular open source API named Kundera, which is a
JPA 2.1 compliant object mapper.



 Kundera is a polyglot object mapper for NoSQL, as well as RDBMS data stores. It is a single
high-level Java API that supports eight NoSQL data stores.
 The idea behind Kundera is to make working with NoSQL databases drop-dead simple and
fun. Kundera supports cross-datastore persistence, that is, it supports polyglot persistence
between supported NoSQL datastores and RDBMS.
 This means you can store and fetch related entities in different datastores using a single
method call. It manages transactions beautifully and supports both Entity Transaction and
Java Transaction API (JTA).
 The following are the three possible ways to start using Kundera:
 Using Kundera Binaries: Kundera binaries are available at
https://oss.sonatype.org/content/repositories/releases/com/impetus/kundera/.
 Using it as a Maven dependency: If you have a Maven project, add the Kundera repository and dependency to your project's pom.xml file to include Kundera in your project, and then build it:
mvn clean install -Dfile src/pom.xml
 3, 4. REST and Thrift: HBase ships with REST and Thrift interfaces. These are useful when the interacting application is written in a language other than Java.
 In both cases, a Java server hosts an instance of the HBase client brokering REST and Thrift
application requests into and out of the HBase cluster.
 For a REST-based client to make a connection to an HBase cluster, first start the REST server on
the HBase Master as follows:
$hbase rest
 For a Thrift-based client to make a connection to a HBase cluster, first start the Thrift server
on the HBase Master as follows:

$hbase thrift
Building an Online Query Application:
 HDFS and MapReduce are powerful tools for processing batch operations over large datasets, but they do not provide ways to read or write individual records efficiently. We can overcome these drawbacks using HBase.
 To implement the online query application, we will use the HBase Java API directly. Here it
becomes clear how important your choice of schema and storage format is.
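 A minimal sketch of using the HBase Java API (not the full application): it writes one cell to the emp table created below and reads it back. The class name EmpClient is made up, and it assumes the HBase client library and an hbase-site.xml pointing at the cluster are on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EmpClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();       // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("emp"))) {
      // Write one cell: row key '1', column family 'personal data', column 'name'
      Put put = new Put(Bytes.toBytes("1"));
      put.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
      table.put(put);
      // Read the same cell back
      Result result = table.get(new Get(Bytes.toBytes("1")));
      byte[] value = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));             // prints raju
    }
  }
}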
Creating a Table using HBase Shell:
 We can create a table using the create command; here we must specify the table name and the column family name. The syntax to create a table in the HBase shell is shown below.

create '<table name>','<column family>'



 Example:Given below is a sample schema of a table named emp. It has two column families:
“personal data” and “professional data”.

Row key | personal data | professional data

 You can create this table in HBase shell as shown below.

hbase(main):002:0> create 'emp', 'personal data', 'professional data'

 And it will give you the following output.

0 row(s) in 1.1300 seconds


=> Hbase::Table - emp

 Verification: We can verify whether the table is created using the list command as shown
below. Here we can observe the created emp table.

hbase(main):002:0> list
TABLE
emp
2 row(s) in 0.0340 seconds
Inserting Data using HBase Shell:
 To create data in an HBase table, the following commands and methods are used:
1. put command,
2. add() method of the Put class, and
3. put() method of the HTable class.
Using the put command, we can insert rows into a table. Its syntax is as follows:
put '<table name>','row1','<colfamily:colname>','<value>'
Inserting the First Row:
 Let us insert the first row values into the emp table as shown below.
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','professional data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):007:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds



Reading Data using HBase Shell:
 The get command and the get() method of HTable class are used to read data from a table in
HBase. Using get command, you can get a single row of data at a time. Its syntax is as follows:
get '<table name>','row1'
 Example: The following example shows how to use the get command. Let us scan the first row
of the emp table.
hbase(main):012:0> get 'emp', '1'
 Output: COLUMN CELL
personal : city timestamp = 1417521848375, value = hyderabad
personal : name timestamp = 1417521785385, value = raju
professional: designation timestamp = 1417521885277, value = manager
professional: salary timestamp = 1417521903862, value = 50000
4 row(s) in 0.0270 seconds
Reading a Specific Column:
 Given below is the syntax to read a specific column using the get method.

hbase> get 'table name', 'rowid', {COLUMN => 'column family:column name'}


 Example: Given below is the example to read a specific column in HBase table.
hbase(main):015:0> get 'emp', 'row1', {COLUMN => 'personal:name'}
 Output: COLUMN CELL
personal:name timestamp = 1418035791555, value = raju
1 row(s) in 0.0080 seconds
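Deleting a Row using HBase Shell:
 The count example below refers to deleting the first row; a minimal sketch of doing this uses the shell's deleteall command, which removes all cells in a given row (assuming the emp table and the row key '1' used above; the prompt number and timing are illustrative):
hbase(main):020:0> deleteall 'emp','1'
0 row(s) in 0.0100 seconds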
Count:
 You can count the number of rows of a table using the count command. Its syntax is as
follows:
count '<table name>'
 After deleting the first row, the emp table will have two rows. Verify this as shown below.
hbase(main):023:0> count 'emp'
2 row(s) in 0.090 seconds
⇒2
Truncate:
 This command disables, drops, and recreates a table. The syntax of truncate is as follows:
hbase> truncate 'table name'
 Example: Given below is the example of truncate command. Here we have truncated the emp
table.
hbase(main):011:0> truncate 'emp'



 Output: Truncating 'emp' table (it may take a while):
- Disabling table...
- Truncating table...
0 row(s) in 1.5950 seconds
 After truncating the table, use the scan command to verify; we will get a table with zero rows.
hbase(main):017:0> scan 'emp'
ROW COLUMN + CELL
0 row(s) in 0.3110 seconds
Updating Data using HBase Shell:
 You can update an existing cell value using the put command. To do so, just follow the same
syntax and mention your new value as shown below.
put 'table name','row','column family:column name','new value'
 The newly given value replaces the existing value, updating the row.
 Example: Suppose there is a table in HBase called emp with the following data.
hbase(main):003:0> scan 'emp'
ROW COLUMN + CELL
row1 column = personal:name, timestamp = 1418051555, value = raju
row1 column = personal:city, timestamp = 1418275907, value = Hyderabad
row1 column = professional:designation, timestamp = 14180555,value = manager
row1 column = professional:salary, timestamp = 1418035791555,value = 50000
1 row(s) in 0.0100 seconds
 The following command will update the city value of the employee named ‘Raju’ to Delhi.
hbase(main):002:0> put 'emp','row1','personal:city','Delhi'
0 row(s) in 0.0400 seconds
 The updated table looks as follows where you can observe the city of Raju has been
changed to ‘Delhi’.
hbase(main):003:0> scan 'emp'
 Output: ROW COLUMN + CELL
row1 column = personal:name, timestamp = 1418035791555, value = raju
row1 column = personal:city, timestamp = 1418274645907, value = Delhi
row1 column = professional:designation, timestamp = 141857555,value = manager
row1 column = professional:salary, timestamp = 1418039555, value = 50000
1 row(s) in 0.0100 seconds
