BDA Unit-5

UNIT-5

Hive: Installing Hive, Running Hive, Comparison with traditional Databases, HiveQL, Tables, Querying Data.

Spark: Installing Spark, Resilient Distributed Datasets, Shared Variables, Anatomy of a Spark Job Run.

HBase: HBasics, Installation, clients, Building an Online Query Application.

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop
to summarize Big Data, and makes querying and analyzing easy.

Initially, Hive was developed by Facebook; later, the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive. It is used by many companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates

Features of Hive

• It stores the schema in a database and the processed data in HDFS.


• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.

Installing Hive and Running Hive

To configure Apache Hive, first you need to download and unzip Hive. Then you need to customize the
following files and settings:

• Edit .bashrc file


• Edit hive-config.sh file
• Create Hive directories in HDFS
• Configure hive-site.xml file
• Initiate Derby database

Step 1: Download and Untar Hive


Visit the Apache Hive official download page and determine which Hive version is best suited for your
Hadoop edition. Once you establish which version you need, select the Download a Release Now! option.

The mirror link on the subsequent page leads to the directories containing available Hive tar packages. This
page also provides useful instructions on how to validate the integrity of files retrieved from mirror sites.

The Ubuntu system presented in this guide already has Hadoop 3.2.1 installed. This Hadoop version is
compatible with the Hive 3.1.2 release.
Select the apache-hive-3.1.2-bin.tar.gz file to begin the download process.
Alternatively, access your Ubuntu command line and download the compressed Hive files using the wget command followed by the download path:

$wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

Once the download process is complete, untar the compressed Hive package:

$tar xzf apache-hive-3.1.2-bin.tar.gz

The Hive binary files are now located in the apache-hive-3.1.2-bin directory.

Step 2: Configure Hive Environment Variables (bashrc)

The $HIVE_HOME environment variable needs to direct the client shell to the apache-hive-3.1.2-
bin directory. Edit the .bashrc shell configuration file using a text editor of your choice (we will be using
nano):

$sudo nano .bashrc

Append the following Hive environment variables to the .bashrc file:

export HIVE_HOME="/home/hdoop/apache-hive-3.1.2-bin"


export PATH=$PATH:$HIVE_HOME/bin

The Hadoop environment variables are located within the same file.

Save and exit the .bashrc file once you add the Hive variables. Apply the changes to the current environment
with the following command:

$source ~/.bashrc

Step 3: Edit hive-config.sh file

Apache Hive needs to be able to interact with the Hadoop Distributed File System. Access the hive-
config.sh file using the previously created $HIVE_HOME variable:

$sudo nano $HIVE_HOME/bin/hive-config.sh

Add the HADOOP_HOME variable and the full path to your Hadoop directory:

export HADOOP_HOME=/home/hdoop/hadoop-3.2.1
Save the edits and exit the hive-config.sh file.

Step 4: Create Hive Directories in HDFS

Create two separate directories to store data in the HDFS layer:

• The temporary, tmp directory is going to store the intermediate results of Hive processes.
• The warehouse directory is going to store the Hive related tables.

Create tmp Directory

Create a tmp directory within the HDFS storage layer. This directory is going to store the intermediary data
Hive sends to the HDFS:

$hdfs dfs -mkdir /tmp

Add write and execute permissions to tmp group members:

$hdfs dfs -chmod g+w /tmp

Check if the permissions were added correctly:

$hdfs dfs -ls /

The output confirms that users now have write and execute permissions.

Create warehouse Directory

Create the warehouse directory within the /user/hive/ parent directory:


$hdfs dfs -mkdir -p /user/hive/warehouse

Add write and execute permissions to warehouse group members:

hdfs dfs -chmod g+w /user/hive/warehouse

Check if the permissions were added correctly:

hdfs dfs -ls /user/hive

The output confirms that users now have write and execute permissions.

Step 5: Configure hive-site.xml File (Optional)

Apache Hive distributions contain template configuration files by default. The template files are located within
the Hive conf directory and outline default Hive settings.

Use the following command to locate the correct file:

$cd $HIVE_HOME/conf

List the files contained in the folder using the ls command.

Use the hive-default.xml.template to create the hive-site.xml file:


$cp hive-default.xml.template hive-site.xml

Access the hive-site.xml file using the nano text editor:

sudo nano hive-site.xml

Using Hive in a stand-alone mode rather than in a real-life Apache Hadoop cluster is a safe option for
newcomers. You can configure the system to use your local storage rather than the HDFS layer by setting
the hive.metastore.warehouse.dir parameter value to the location of your Hive warehouse directory.
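For example, to point the hive.metastore.warehouse.dir parameter at the warehouse directory created in Step 4, the property entry in hive-site.xml might look like the following sketch (adjust the value if you use local storage instead of HDFS):

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>Location of the Hive warehouse directory</description>
</property>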

Step 6: Initiate Derby Database

Apache Hive uses the Derby database to store metadata. Initiate the Derby database, from the
Hive bin directory using the schematool command:

$HIVE_HOME/bin/schematool -dbType derby -initSchema

The process can take a few moments to complete.


Derby is the default metadata store for Hive. If you plan to use a different database solution, such as MySQL
or PostgreSQL, you can specify a database type in the hive-site.xml file.

How to Fix guava Incompatibility Error in Hive

If the Derby database does not successfully initiate, you might receive an error with the following content:

“Exception in thread “main” java.lang.NoSuchMethodError:


com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V”

This error indicates that there is most likely an incompatibility issue between Hadoop and
Hive guava versions.

Locate the guava jar file in the Hive lib directory:

$ls $HIVE_HOME/lib

Locate the guava jar file in the Hadoop lib directory as well:

$ls $HADOOP_HOME/share/hadoop/hdfs/lib
The two listed versions are not compatible and are causing the error. Remove the existing guava file from the
Hive lib directory:

$rm $HIVE_HOME/lib/guava-19.0.jar

Copy the guava file from the Hadoop lib directory to the Hive lib directory:

$cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

Use the schematool command once again to initiate the Derby database:

$HIVE_HOME/bin/schematool -dbType derby -initSchema

Launch Hive Client Shell on Ubuntu

Start the Hive command-line interface using the following commands:

cd $HIVE_HOME/bin
hive

You are now able to issue SQL-like commands and directly interact with HDFS.

Conclusion
You have successfully installed and configured Hive on your Ubuntu system. Use HiveQL to query and manage your Hadoop distributed storage and perform SQL-like tasks. Your Hadoop cluster now has an easy-to-use gateway for running SQL-like queries over data that was previously accessible only through MapReduce programs.

The Hive Shell:

The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL.

When starting Hive for the first time, we can check that it is working by listing its tables —there should be
none. The command must be terminated with a semicolon to tell Hive to execute it:

hive> SHOW TABLES;

OK

Time taken: 0.473 seconds

You can also run the Hive shell in noninteractive mode. The -f option runs the commands in the specified file,
which is script.q in this example:

$ hive -f script.q

For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is
not required:

$hive -e 'SELECT * FROM dummy'

OK

Time taken: 1.22 seconds, Fetched: 1 row(s)

In both interactive and noninteractive mode, Hive will print information to standard error—such as the time
taken to run a query—during the course of operation. You can suppress these messages using the -S option at
launch time, which has the effect of showing only the output result for queries:

% hive -S -e 'SELECT * FROM dummy'

Hive Architecture:
Hive Consists of Mainly 3 core parts

1. Hive Clients
2. Hive Services
3. Hive Storage and Computing

Hive Clients:

Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.

For Java applications, it provides JDBC drivers, and for other types of applications it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.

Hive Services:

Client interactions with Hive are performed through Hive Services. If a client wants to perform any query-related operation in Hive, it has to communicate through Hive Services.

The CLI (command-line interface) acts as a Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and then with the main driver in the Hive services, as shown in the architecture diagram above.

The driver present in the Hive services is the main driver, and it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver passes requests from the different applications to the metastore and file systems for further processing.

Hive Storage and Computing:

Hive services such as the metastore, file system, and job client in turn communicate with Hive storage and perform the following actions:
• Metadata information for tables created in Hive is stored in the Hive "meta storage database".
• Query results and data loaded into the tables are stored on the Hadoop cluster in HDFS.

Job execution flow:

From the diagram above we can understand the job execution flow in Hive with Hadoop.

The data flow in Hive behaves in the following pattern;

1. The query is executed from the UI (User Interface).

2. The driver interacts with the compiler to get the plan. (Here the plan refers to the query execution process and the gathering of its related metadata.)
3. The compiler creates the plan for the job to be executed and communicates with the metastore to request metadata.
4. The metastore sends the metadata information back to the compiler.
5. The compiler communicates with the driver, proposing the plan to execute the query.
6. The driver sends the execution plan to the execution engine.
7. The execution engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS operations:
• The EE first contacts the NameNode and then the DataNodes to get the values stored in the tables.
• The EE fetches the desired records from the DataNodes. The actual table data resides only on the DataNodes; from the NameNode it fetches only the metadata needed for the query.
• It collects the actual data from the DataNodes related to the query.
• The EE communicates bidirectionally with the metastore in Hive to perform DDL (Data Definition Language) operations such as CREATE, DROP, and ALTER on tables and databases. The metastore stores only information such as database names, table names, and column names.
• The EE in turn communicates with Hadoop daemons such as the NameNode, DataNodes, and JobTracker to execute the query on top of the Hadoop file system.

8. Fetching results: the driver fetches the results from the execution engine.

9. Sending results: once the results are fetched from the DataNodes, the EE sends them back to the driver and then to the UI (front end).

Hive is continuously in contact with the Hadoop file system and its daemons via the execution engine. The dotted arrow in the job flow diagram shows the execution engine's communication with the Hadoop daemons.

The Metastore:

What is Hive Metastore?

Metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (like their
schema and location) and partitions in a relational database. It provides client access to this information by
using metastore service API.
Hive metastore consists of two fundamental units:

1. A service that provides metastore access to other Apache Hive services.


2. Disk storage for the Hive metadata which is separate from HDFS storage.
Hive Metastore Modes

There are three modes for Hive Metastore deployment:

• Embedded Metastore
• Local Metastore
• Remote Metastore
Let’s now discuss the above three Hive Metastore deployment modes one by one-

i. Embedded Metastore
In Hive, by default, the metastore service runs in the same JVM as the Hive service and uses an embedded Derby database stored on the local file system. Thus both the metastore service and the Hive service run in the same JVM using the embedded Derby database.
However, this mode has a limitation: because only one embedded Derby database can access the database files on disk at a time, only one Hive session can be open at a time.

ii. Local Metastore


Hive is a data-warehousing framework, so a single session is usually not enough. To overcome this limitation of the embedded metastore, the local metastore was introduced. This mode allows many Hive sessions, i.e., many users can use the metastore at the same time.
This is achieved by using any JDBC-compliant database, such as MySQL, which runs in a separate JVM or on a different machine from the Hive service and metastore service (which still run together in the same JVM).

This configuration is called as local metastore because metastore service still runs in the same process as the
Hive. But it connects to a database running in a separate process, either on the same machine or on a remote
machine.

Before starting the Apache Hive client, add the JDBC/ODBC driver libraries to the Hive lib folder.
MySQL is a popular choice for the standalone metastore. In this case,
the javax.jdo.option.ConnectionURL property is set
to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true,
and javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. The JDBC driver JAR file for
MySQL (Connector/J) must be on Hive's classpath, which is achieved by placing it in Hive's lib directory.
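As an illustration only, a local metastore backed by MySQL might be configured in hive-site.xml roughly as follows; the host, database name, and credentials are placeholders:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysqlhost/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>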
iii. Remote Metastore
Moving further, another metastore configuration is called the remote metastore. In this mode, the metastore runs in its own separate JVM, not in the Hive service JVM. Other processes communicate with the metastore server using Thrift network APIs.
We can also run more than one metastore server in this case to provide higher availability. This also brings better manageability/security, because the database tier can be completely firewalled off, and clients no longer need to share database credentials with every Hive user in order to access the metastore database.

To use this remote metastore, you should configure Hive service by setting hive.metastore.uris to the
metastore server URI(s). Metastore server URIs are of the form thrift://host:port, where the port corresponds to
the one set by METASTORE_PORT when starting the metastore server.
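As a sketch, pointing a Hive service at a remote metastore might look like this in hive-site.xml; the hostname is a placeholder, and 9083 is the conventional default metastore port:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>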
Databases Supported by Hive

Hive supports 5 backend databases which are as follows:

• Derby
• MySQL
• MS SQL Server
• Oracle
• Postgres

Comparison with Traditional Databases

➢ The differences between Hive and an RDBMS (traditional relational databases): below are the key features of Hive that differ from an RDBMS.
➢ Hive resembles a traditional database by supporting a SQL interface, but it is not a full database. Hive is better described as a data warehouse than as a database.
➢ Hive enforces schema on read, whereas an RDBMS enforces schema on write. In an RDBMS, a table's schema is enforced at data load time; if the data being loaded doesn't conform to the schema, it is rejected. This design is called schema on write.
➢ Hive doesn't verify the data when it is loaded, but rather when a query is issued. This is called schema on read. Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database's internal format; the load operation is just a file copy or move.
➢ Schema on write makes query time performance faster, since the database can index columns and
perform compression on the data but it takes longer to load data into the database.
➢ Hive is based on the notion of Write once, Read many times but RDBMS is designed for Read and
Write many times.
➢ In an RDBMS, record-level updates, insertions and deletes, transactions, and indexes are possible, whereas these are not allowed in Hive, because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table.
➢ In an RDBMS, the maximum data size allowed is typically in the tens of terabytes, whereas Hive can handle hundreds of petabytes easily.
➢ As Hadoop is a batch-oriented system, Hive doesn't support OLTP (Online Transaction Processing); it is closer to OLAP (Online Analytical Processing), but not ideal, since there is significant latency between issuing a query and receiving a reply, due to the overhead of MapReduce jobs and the size of the data sets Hadoop was designed to serve.
➢ RDBMS is best suited for dynamic data analysis and where fast responses are expected but Hive is suited
for data warehouse applications, where relatively static data is analyzed, fast response times are not
required, and when the data is not changing rapidly.
➢ To overcome the limitations of Hive, HBase is being integrated with Hive to support record level
operations and OLAP.
➢ Hive scales easily and at low cost, whereas an RDBMS is much harder to scale and is very costly to scale up.

HiveQL

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a
Metastore.
Data Types

Hive supports both primitive and complex data types.

Primitives include numeric, Boolean, string, and timestamp types.

The complex data types include arrays, maps, and structs.

Hive's primitive and complex data types are summarized below (the full reference table is not reproduced here).

a. The literal forms for arrays, maps, structs, and unions are provided as functions; that is, array, map, struct, and create_union are built-in Hive functions.

b. In struct literals created this way, the columns are named col1, col2, col3, etc.


Complex types

Hive has four complex types: ARRAY, MAP, STRUCT, and UNION. ARRAY and MAP are like their
namesakes in Java, whereas a STRUCT is a record type that encapsulates a set of named fields. A UNION
specifies a choice of data types; values must match exactly one of these types.

Array in Hive is an ordered sequence of similar type elements that are indexable using the zero-based integers.

Arrays in Hive are similar to the arrays in JAVA.

array<datatype>

Example: array(‘Data’,’Flair’). The second element is accessed as array[1].


Map in Hive is a collection of key-value pairs, where the fields are accessed using array notations of keys (e.g.,
[‘key’]).

map<primitive_type, data_type>

Example: ‘first’ -> ‘John’, ‘last’ -> ‘Deo’, represented as map(‘first’, ‘John’, ‘last’, ‘Deo’). Now ‘John’ can be
accessed with map[‘first’].
STRUCT in Hive is similar to the STRUCT in C language. It is a record type that encapsulates a set of named
fields, which can be any primitive data type.

We can access the elements in STRUCT type using DOT (.) notation.

STRUCT <col_name : data_type [ COMMENT col_comment], ...>

Example: For a column c3 of type STRUCT {c1 INTEGER; c2 INTEGER}, the c1 field is accessed by the
expression c3.c1.

UNION type in Hive is similar to the UNION in C. UNION types at any point of time can hold exactly one
data type from its specified data types.

The full support for UNIONTYPE data type in Hive is still incomplete.

UNIONTYPE<data_type, data_type, ...>

Complex types permit an arbitrary level of nesting. Complex type declarations must specify the type of the
fields in the collection, using an angled bracket notation, as illustrated in this table definition with three
columns (one for each complex type):

CREATE TABLE complex (


c1 ARRAY<INT>,

c2 MAP<STRING, INT>,

c3 STRUCT<a:STRING, b:INT, c:DOUBLE>,

c4 UNIONTYPE<STRING, INT>

);

If we load the table with one row of data for the ARRAY, MAP, STRUCT, and UNION columns (using the literal forms described above), the following query demonstrates the field accessor operators for each type:

hive> SELECT c1[0], c2['b'], c3.c, c4 FROM complex;

1 2 1.0 {1:63}

Operators and Functions

The usual set of SQL operators is provided by Hive: relational operators (such as x = 'a' for testing equality, x
IS NULL for testing nullity, and x LIKE 'a%' for pattern matching), arithmetic operators (such as x + 1 for
addition), and logical operators (such as x OR y for logical OR). The operators match those in MySQL, which
deviates from SQL-92 because || is logical OR, not string concatenation. Use the concat function for the latter
in both MySQL and Hive.

Hive comes with a large number of built-in functions—too many to list here—divided into categories that
include mathematical and statistical functions, string functions, date functions (for operating on string
representations of dates), conditional functions, aggregate functions, and functions for working with XML
(using the xpath function) and JSON.

You can retrieve a list of functions from the Hive shell by typing SHOW FUNCTIONS. To get brief usage instructions for a particular function, use the DESCRIBE command:

hive> DESCRIBE FUNCTION length;

length(str | binary) - Returns the length of str or number of bytes in binary data

Conversions

Primitive types form a hierarchy that dictates the implicit type conversions Hive will perform in function and
operator expressions.
For example, a TINYINT will be converted to an INT if an expression expects an INT; however, the reverse
conversion will not occur, and Hive will return an error unless the CAST operator is used.

The implicit conversion rules can be summarized as follows.

Any numeric type can be implicitly converted to a wider type, or to a text type (STRING, VARCHAR,
CHAR). All the text types can be implicitly converted to another text type. Perhaps surprisingly, they can also
be converted to DOUBLE or DECIMAL.

BOOLEAN types cannot be implicitly converted to any other type in expressions.

TIMESTAMP and DATE can be implicitly converted to a text type.

You can perform explicit type conversion using CAST. For example, CAST('1' AS INT) will convert the string
'1' to the integer value 1. If the cast fails—as it does in CAST('X' AS INT), for example—the expression
returns NULL.

Tables:

➢ A Hive table is logically made up of the data being stored and the associated metadata describing the layout
of the data in the table. The data typically resides in HDFS, although it may reside in any Hadoop
filesystem, including the local file system or S3.
➢ Hive stores the metadata in a relational database and not in HDFS.
Managed Tables and External Tables:
➢ When you create a table in Hive, by default Hive will manage the data, which means that Hive moves the
data into its warehouse directory. Alternatively, you may create an external table, which tells Hive to refer to
the data that is at an existing location outside the warehouse directory.
➢ The difference between the two table types is seen in the LOAD and DROP semantics. Let’s consider a
managed table first. When you load data into a managed table, it is moved into Hive’s warehouse
directory.
➢ For example, this:
CREATE TABLE managed_table (dummy STRING);

LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;

➢ This will move the file hdfs://user/tom/data.txt into Hive's warehouse directory for the managed_table table, which is hdfs://user/hive/warehouse/managed_table.
➢ The load operation is very fast because it is just a move or rename within a file system. However, bear in
mind that Hive does not check that the files in the table directory conform to the schema declared for the
table, even for managed tables.
➢ If the table is later dropped, using DROP TABLE managed_table; then the table, including its metadata and its data, is deleted. An external table behaves differently, as sketched below.
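➢ For comparison, here is a minimal sketch of an external table (the location path is illustrative). With EXTERNAL, Hive does not move the data into its warehouse directory at load time, and dropping the table deletes only the metadata, leaving the data untouched:

CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;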
Storage Formats:
➢ There are two dimensions that govern table storage in Hive: the row format and the file format. The row format dictates how rows, and the fields in a particular row, are stored. In Hive, the row format is defined by a SerDe (Serializer-Deserializer).

➢ When acting as a deserializer, which is the case when querying a table, it will deserialize a row of data
from the bytes in the file to objects used internally by Hive to operate on that row of data.
➢ When used as a serializer, which is the case when performing an INSERT or CTAS, it will serialize
Hive’s internal representation of a row of data into the bytes that are written to the output file.
➢ The simplest format is a plain-text file, but there are row-oriented and column-oriented binary formats
available too.
➢ The default storage format is delimited text; an explicit example is sketched below.
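As a sketch only (table and column names are illustrative; Hive's actual default field delimiter is the Ctrl-A character, '\001'), a table with an explicit delimited text row format can be declared like this:

CREATE TABLE delimited_example (id INT, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;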
CREATE TABLE...AS SELECT:

➢ It’s very convenient to store the output of a Hive query in a new table, perhaps because it is too large to be
dumped to the console or because there are further processing steps to carry out on the result.
➢ The new table’s column definitions are derived from the columns retrieved by the SELECT clause. In the
following query, the target table has two columns named col1 and col2 whose types are the same as the
ones in the source table:
CREATE TABLE target
AS
SELECT col1, col2
FROM source;

➢ A CTAS operation is atomic, so if the SELECT query fails for some reason, the table is not created.
Altering Tables:
➢ Because Hive uses the schema-on-read approach, it’s flexible in permitting a table’s definition to
change after the table has been created. You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;

➢ In addition to updating the table metadata, ALTER TABLE moves the underlying table directory so that it
reflects the new name. Hive allows you to change the definition for columns, add new columns, or even
replace all existing columns in a table with a new set.
➢ For example, consider adding a new column:
ALTER TABLE target ADD COLUMNS (col3 STRING);

➢ The new column col3 is added after the existing (nonpartition) columns.
Dropping Tables:

➢ The DROP TABLE statement deletes the data and metadata for a table. In the case of external tables, only the
metadata is deleted; the data is left untouched. If you want to delete all the data in a table but keep the table
definition, use TRUNCATE TABLE.
➢ For example: TRUNCATE TABLE my_table;
➢ This doesn’t work for external tables; instead, use dfs -rmr (from the Hive shell) to remove the external
table directory directly.
➢ if you want to create a new, empty table with the same schema as another table, then use the LIKE
keyword:
CREATE TABLE new_table LIKE existing_table;

Querying Data

➢ This section shows how to use various forms of the SELECT statement to retrieve data from Hive.
Sorting and Aggregating:
➢ Sorting data in Hive can be achieved by using a standard ORDER BY clause, which performs a total sort of the input. When a globally sorted result is not required, Hive's nonstandard SORT BY (which produces a sorted file per reducer) can be combined with DISTRIBUTE BY to control which reducer each row goes to, as in the following example:
hive> FROM records2
    > SELECT year, temperature
    > DISTRIBUTE BY year
    > SORT BY year ASC, temperature DESC;
1949    111
1949    78
1950    22
1950    0
1950    -11

Joins:

➢ The simplest kind of join is the inner join, where each match in the input tables results in a row in the
output. Consider two small demonstration tables, sales (which lists the names of people and the IDs of the
items they bought) and things (which lists the item IDs and their names):
hive> SELECT * FROM sales;
Joe   2
Hank  4
Ali   0
Eve   3
Hank  2

hive> SELECT * FROM things;
2  Tie
4  Coat
3  Hat
1  Scarf

➢ We can perform an inner join on the two tables as follows:


hive> SELECT sales.*, things.*
    > FROM sales JOIN things ON (sales.id = things.id);
Joe   2  2  Tie
Hank  4  4  Coat
Eve   3  3  Hat
Hank  2  2  Tie
Spark
What is Spark?
➢ Apache Spark is a cluster computing framework for large-scale data processing. Spark does not
use MapReduce as an execution engine; instead, it uses its own distributed runtime for
executing work on a cluster.
➢ Spark is closely integrated with Hadoop: it can run on YARN and works with Hadoop file
formats and storage backends like HDFS.
Why do we need Spark in big data?
➢ Simply put, Spark is a fast and general engine for large-scale data processing. Fast means that it is faster than previous approaches to working with Big Data, such as classical MapReduce.
➢ The secret to being faster is that Spark runs in memory (RAM), which makes the processing much faster than on disk drives.
Architecture of Spark:

➢ Spark architecture is well-layered and integrated with other libraries, making it easier to use. It is a master/slave architecture and has two main daemons: the master daemon and the worker daemon.

Fig: Spark Architecture

➢ In the master node, we have the driver program, which drives our application. The code we write behaves as the driver program, or, if we are using the interactive shell, the shell acts as the driver program.
➢ Inside the driver program, the first thing we do is, we create a Spark Context. Assume that
the Spark context is a gateway to all the Spark functionalities. It is similar to your database
connection.
➢ Any command we execute in our database goes through the database connection. Likewise,
anything we do on Spark goes through Spark context.
➢ Now, this Spark context works with the cluster manager to manage various jobs. The driver
program & Spark context takes care of the job execution within the cluster.
➢ A job is split into multiple tasks which are distributed over the worker node. Anytime an
RDD is created in Spark context, it can be distributed across various nodes and can be cached
there.
➢ Worker nodes are the slave nodes whose job is to execute the tasks. The tasks are executed on the partitioned RDDs in the worker nodes, and the results are returned to the Spark context.
➢ The Spark context takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform operations, and collect the results, which are returned to the main Spark context.
➢ If we increase the number of workers, then we can divide jobs into more partitions and
execute them parallelly over multiple systems. It will be a lot faster.
➢ With the increase in the number of workers, memory size will also increase & we can cache
the jobs to execute it faster.
➢ The workflow of the Spark architecture can be summarized in the steps below:

➢ STEP 1: The client submits the Spark user application code. When an application code is submitted, the driver implicitly converts the user code that contains transformations and actions into a logical directed acyclic graph (DAG). At this stage, it also performs optimizations such as pipelining transformations.
➢ STEP 2: After that, it converts the logical graph called DAG into physical execution plan with
many stages. After converting into a physical execution plan, it creates physical execution
units called tasks under each stage. Then the tasks are bundled and sent to the cluster.
➢ STEP 3: Now the driver talks to the cluster manager and negotiates the resources. Cluster
manager launches executors in worker nodes on behalf of the driver. At this point, the driver
will send the tasks to the executors based on data placement. When executors start, they
register themselves with drivers. So, the driver will have a complete view of executors that
are executing the task.
➢ STEP 4: During the course of execution of tasks, driver program will monitor the set of
executors that runs. Driver node also schedules future tasks based on data placement.
➢ This architecture is further integrated with various extensions and libraries. Apache Spark
Architecture is based on two main abstractions:
1. Resilient Distributed Dataset (RDD)
2. Directed Acyclic Graph (DAG)
Resilient Distributed Dataset (RDD):

➢ RDDs are the building blocks of any Spark application. RDDs Stands for:
Resilient: Fault tolerant and is capable of rebuilding data on failure
Distributed: Distributed data among the multiple nodes in a cluster
Dataset: Collection of partitioned data with values

➢ It is a layer of abstracted data over the distributed collection. It is immutable in nature and
follows lazy transformations. The data in an RDD is split into chunks based on a key.
➢ RDDs are highly resilient, i.e., they are able to recover quickly from any issues as the same
data chunks are replicated across multiple executor nodes.
➢ Thus, even if one executor node fails, another will still process the data. This allows you to
perform your functional calculations against your dataset very quickly by harnessing the
power of multiple nodes.
➢ Moreover, once you create an RDD it becomes immutable. Immutable means, an object
whose state cannot be modified after it is created, but they can surely be transformed.
➢ Talking about the distributed environment, each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster.
➢ Due to this, we can perform transformations or actions on the complete data parallelly. Also,
you don’t have to worry about the distribution, because Spark takes care of that.
Workflow of RDD
➢ There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, etc. With RDDs, you can perform two types of operations (a short sketch follows this list):
1. Transformations: operations that are applied to create a new RDD.
2. Actions: operations applied on an RDD that instruct Apache Spark to apply the computation and pass the result back to the driver.
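A minimal sketch in the Spark shell (sc is the SparkContext provided by the shell; the numbers are illustrative):

val nums = sc.parallelize(List(1, 2, 3, 4, 5)) // create an RDD by parallelizing a collection
val squares = nums.map(x => x * x)             // transformation: lazy, just records the operation
val total = squares.reduce(_ + _)              // action: triggers the computation and returns 55 to the driver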

Features of Apache Spark:

➢ 1. Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. It is also able to achieve this speed through controlled partitioning.
➢ 2. Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities.
➢ 3. Deployment: It can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager.
➢ 4. Real-Time: It offers real-time computation and low latency because of in-memory computation.
➢ 5. Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written in any of these four languages. It also provides a shell in Scala and Python.

Installing Spark

➢ Download a stable release of the Spark binary distribution from the downloads page and
unpack the tarball in a suitable location:
% tar xzf spark-x.y.z-bin-distro.tgz
➢ It’s convenient to put the Spark binaries on your path as follows:
% export SPARK_HOME=~/sw/spark-x.y.z-bin-distro
% export PATH=$PATH:$SPARK_HOME/bin
➢ We’re now ready to run an example in Spark.
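For instance, a minimal interactive session might look like the following sketch; input.txt is a placeholder for any text file:

% spark-shell
scala> val lines = sc.textFile("input.txt")   // sc is created automatically by the shell
scala> lines.count()                          // action: counts the lines in the file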
Resilient Distributed Datasets
RDDs are at the heart of every Spark program.

Creation

There are three ways of creating RDDs: from an in-memory collection of objects (known as parallelizing a
collection), using a dataset from external storage (such as HDFS), or transforming an existing RDD. The first
way is useful for doing CPU-intensive computations on small amounts of input data in parallel. For example, the following runs separate computations on the numbers from 1 to 10:

val params = sc.parallelize(1 to 10)

val result = params.map(performExpensiveComputation)

The second way to create an RDD is by creating a reference to an external dataset. We have already seen how
to create an RDD of String objects for a text file:

val text: RDD[String] = sc.textFile(inputPath)

The third way of creating an RDD is by transforming an existing RDD. We look at transformations next.

Transformations and Actions

Spark provides two categories of operations on RDDs: transformations and actions. A transformation
generates a new RDD from an existing one, while an action triggers a computation on an RDD and does
something with the results—either returning them to the user, or saving them to external storage. Actions have
an immediate effect, but transformations do not—they are lazy, in the sense that they don’t perform any work
until an action is performed on the transformed RDD. For example, the following lowercases lines in a text
file:

val text = sc.textFile(inputPath)

val lower: RDD[String] = text.map(_.toLowerCase())

lower.foreach(println(_))

Persistence

we can cache the intermediate dataset of year-temperature pairs in memory with the following:

scala> tuples.cache()

res1: tuples.type = MappedRDD[4] at map at <console>:18

Calling cache() does not cache the RDD in memory straightaway. Instead, it marks the RDD with a flag
indicating it should be cached when the Spark job is run. So let’s first force a job run:
scala> tuples.reduceByKey((a, b) => Math.max(a, b)).foreach(println(_))

INFO BlockManagerInfo: Added rdd_4_0 in memory on 192.168.1.90:64640

INFO BlockManagerInfo: Added rdd_4_1 in memory on 192.168.1.90:64640

(1950,22)

(1949,111)
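Beyond cache(), which uses the default MEMORY_ONLY level, an RDD can be persisted at other storage levels; a minimal sketch:

import org.apache.spark.storage.StorageLevel

// cache the RDD in serialized form in memory, spilling partitions to disk if they do not fit
tuples.persist(StorageLevel.MEMORY_AND_DISK_SER)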

Serialization

There are two aspects of serialization to consider in Spark: serialization of data and serialization of functions
(or closures).

Data:

Let’s look at data serialization first. By default, Spark will use Java serialization to send data over the network
from one executor to another, or when caching (persisting) data in serialized form as described in “Persistence
levels”. Java serialization is well understood by programmers (you make sure the class you are using
implements java.io.Serializable or java.io.Externalizable), but it is not particularly efficient from a performance
or size perspective.

Functions:

Generally, serialization of functions will “just work”: in Scala, functions are serializable using the standard
Java serialization mechanism, which is what Spark uses to send functions to remote executor nodes. Spark will
serialize functions even when running in local mode, so if you inadvertently introduce a function that is not
serializable (such as one converted from a method on a nonserializable class), you will catch it early on in the
development process.

Shared Variables

➢ In Spark, when a function is passed to a transformation operation, it is executed on a remote cluster node. The function works on separate copies of all the variables used in it.
➢ These variables are copied to each machine, and no updates to the variables on the remote machines are propagated back to the driver program.
Broadcast variable:
➢ The broadcast variables support a read-only variable cached on each machine rather than
providing a copy of it with tasks. Spark uses broadcast algorithms to distribute broadcast
variables for reducing communication cost.
➢ The execution of spark actions passes through several stages, separated by distributed
"shuffle" operations. Spark automatically broadcasts the common data required by tasks
within each stage. The data broadcasted this way is cached in serialized form and deserialized
before running each task.
➢ To create a broadcast variable (say, v), call SparkContext.broadcast(v). Let's understand this with an example.
scala> val v = sc.broadcast(Array(1, 2, 3))
scala> v.value
Accumulator:
➢ The Accumulator is a variable that is used to perform associative and commutative operations

such as counters or sums. The Spark provides support for accumulators of numeric types.
However, we can add support for new types.
➢ To create a numeric accumulator, call SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate values of Long or Double type, respectively.
scala> val a=sc.longAccumulator("Accumulator")
scala> sc.parallelize(Array(2,5)).foreach(x=>a.add(x))
scala> a.value
Anatomy of a Spark Job Run:

➢ What happens when we run a Spark job? At the highest level, there are two independent
entities: the driver, which hosts the application (SparkContext) and schedules tasks for a job;
and the executors, which are exclusive to the application, run for the duration of the
application, and execute the application’s tasks.
➢ Usually the driver runs as a client that is not managed by the cluster manager and the
executors run on machines in the cluster.
1. Job Submission:
➢ A Spark job is submitted automatically when an action (such as count()) is performed on an
RDD. Internally, this causes runJob() to be called on the SparkContext (step 1), which passes
the call on to the scheduler that runs as a part of the driver (step 2).
➢ The scheduler is made up of two parts: a DAG scheduler that breaks down the job into a DAG
of stages, and a task scheduler that is responsible for submitting the tasks from each stage to
the cluster.

How Spark runs a job


2. DAG Construction:
➢ To understand how a job is broken up into stages, we need to look at the type of tasks that can
run in a stage. There are two types: shuffle map tasks and result tasks. The name of the task
type indicates what Spark does with the task’s output:
Shuffle map tasks:
➢ As the name suggests, shuffle map tasks are like the map-side part of the shuffle in
MapReduce.
➢ Each shuffle map task runs a computation on one RDD partition and, based on a partitioning
function, writes its output to a new set of partitions, which are then fetched in a later stage.
Shuffle map tasks run in all stages except the final stage.
Result tasks:
➢ Result tasks run in the final stage that returns the result to the user’s program. Each result task
runs a computation on its RDD partition,then sends the result back to the driver, and the
driver assembles the results from each partition into a final result.
➢ Once the DAG scheduler has constructed the complete DAG of stages, it submits each stage’s
set of tasks to the task scheduler (step 3).
3. Task Scheduling:
➢ When the task scheduler is sent a set of tasks, it uses its list of executors that are running for
the application and constructs a mapping of tasks to executors that takes placement
preferences into account.
➢ Next, the task scheduler assigns tasks to executors that have free cores, and it continues to
assign more tasks as executors finish running tasks, until the task set is complete.
➢ Each task is allocated one core by default, although this can be changed by setting spark.task.cpus.
➢ Assigned tasks are launched through a scheduler backend (step 4), which sends a remote launch task message (step 5) to the executor backend to tell the executor to run the task (step 6).
4. Task Execution:
➢ An executor runs a task as follows (step 7). First, it makes sure that the JAR and file dependencies for the task are up to date.
➢ The executor keeps a local cache of all the dependencies that previous tasks have used, so that it only downloads them when they have changed.
➢ Second, it deserializes the task code from the serialized bytes that were sent as a part of the launch task message.
➢ Third, the task code is executed. Note that tasks are run in the same JVM as the executor, so there is no process overhead for task launch.
➢ Tasks can return a result to the driver. The result is serialized and sent to the executor backend, and then back to the driver as a status update message.
➢ A shuffle map task returns information that allows the next stage to retrieve the output partitions, while a result task returns the value of the result for the partition it ran on, which the driver assembles into a final result to return to the user's program.

HBasics
What is HBase?
➢ HBase is an open-source, column-oriented distributed database system that runs on top of HDFS (Hadoop Distributed File System). It is modeled after Google's Bigtable and is written primarily in Java.
➢ HBase can store massive amounts of data from terabytes to petabytes. The
tables present in HBase consist of billions of rows having millions of
columns. HBase is built for low latency operations, which is having some
specific features compared to traditional relational models.
Why do we need HBase in big data?
➢ Apache HBase is needed for real-time Big Data applications. A table for a
popular web application may consist of billions of rows. If we want to search
a particular row from such a huge amount of data, HBase is the ideal choice
as query fetch time is less. Most of the online analytics applications use
HBase.
➢ Traditional relational data models fail to meet the performance
requirements of very big databases. These performance and processing
limitations can be overcome by Apache HBase.
Apache HBase Features:
1. HBase is built for low latency operations.
2. HBase is used extensively for random read and write operations.
3. HBase stores a large amount of data in terms of tables.
4. Provides linear and modular scalability over cluster environment.
5. Strictly consistent to read and write operations.
6. Automatic and configurable sharding of tables.
7. Easy to use Java API for client access.

HBase Architecture:

➢ HBase architecture consists of following components.


1. HMaster
2. HRegionserver
3. HRegions
4. Zookeeper
5. HDFS
HMaster:
➢ HMaster in HBase is the implementation of a Master server in HBase
architecture. It acts as a monitoring agent to monitor all Region Server
instances present in the cluster and acts as an interface for all the metadata
changes. In a distributed cluster environment, Master runs on NameNode.
Master runs several background threads.
➢ The following are important roles performed by HMaster in HBase.
1. Plays a vital role in terms of performance and maintaining nodes in the
cluster.
2. HMaster provides admin performance and distributes services to different
region servers.
3. HMaster assigns regions to region servers.
4. HMaster has features such as controlling load balancing and failover to handle the load over the nodes present in the cluster.
5. When a client wants to change any schema or perform any metadata operations, HMaster takes responsibility for these operations.
HBase Region Servers:
➢ When an HBase Region Server receives write and read requests from a client, it assigns the request to the specific region where the actual column family resides. The client can contact HRegion servers directly; HMaster's permission is not required for this communication. The client requires HMaster's help only when operations related to metadata and schema changes are needed.
➢ An HRegionServer can host multiple regions and performs the following functions:
1. Hosting and managing regions
2. Splitting regions automatically
3. Handling read and write requests
4. Communicating with the client directly
HBase Regions:
➢ HRegions are the basic building elements of HBase cluster that consists of the
distribution of tables and are comprised of Column families. It contains
multiple stores, one for each column family. It consists of mainly two
components, which are Memstore and Hfile.
ZooKeeper:
➢ HBase ZooKeeper is a centralized monitoring server which maintains configuration information and provides distributed synchronization. Distributed synchronization means accessing the distributed applications running across the cluster, with the responsibility of providing coordination services between nodes. If a client wants to communicate with region servers, the client has to approach ZooKeeper first.
➢ Services provided by ZooKeeper:
1. Maintains configuration information
2. Provides distributed synchronization
3. Establishes client communication with region servers
4. Provides ephemeral nodes, which represent different region servers
5. Tracks server failures and network partitions
HDFS:
➢ HDFS is a Hadoop distributed File System, as the name implies it provides a
distributed environment for the storage and it is a file system designed in a
way to run on commodity hardware. It stores each file in multiple blocks and
to maintain fault tolerance, the blocks are replicated across a Hadoop cluster.
Installation

The prerequisites for HBase installation are Java and Hadoop installed on your Linux machine.

HBase can be installed in three modes: standalone, pseudo-distributed, and fully distributed.

Download a stable release from an Apache Download Mirror and unpack it on your local
filesystem. For example:

% tar xzf hbase-x.y.z.tar.gz

As with Hadoop, you first need to tell HBase where Java is located on your system. If you
have the JAVA_HOME environment variable set to point to a suitable Java installation, then
that will be used, and you don’t have to configure anything further.
Otherwise, you can set the Java installation that HBase uses by editing HBase’s conf/hbase-
env.sh file and specifying the JAVA_HOME variable.

For convenience, add the HBase binary directory to your command-line path.

For example:

% export HBASE_HOME=~/sw/hbase-x.y.z

% export PATH=$PATH:$HBASE_HOME/bin

To get the list of HBase options, use the following:

%hbase

Test Drive

To start a standalone instance of HBase that uses a temporary directory on the local filesystem for persistence, use this:

% start-hbase.sh

To administer your HBase instance, launch the HBase shell as follows:

% hbase shell

HBase Shell; enter 'help' for list of supported commands.

Type "exit" to leave the HBase Shell

Version 0.98.7-hadoop2, r800c23e2207aa3f9bddb7e9514d8340bcfb89277, Wed Oct 8

15:58:11 PDT 2014

hbase(main):001:0>

HBase Commands

A list of HBase commands are given below.

o Create: Creates a new table identified by 'table1' and Column Family identified by
'colf'.
o Put: Inserts a new record into the table with row identified by 'row..'
o Scan: returns the data stored in table
o Get: Returns the records matching the row identifier provided in the table
o Help: Get a list of commands

Syntax:
create 'table1', 'colf'
list 'table1'
put 'table1', 'row1', 'colf:a', 'value1'
put 'table1', 'row1', 'colf:b', 'value2'
put 'table1', 'row2', 'colf:a', 'value3'
scan 'table1'
get 'table1', 'row1'

To create a table named test with a single column family named data using defaults for table
and column family attributes, enter:

hbase(main):001:0> create 'test', 'data'

0 row(s) in 0.9810 seconds

To prove the new table was created successfully, run the list command. This will output all
tables in user space:

hbase(main):002:0> list

TABLE

test

1 row(s) in 0.0260 seconds

To insert data into three different rows and columns in the data column family, get the first
row, and then list the table content, do the following:
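The individual commands are not reproduced in these notes; as a sketch (row keys, column qualifiers, and values are illustrative), the session might look like this:

hbase(main):003:0> put 'test', 'row1', 'data:1', 'value1'
hbase(main):004:0> put 'test', 'row2', 'data:2', 'value2'
hbase(main):005:0> put 'test', 'row3', 'data:3', 'value3'
hbase(main):006:0> get 'test', 'row1'
hbase(main):007:0> scan 'test'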
Shut down your HBase instance by running:

% stop-hbase.sh

Clients

There are a number of client options for interacting with an HBase cluster.

Java: HBase, like Hadoop, is written in Java, and it provides a native Java client API (a sketch of a simple client appears after the list below). For example:
• Most of the HBase classes are found in the org.apache.hadoop.hbase and
org.apache.hadoop.hbase.client packages.
• In this class, we first ask the HBaseConfiguration class to create a Configuration
object. It will return a Configuration that has read the HBase configuration from the
hbase-site.xml and hbase-default.xml files found on the program’s classpath.
• This Configuration is subsequently used to create instances of HBaseAdmin and
HTable.
• HBaseAdmin is used for administering your HBase cluster, specifically for adding
and dropping tables. HTable is used to access a specific table.
• To create a table, we need to create an instance of HBaseAdmin and then ask it to
create the table named test with a single column family named data.
• To operate on a table, we will need an instance of HTable, which we construct by
passing it our Configuration instance and the name of the table. We then create Put
objects in a loop to insert data into the table.
• Next, we create a Get object to retrieve and print the first row that we added. Then we
use a Scan object to scan over the table, printing out what we find. At the end of the
program, we clean up by first disabling the table and then deleting it (recall that a
table must be disabled before it can be dropped).
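The full listing is not reproduced in these notes. The following is a minimal sketch of such a client using the older (0.98-era) HBase Java API described above; table and column names follow the test/data example, and error handling is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SimpleHBaseClient {
  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath

    // Create the 'test' table with a single 'data' column family
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("test"));
    htd.addFamily(new HColumnDescriptor("data"));
    admin.createTable(htd);

    // Insert three rows
    HTable table = new HTable(config, "test");
    for (int i = 1; i <= 3; i++) {
      Put put = new Put(Bytes.toBytes("row" + i));
      put.add(Bytes.toBytes("data"), Bytes.toBytes(String.valueOf(i)), Bytes.toBytes("value" + i));
      table.put(put);
    }

    // Get the first row, then scan the whole table
    Result get = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println("Get: " + get);
    ResultScanner scanner = table.getScanner(new Scan());
    for (Result result : scanner) {
      System.out.println("Scan: " + result);
    }
    scanner.close();
    table.close();

    // A table must be disabled before it can be dropped
    admin.disableTable("test");
    admin.deleteTable("test");
    admin.close();
  }
}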

Here’s a sample run:

% hbase ExampleClient

Get: keyvalues={row1/data:1/1414932826551/Put/vlen=6/mvcc=0}

Scan: keyvalues={row1/data:1/1414932826551/Put/vlen=6/mvcc=0}

Scan: keyvalues={row2/data:2/1414932826564/Put/vlen=6/mvcc=0}

Scan: keyvalues={row3/data:3/1414932826566/Put/vlen=6/mvcc=0}

Each line of output shows an HBase row, rendered using the toString() method from
Result. The fields are separated by a slash character, and are as follows: the row name,
the column name, the cell timestamp, the cell type, the length of the value’s byte array
(vlen), and an internal HBase field (mvcc). To get the value from a Result object, use its getValue() method.

Building an Online Query Application

➢ HDFS and MapReduce are powerful tools for processing batch operations over large datasets, but they do not provide ways to read or write individual records efficiently. We can overcome these drawbacks using HBase.
➢ To implement the online query application, we will use the HBase Java API
directly. Here it becomes clear how important your choice of schema and
storage format is.
Creating a Table using HBase Shell:
➢ We can create a table using the create command; here we must specify the table name and the column family name. The syntax to create a table in the HBase shell is shown below.

create '<table name>','<column family>'


➢ Example: Given below is a sample schema of a table named emp. It has two column families, "personal data" and "professional data", plus a row key.

➢ You can create this table in the HBase shell as shown below.

hbase(main):002:0> create 'emp', 'personal data', 'professional data'

➢ And it will give you the following output.

0 row(s) in 1.1300 seconds


=> Hbase::Table - emp

➢ Verification: We can verify whether the table was created using the list command, as shown below. Here we can observe the created emp table.

hbase(main):002:0> list
TABLE
emp
2 row(s) in 0.0340 seconds
Inserting Data using HBase Shell:
➢ To create data in an HBase table, the following commands and methods are used:
1. the put command,
2. the add() method of the Put class, and
3. the put() method of the HTable class.
➢ Using the put command, we can insert rows into a table. Its syntax is as follows:

put '<table name>','row1','<colfamily:colname>','<value>'
Inserting the First Row:
➢ Let us insert the first row values into the emp table as shown below.

hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','professional data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):008:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds
Reading Data using HBase Shell:
➢ The get command and the get() method of the HTable class are used to read data from a table in HBase. Using the get command, you can get a single row of data at a time. Its syntax is as follows:
get '<table name>','row1'
➢ Example: The following example shows how to use the get command. Let us read the first row of the emp table.
hbase(main):012:0> get 'emp', '1'
➢ Output:
COLUMN                      CELL
 personal:city              timestamp = 1417521848375, value = hyderabad
 personal:name              timestamp = 1417521785385, value = raju
 professional:designation   timestamp = 1417521885277, value = manager
 professional:salary        timestamp = 1417521903862, value = 50000
4 row(s) in 0.0270 seconds
Reading a Specific Column:
➢ Given below is the syntax to read a specific column using the get method.

get '<table name>', '<rowid>', {COLUMN => '<column family>:<column name>'}

➢ Example: Given below is an example of reading a specific column in an HBase table.
hbase(main):015:0> get 'emp', 'row1', {COLUMN => 'personal:name'}
➢ Output:
COLUMN          CELL
 personal:name  timestamp = 1418035791555, value = raju
1 row(s) in 0.0080 seconds
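➢ Deleting data: the delete step referred to in the Count section below is not shown in these notes. As a sketch (prompt numbers are illustrative), a single cell or a whole row can be removed with the following shell commands:

hbase(main):020:0> delete 'emp', '1', 'personal data:city'
hbase(main):021:0> deleteall 'emp', '1'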

Count:
➢ You can count the number of rows of a table using the count command. Its syntax is as follows:
count '<table name>'
➢ After deleting the first row, the emp table will have two rows. Verify it as shown below.
hbase(main):023:0> count 'emp'
2 row(s) in 0.090 seconds
=> 2
Truncate:
➢ This command disables, drops, and recreates a table. The syntax of truncate is as follows:
hbase> truncate '<table name>'
➢ Example: Given below is an example of the truncate command. Here we have truncated the emp table.
hbase(main):011:0> truncate 'emp'
➢ Output:
Truncating 'emp' table (it may take a while):
- Disabling table...
- Truncating table...
0 row(s) in 1.5950 seconds
➢ After truncating the table, use the scan command to verify. We will get a table with zero rows.
hbase(main):017:0> scan 'emp'
ROW    COLUMN + CELL
0 row(s) in 0.3110 seconds
Updating Data using HBase Shell:
➢ You can update an existing cell value using the put command. To do so, just follow the same syntax and mention your new value, as shown below.
put '<table name>','<row>','<column family>:<column name>','<new value>'
➢ The newly given value replaces the existing value, updating the row.
➢ Example: Suppose there is a table in HBase called emp with the following data.
hbase(main):003:0> scan 'emp'
ROW    COLUMN + CELL
row1   column = personal:name, timestamp = 1418051555, value = raju
row1   column = personal:city, timestamp = 1418275907, value = Hyderabad
row1   column = professional:designation, timestamp = 14180555, value = manager
row1   column = professional:salary, timestamp = 1418035791555, value = 50000
1 row(s) in 0.0100 seconds
➢ The following command will update the city value of the employee named 'Raju' to Delhi.
hbase(main):002:0> put 'emp','row1','personal:city','Delhi'
0 row(s) in 0.0400 seconds
➢ The updated table looks as follows, where you can observe that the city of Raju has been changed to 'Delhi'.
hbase(main):003:0> scan 'emp'
➢ Output:
ROW    COLUMN + CELL
row1   column = personal:name, timestamp = 1418035791555, value = raju
row1   column = personal:city, timestamp = 1418274645907, value = Delhi
row1   column = professional:designation, timestamp = 141857555, value = manager
row1   column = professional:salary, timestamp = 1418039555, value = 50000
1 row(s) in 0.0100 seconds
