BDA Unit-5
Hive: Installing Hive, Running Hive, Comparison with traditional Databases, HiveQL, Tables, Querying Data.
Spark: Installing Spark, Resilient Distributed Datasets, Shared Variables, Anatomy of a Spark Job Run.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop
to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and developed it
further as an open source under the name Apache Hive. It is used by different companies. For example,
Amazon uses it in Amazon Elastic MapReduce.
Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Installing Hive
To install and configure Apache Hive, first you need to download and unzip Hive. Then you need to customize the following files and settings:
The mirror link on the Apache Hive download page leads to the directories containing the available Hive tar packages. The page also provides useful instructions on how to validate the integrity of files retrieved from mirror sites.
The Ubuntu system presented in this guide already has Hadoop 3.2.1 installed. This Hadoop version is
compatible with the Hive 3.1.2 release.
Select the apache-hive-3.1.2-bin.tar.gz file to begin the download process.
Alternatively, access your Ubuntu command line and download the compressed Hive files using the wget command followed by the download path:
$ wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
Once the download process is complete, untar the compressed Hive package:
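$ tar xzf apache-hive-3.1.2-bin.tar.gz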
The Hive binary files are now located in the apache-hive-3.1.2-bin directory.
The $HIVE_HOME environment variable needs to direct the client shell to the apache-hive-3.1.2-
bin directory. Edit the .bashrc shell configuration file using a text editor of your choice (we will be using
nano):
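$ nano .bashrc
Append Hive environment variables along the following lines (the path assumes the archive was extracted into the hdoop user's home directory):
export HIVE_HOME=/home/hdoop/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin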
The Hadoop environment variables are located within the same file.
Save and exit the .bashrc file once you add the Hive variables. Apply the changes to the current environment
with the following command:
$source ~/.bashrc
Apache Hive needs to be able to interact with the Hadoop Distributed File System. Access the hive-
config.sh file using the previously created $HIVE_HOME variable:
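$ nano $HIVE_HOME/bin/hive-config.sh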
Add the HADOOP_HOME variable and the full path to your Hadoop directory:
export HADOOP_HOME=/home/hdoop/hadoop-3.2.1
Save the edits and exit the hive-config.sh file.
• The temporary, tmp directory is going to store the intermediate results of Hive processes.
• The warehouse directory is going to store the Hive related tables.
Create a tmp directory within the HDFS storage layer. This directory is going to store the intermediary data
Hive sends to the HDFS:
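$ hdfs dfs -mkdir /tmp
$ hdfs dfs -chmod g+w /tmp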
The output confirms that users now have write and execute permissions.
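Likewise, create the warehouse directory in HDFS and open up its group permissions (the location assumed here is Hive's default, /user/hive/warehouse):
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /user/hive/warehouse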
The output confirms that users now have write and execute permissions.
Apache Hive distributions contain template configuration files by default. The template files are located within
the Hive conf directory and outline default Hive settings.
$cd $HIVE_HOME/conf
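A hive-site.xml file can then be created from the provided template (a sketch of the usual step):
$ cp hive-default.xml.template hive-site.xml
$ nano hive-site.xml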
Using Hive in a stand-alone mode rather than in a real-life Apache Hadoop cluster is a safe option for
newcomers. You can configure the system to use your local storage rather than the HDFS layer by setting
the hive.metastore.warehouse.dir parameter value to the location of your Hive warehouse directory.
Apache Hive uses the Derby database to store metadata. Initiate the Derby database, from the
Hive bin directory using the schematool command:
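$ $HIVE_HOME/bin/schematool -dbType derby -initSchema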
If the Derby database does not successfully initiate, you might receive an error with the following content:
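The message typically resembles the following (the exact wording varies by version):
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument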
This error indicates that there is most likely an incompatibility issue between Hadoop and
Hive guava versions.
Locate the guava jar file in the Hive lib directory:
$ ls $HIVE_HOME/lib
Locate the guava jar file in the Hadoop lib directory as well:
$ls $HADOOP_HOME/share/hadoop/hdfs/lib
The two listed versions are not compatible and are causing the error. Remove the existing guava file from the
Hive lib directory:
$rm $HIVE_HOME/lib/guava-19.0.jar
Copy the guava file from the Hadoop lib directory to the Hive lib directory:
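For example (the exact guava version shipped with your Hadoop distribution may differ):
$ cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/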
Use the schematool command once again to initiate the Derby database:
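$ $HIVE_HOME/bin/schematool -dbType derby -initSchema
Once the schema is initialized, launch the Hive client shell from the bin directory: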
cd $HIVE_HOME/bin
hive
You are now able to issue SQL-like commands and directly interact with HDFS.
Conclusion
You have successfully installed and configured Hive on your Ubuntu system. Use HiveQL to query and
manage your Hadoop distributed storage and perform SQL-like tasks. Your Hadoop cluster now has an easy-to-use gateway that offers RDBMS-style access to data stored in HDFS.
Running Hive
The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL.
When starting Hive for the first time, we can check that it is working by listing its tables —there should be
none. The command must be terminated with a semicolon to tell Hive to execute it:
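hive> SHOW TABLES;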
OK
You can also run the Hive shell in noninteractive mode. The -f option runs the commands in the specified file,
which is script.q in this example:
$ hive -f script.q
For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is
not required:
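For example, using a placeholder table named dummy:
$ hive -e 'SELECT * FROM dummy'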
OK
In both interactive and noninteractive mode, Hive will print information to standard error—such as the time
taken to run a query—during the course of operation. You can suppress these messages using the -S option at
launch time, which has the effect of showing only the output result for queries:
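For example:
$ hive -S -e 'SELECT * FROM dummy'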
Hive Architecture:
Hive Consists of Mainly 3 core parts
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing
Hive Clients:
Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.
For Java-based applications, it provides JDBC drivers, and for other types of applications it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.
Hive Services:
Client interactions with Hive can be performed through Hive Services. If the client wants to perform any query
related operations in Hive, it has to communicate through Hive Services.
The CLI (command-line interface) acts as a Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and with the main driver in the Hive services, as shown in the architecture diagram above.
The driver present in the Hive services is the main driver, and it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver forwards requests from the different applications to the metastore and file systems for further processing.
Hive services such as the metastore, the file system, and the job client in turn communicate with Hive storage and perform the following actions:
• Metadata information of tables created in Hive is stored in Hive “Meta storage database”.
• Query results and data loaded in the tables are going to be stored in Hadoop cluster on HDFS.
From the architecture diagram we can understand the job execution flow in Hive with Hadoop.
Hive is continuously in contact with the Hadoop file system and its daemons via the execution engine. The dotted arrow in the job flow diagram shows the execution engine's communication with the Hadoop daemons.
The Metastore:
Metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (like their
schema and location) and partitions in a relational database. It provides client access to this information by
using metastore service API.
The Hive metastore can be deployed in three modes:
• Embedded Metastore
• Local Metastore
• Remote Metastore
Let’s now discuss the above three Hive Metastore deployment modes one by one-
i. Embedded Metastore
In Hive, by default, the metastore service runs in the same JVM as the Hive service. In this mode it uses an embedded Derby database stored on the local file system. Thus both the metastore service and the Hive service run in the same JVM using the embedded Derby database.
However, this mode has a limitation: because only one embedded Derby database can access the database files on disk at any one time, only one Hive session can be open at a time.
ii. Local Metastore
This configuration is called a local metastore because the metastore service still runs in the same process as Hive, but it connects to a database running in a separate process, either on the same machine or on a remote machine.
Before starting Apache Hive client, add the JDBC / ODBC driver libraries to the Hive lib folder.
MySQL is a popular choice for the standalone metastore. In this case, the javax.jdo.option.ConnectionURL property is set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. The JDBC driver JAR file for MySQL (Connector/J) must be on Hive's classpath, which is achieved by placing it in Hive's lib directory.
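A hive-site.xml fragment for such a MySQL-backed metastore might look like the following sketch (host, database name, and credentials are placeholders):
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://host/dbname?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivepassword</value>
</property>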
iii. Remote Metastore
Another metastore configuration is the remote metastore. In this mode, the metastore runs in its own separate JVM, not in the Hive service JVM. Other processes communicate with the metastore server using Thrift network APIs.
We can also run more than one metastore server in this case to provide higher availability. This also brings better manageability and security, because the database tier can be completely firewalled off, and clients no longer need to share database credentials with each Hive user to access the metastore database.
To use this remote metastore, you should configure Hive service by setting hive.metastore.uris to the
metastore server URI(s). Metastore server URIs are of the form thrift://host:port, where the port corresponds to
the one set by METASTORE_PORT when starting the metastore server.
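For example, a hive-site.xml entry for a remote metastore might look like this sketch (the host is a placeholder; 9083 is the customary metastore port):
<property>
<name>hive.metastore.uris</name>
<value>thrift://metastore-host:9083</value>
</property>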
Databases Supported by Hive
• Derby
• MySQL
• MS SQL Server
• Oracle
• Postgres
➢ The differences between Hive and an RDBMS (traditional relational database) are outlined below. These are the key features of Hive that differ from an RDBMS.
➢ Hive resembles a traditional database by supporting an SQL interface, but it is not a full database. Hive is better described as a data warehouse than as a database.
➢ Hive enforces schema on read, whereas an RDBMS enforces schema on write. In an RDBMS, a table's schema is enforced at data load time; if the data being loaded doesn't conform to the schema, it is rejected. This design is called schema on write.
➢ Hive doesn't verify the data when it is loaded, but rather when a query is issued. This is called schema on read. Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database's internal format; the load operation is just a file copy or move.
➢ Schema on write makes query time performance faster, since the database can index columns and
perform compression on the data but it takes longer to load data into the database.
➢ Hive is based on the notion of Write once, Read many times but RDBMS is designed for Read and
Write many times.
➢ In RDBMS, record level updates, insertions and deletes, transactions and indexes are possible.
Whereas these are not allowed in Hive because Hive was built to operate over HDFS data using
MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data
into a new table.
➢ In an RDBMS, the maximum data size handled is typically in the tens of terabytes, whereas Hive can handle hundreds of petabytes easily.
➢ As Hadoop is a batch-oriented system, Hive doesn't support OLTP (Online Transaction Processing). It is closer to OLAP (Online Analytical Processing), but not ideal, since there is significant latency between issuing a query and receiving a reply, due to the overhead of MapReduce jobs and the size of the data sets Hadoop was designed to serve.
➢ RDBMS is best suited for dynamic data analysis and where fast responses are expected but Hive is suited
for data warehouse applications, where relatively static data is analyzed, fast response times are not
required, and when the data is not changing rapidly.
➢ To overcome the limitations of Hive, HBase is being integrated with Hive to support record level
operations and OLAP.
➢ Hive is easily scalable at low cost, but an RDBMS is not as scalable, and scaling it up is very costly.
HiveQL
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a
Metastore.
Data Types
Hive has four complex types: ARRAY, MAP, STRUCT, and UNION. ARRAY and MAP are like their namesakes in Java, whereas a STRUCT is a record type that encapsulates a set of named fields. A UNION specifies a choice of data types; values must match exactly one of these types. The literal forms for arrays, maps, structs, and unions are provided as functions; that is, array, map, struct, and create_union are built-in Hive functions.
Array in Hive is an ordered sequence of similar type elements that are indexable using the zero-based integers.
array<datatype>
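Example: array('John', 'Doe') creates a two-element array, and 'John' can be accessed with array[0].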
MAP in Hive is a collection of key-value pairs, where the keys must be of a primitive type:
map<primitive_type, data_type>
Example: ‘first’ -> ‘John’, ‘last’ -> ‘Deo’, represented as map(‘first’, ‘John’, ‘last’, ‘Deo’). Now ‘John’ can be
accessed with map[‘first’].
STRUCT in Hive is similar to the STRUCT in C language. It is a record type that encapsulates a set of named
fields, which can be any primitive data type.
We can access the elements in STRUCT type using DOT (.) notation.
Example: For a column c3 of type STRUCT {c1 INTEGER; c2 INTEGER}, the c1 field is accessed by the
expression c3.c1.
UNION type in Hive is similar to the UNION in C. UNION types at any point of time can hold exactly one
data type from its specified data types.
The full support for UNIONTYPE data type in Hive is still incomplete.
Complex types permit an arbitrary level of nesting. Complex type declarations must specify the type of the fields in the collection, using an angled bracket notation, as illustrated in this table definition with four columns (one for each complex type):
CREATE TABLE complex (
c1 ARRAY<INT>,
c2 MAP<STRING, INT>,
c3 STRUCT<a:STRING, b:INT, c:DOUBLE>,
c4 UNIONTYPE<STRING, INT>
);
If we load the table with one row of data for ARRAY, MAP, STRUCT, and UNION using the literal functions mentioned earlier, the following query demonstrates the field accessor operators for each type:
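hive> SELECT c1[0], c2['b'], c3.c, c4 FROM complex;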
1 2 1.0 {1:63}
The usual set of SQL operators is provided by Hive: relational operators (such as x = 'a' for testing equality, x
IS NULL for testing nullity, and x LIKE 'a%' for pattern matching), arithmetic operators (such as x + 1 for
addition), and logical operators (such as x OR y for logical OR). The operators match those in MySQL, which
deviates from SQL-92 because || is logical OR, not string concatenation. Use the concat function for the latter
in both MySQL and Hive.
Hive comes with a large number of built-in functions—too many to list here—divided into categories that
include mathematical and statistical functions, string functions, date functions (for operating on string
representations of dates), conditional functions, aggregate functions, and functions for working with XML
(using the xpath function) and JSON.
You can retrieve a list of functions from the Hive shell by typing SHOW FUNCTIONS. To get brief usage instructions for a particular function, use the DESCRIBE command:
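hive> DESCRIBE FUNCTION length;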
length(str | binary) - Returns the length of str or number of bytes in binary data
Conversions
Primitive types form a hierarchy that dictates the implicit type conversions Hive will perform in function and
operator expressions.
For example, a TINYINT will be converted to an INT if an expression expects an INT; however, the reverse
conversion will not occur, and Hive will return an error unless the CAST operator is used.
Any numeric type can be implicitly converted to a wider type, or to a text type (STRING, VARCHAR,
CHAR). All the text types can be implicitly converted to another text type. Perhaps surprisingly, they can also
be converted to DOUBLE or DECIMAL.
BOOLEAN types cannot be converted to any other type.
You can perform explicit type conversion using CAST. For example, CAST('1' AS INT) will convert the string
'1' to the integer value 1. If the cast fails—as it does in CAST('X' AS INT), for example—the expression
returns NULL.
Tables:
➢ A Hive table is logically made up of the data being stored and the associated metadata describing the layout
of the data in the table. The data typically resides in HDFS, although it may reside in any Hadoop
filesystem, including the local file system or S3.
➢ Hive stores the metadata in a relational database and not in HDFS.
Managed Tables and External Tables:
➢ When you create a table in Hive, by default Hive will manage the data, which means that Hive moves the
data into its warehouse directory. Alternatively, you may create an external table, which tells Hive to refer to
the data that is at an existing location outside the warehouse directory.
➢ The difference between the two table types is seen in the LOAD and DROP semantics. Let’s consider a
managed table first. When you load data into a managed table, it is moved into Hive’s warehouse
directory.
➢ For example, this:
CREATE TABLE managed_table (dummy STRING);
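LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;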
➢ Will move the file hdfs://user/tom/data.txt into Hive's warehouse directory for the managed_table table, which is hdfs://user/hive/warehouse/managed_table.
➢ The load operation is very fast because it is just a move or rename within a file system. However, bear in
mind that Hive does not check that the files in the table directory conform to the schema declared for the
table, even for managed tables.
➢ If the table is later dropped, using DROP TABLE managed_table; then the table, including its metadata and its data, is deleted.
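For contrast, an external table refers to data at an existing location outside the warehouse directory; a sketch (the location path is illustrative):
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
With an external table, Hive does not move the data into its warehouse directory, and dropping the table deletes only the metadata, leaving the data untouched.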
Storage Formats:
➢ There are two dimensions that govern table storage in Hive: the row format and the file format. The row format dictates how rows, and the fields in a particular row, are stored. In Hive, the row format is defined by a SerDe (a Serializer-Deserializer).
➢ When acting as a deserializer, which is the case when querying a table, it will deserialize a row of data
from the bytes in the file to objects used internally by Hive to operate on that row of data.
➢ When used as a serializer, which is the case when performing an INSERT or CTAS, it will serialize
Hive’s internal representation of a row of data into the bytes that are written to the output file.
➢ The simplest format is a plain-text file, but there are row-oriented and column-oriented binary formats
available too.
➢ The default storage format: Delimited text
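A sketch of a table declared with explicit delimiters that mirror Hive's defaults (the table and column names are illustrative; '\001', '\002', and '\003' are the Ctrl-A, Ctrl-B, and Ctrl-C characters Hive uses by default):
CREATE TABLE delimited_example (id INT, tags ARRAY<STRING>, props MAP<STRING, STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;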
CREATE TABLE...AS SELECT:
➢ It’s very convenient to store the output of a Hive query in a new table, perhaps because it is too large to be
dumped to the console or because there are further processing steps to carry out on the result.
➢ The new table’s column definitions are derived from the columns retrieved by the SELECT clause. In the
following query, the target table has two columns named col1 and col2 whose types are the same as the
ones in the source table:
CREATE TABLE target
AS
SELECT col1, col2
FROM source;
➢ A CTAS operation is atomic, so if the SELECT query fails for some reason, the table is not created.
Altering Tables:
➢ Because Hive uses the schema-on-read approach, it’s flexible in permitting a table’s definition to
change after the table has been created. You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;
➢ In addition to updating the table metadata, ALTER TABLE moves the underlying table directory so that it
reflects the new name. Hive allows you to change the definition for columns, add new columns, or even
replace all existing columns in a table with a new set.
➢ For example, consider adding a new column:
ALTER TABLE target ADD COLUMNS (col3 STRING);
➢ The new column col3 is added after the existing (nonpartition) columns.
Dropping Tables:
➢ The DROP TABLE statement deletes the data and metadata for a table. In the case of external tables, only the
metadata is deleted; the data is left untouched. If you want to delete all the data in a table but keep the table
definition, use TRUNCATE TABLE.
➢ For example: TRUNCATE TABLE my_table;
➢ This doesn’t work for external tables; instead, use dfs -rmr (from the Hive shell) to remove the external
table directory directly.
➢ If you want to create a new, empty table with the same schema as another table, use the LIKE keyword:
CREATE TABLE new_table LIKE existing_table;
Querying Data
➢ This section shows how to use various forms of the SELECT statement to retrieve data from Hive.
Sorting and Aggregating:
➢ Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY produces a totally ordered result, but it does so by funnelling the data through a single reducer, so it can be slow for large inputs.
For example, sorting the weather records by year and by descending temperature within each year (only the last two rows of output are shown):
hive> SELECT year, temperature
    > FROM records2
    > ORDER BY year, temperature DESC;
1950 0
1950 -11
Joins:
➢ The simplest kind of join is the inner join, where each match in the input tables results in a row in the
output. Consider two small demonstration tables, sales (which lists the names of people and the IDs of the
items they bought) and things (which lists the item IDs and their names):
hive> SELECT * FROM sales;
Joe   2
Hank  4
Ali   0
Eve   3
Hank  2
hive> SELECT * FROM things;
2  Tie
4  Coat
3  Hat
1  Scarf
We can perform an inner join on the two tables as follows:
hive> SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
Joe   2  2  Tie
Hank  4  4  Coat
Eve   3  3  Hat
Hank  2  2  Tie
Spark
What is Spark?
➢ Apache Spark is a cluster computing framework for large-scale data processing. Spark does not
use MapReduce as an execution engine; instead, it uses its own distributed runtime for
executing work on a cluster.
➢ Spark is closely integrated with Hadoop: it can run on YARN and works with Hadoop file
formats and storage backends like HDFS.
Why do we need Spark in big data?
➢ Simply put, Spark is a fast and general engine for large-scale data processing. Fast means that it is faster than previous approaches to working with Big Data, such as classical MapReduce.
➢ The secret to this speed is that Spark runs on memory (RAM), which makes the processing much faster than on disk drives.
Spark architecture is well-layered
➢ In the master node, we have the driver program, which drives our application. The code we write behaves as the driver program, or, if we are using the interactive shell, the shell acts as the driver program.
➢ Inside the driver program, the first thing we do is create a Spark context. Think of the Spark context as a gateway to all the Spark functionalities; it is similar to your database connection.
➢ Any command we execute in our database goes through the database connection. Likewise,
anything we do on Spark goes through Spark context.
➢ Now, this Spark context works with the cluster manager to manage various jobs. The driver program and the Spark context take care of job execution within the cluster.
➢ A job is split into multiple tasks, which are distributed over the worker nodes. Anytime an RDD is created in the Spark context, it can be distributed across various nodes and cached there.
➢ Worker nodes are the slave nodes whose job is to execute the tasks. These tasks are executed on the partitioned RDDs in the worker nodes, and the results are returned to the Spark context.
➢ The Spark context takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform operations, collect the results, and return them to the main Spark context.
➢ If we increase the number of workers, then we can divide jobs into more partitions and execute them in parallel over multiple systems, which will be a lot faster.
➢ With the increase in the number of workers, the available memory also increases, and we can cache the data to execute jobs faster.
➢ The workflow of the Spark architecture can be summarized in the following steps:
➢ STEP 1: The client submits the Spark user application code. When an application code is submitted, the driver implicitly converts the user code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage, it also performs optimizations such as pipelining transformations.
➢ STEP 2: After that, it converts the logical graph (DAG) into a physical execution plan with many stages. After converting into a physical execution plan, it creates physical execution units called tasks under each stage. The tasks are then bundled and sent to the cluster.
➢ STEP 3: Now the driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on worker nodes on behalf of the driver. At this point, the driver will send the tasks to the executors based on data placement. When executors start, they register themselves with the driver, so the driver has a complete view of the executors that are executing the tasks.
➢ STEP 4: During the course of execution of tasks, the driver program monitors the set of executors that run them. The driver node also schedules future tasks based on data placement.
➢ This architecture is further integrated with various extensions and libraries. Apache Spark
Architecture is based on two main abstractions:
1. Resilient Distributed Dataset (RDD)
2. Directed Acyclic Graph (DAG)
Resilient Distributed Dataset (RDD):
➢ RDDs are the building blocks of any Spark application. RDDs Stands for:
Resilient: Fault tolerant and is capable of rebuilding data on failure
Distributed: Distributed data among the multiple nodes in a cluster
Dataset: Collection of partitioned data with values
➢ It is a layer of abstracted data over the distributed collection. It is immutable in nature and
follows lazy transformations. The data in an RDD is split into chunks based on a key.
➢ RDDs are highly resilient, i.e., they are able to recover quickly from any issues as the same
data chunks are replicated across multiple executor nodes.
➢ Thus, even if one executor node fails, another will still process the data. This allows you to
perform your functional calculations against your dataset very quickly by harnessing the
power of multiple nodes.
➢ Moreover, once you create an RDD it becomes immutable. Immutable means, an object
whose state cannot be modified after it is created, but they can surely be transformed.
➢ Talking about the distributed environment, each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster.
➢ Due to this, we can perform transformations or actions on the complete data parallelly. Also,
you don’t have to worry about the distribution, because Spark takes care of that.
Workflow of RDD
➢ There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, or HBase. With RDDs, you can perform two types of operations, as illustrated in the sketch after this list:
1. Transformations: They are the operations that are applied to create a new RDD.
2. Actions: They are applied on an RDD to instruct Apache Spark to apply computation and
pass the result back to the driver.
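A minimal spark-shell sketch of both kinds of operation, assuming the shell's built-in SparkContext is available as sc:
scala> val nums = sc.parallelize(1 to 10) // create an RDD from a local collection
scala> val doubled = nums.map(_ * 2) // transformation: lazily defines a new RDD
scala> val total = doubled.reduce(_ + _) // action: triggers the computation and returns 110 to the driver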
Features of Spark:
➢ 1. Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. It is also able to achieve this speed through controlled partitioning.
➢ 2. Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities.
➢ 3. Deployment: It can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager.
➢ 4. Real-Time: It offers real-time computation and low latency because of in-memory computation.
➢ 5. Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written in any of these four languages. It also provides a shell in Scala and Python.
Installing Spark
➢ Download a stable release of the Spark binary distribution from the downloads page and
unpack the tarball in a suitable location:
% tar xzf spark-x.y.z-bin-distro.tgz
➢ It’s convenient to put the Spark binaries on your path as follows:
% export SPARK_HOME=~/sw/spark-x.y.z-bin-distro
% export PATH=$PATH:$SPARK_HOME/bin
➢ We’re now ready to run an example in Spark.
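For instance, the interactive Scala shell, which provides a ready-made SparkContext as the variable sc, can be started with:
% spark-shell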
Resilient Distributed Datasets
RDDs are at the heart of every Spark program.
Creation
There are three ways of creating RDDs: from an in-memory collection of objects (known as parallelizing a collection), using a dataset from external storage (such as HDFS), or transforming an existing RDD. The first way is useful for doing CPU-intensive computations on small amounts of input data in parallel. For example, the following runs separate computations on the numbers from 1 to 10:
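A sketch in the spark-shell, where performExpensiveComputation stands in for any user-defined function:
scala> val params = sc.parallelize(1 to 10)
scala> val result = params.map(performExpensiveComputation)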
The second way to create an RDD is by creating a reference to an external dataset. We have already seen how
to create an RDD of String objects for a text file:
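For example (inputPath is a placeholder for the path of the file to read):
scala> val text = sc.textFile(inputPath)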
The third way of creating an RDD is by transforming an existing RDD. We look at transformations next.
Spark provides two categories of operations on RDDs: transformations and actions. A transformation
generates a new RDD from an existing one, while an action triggers a computation on an RDD and does
something with the results—either returning them to the user, or saving them to external storage. Actions have
an immediate effect, but transformations do not—they are lazy, in the sense that they don’t perform any work
until an action is performed on the transformed RDD. For example, the following lowercases lines in a text
file:
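val text = sc.textFile(inputPath) // inputPath: placeholder for the input file's path
val lower = text.map(_.toLowerCase()) // transformation: no work is done yet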
lower.foreach(println(_))
Persistence
We can cache the intermediate dataset of year-temperature pairs in memory with the following:
scala> tuples.cache()
Calling cache() does not cache the RDD in memory straightaway. Instead, it marks the RDD with a flag
indicating it should be cached when the Spark job is run. So let’s first force a job run:
scala> tuples.reduceByKey((a, b) => Math.max(a, b)).foreach(println(_))
(1950,22)
(1949,111)
Serialization
There are two aspects of serialization to consider in Spark: serialization of data and serialization of functions
(or closures).
Data:
Let’s look at data serialization first. By default, Spark will use Java serialization to send data over the network
from one executor to another, or when caching (persisting) data in serialized form as described in “Persistence
levels”. Java serialization is well understood by programmers (you make sure the class you are using
implements java.io.Serializable or java.io.Externalizable), but it is not particularly efficient from a performance
or size perspective.
Functions:
Generally, serialization of functions will “just work”: in Scala, functions are serializable using the standard
Java serialization mechanism, which is what Spark uses to send functions to remote executor nodes. Spark will
serialize functions even when running in local mode, so if you inadvertently introduce a function that is not
serializable (such as one converted from a method on a nonserializable class), you will catch it early on in the
development process.
Shared Variables
Accumulators are shared variables that are only ever "added" to, and are used to implement counters or sums. Spark natively provides support for accumulators of numeric types; however, we can add support for new types.
➢ To create a numeric accumulator, call SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate values of Long or Double type.
scala> val a=sc.longAccumulator("Accumulator")
scala> sc.parallelize(Array(2,5)).foreach(x=>a.add(x))
scala> a.value
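The value returned is 7 (2 + 5), since both elements of the parallelized array were added to the accumulator when the foreach action ran.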
Anatomy of a Spark Job Run:
➢ What happens when we run a Spark job? At the highest level, there are two independent
entities: the driver, which hosts the application (SparkContext) and schedules tasks for a job;
and the executors, which are exclusive to the application, run for the duration of the
application, and execute the application’s tasks.
➢ Usually the driver runs as a client that is not managed by the cluster manager and the
executors run on machines in the cluster.
1. Job Submission:
➢ A Spark job is submitted automatically when an action (such as count()) is performed on an
RDD. Internally, this causes runJob() to be called on the SparkContext (step 1), which passes
the call on to the scheduler that runs as a part of the driver (step 2).
➢ The scheduler is made up of two parts: a DAG scheduler that breaks down the job into a DAG
of stages, and a task scheduler that is responsible for submitting the tasks from each stage to
the cluster.
HBasics
What is HBase?
➢ HBase is an open-source, column-oriented, distributed database system that runs on top of HDFS (Hadoop Distributed File System). It is modeled on Google's Bigtable and is written primarily in Java.
➢ HBase can store massive amounts of data, from terabytes to petabytes. Tables in HBase can consist of billions of rows and millions of columns. HBase is built for low-latency operations and has some specific features compared to traditional relational models.
Why do we need HBase in big data?
➢ Apache HBase is needed for real-time Big Data applications. A table for a
popular web application may consist of billions of rows. If we want to search
a particular row from such a huge amount of data, HBase is the ideal choice
as query fetch time is less. Most of the online analytics applications use
HBase.
➢ Traditional relational data models fail to meet the performance
requirements of very big databases. These performance and processing
limitations can be overcome by Apache HBase.
Apache HBase Features:
1. HBase is built for low latency operations.
2. HBase is used extensively for random read and write operations.
3. HBase stores a large amount of data in terms of tables.
4. Provides linear and modular scalability over cluster environment.
5. Strictly consistent to read and write operations.
6. Automatic and configurable sharding of tables.
7. Easy to use Java API for client access.
Installing HBase:
The prerequisites for HBase installation are Java and Hadoop installed on your Linux machine.
HBase can be installed in three modes: standalone, pseudo-distributed, and fully distributed.
Download a stable release from an Apache Download Mirror and unpack it on your local
filesystem. For example:
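% tar xzf hbase-x.y.z.tar.gz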
As with Hadoop, you first need to tell HBase where Java is located on your system. If you
have the JAVA_HOME environment variable set to point to a suitable Java installation, then
that will be used, and you don’t have to configure anything further.
Otherwise, you can set the Java installation that HBase uses by editing HBase’s conf/hbase-
env.sh file and specifying the JAVA_HOME variable.
For convenience, add the HBase binary directory to your command-line path.
For example:
% export HBASE_HOME=~/sw/hbase-x.y.z
% export PATH=$PATH:$HBASE_HOME/bin
You can then get the list of HBase options by running:
% hbase
Test Drive
To start a standalone instance of HBase that uses a temporary directory on the local filesystem for persistence, use this:
% start-hbase.sh
% hbase shell
hbase(main):001:0>
HBase Commands
o Create: Creates a new table identified by 'table1' and Column Family identified by
'colf'.
o Put: Inserts a new record into the table with row identified by 'row..'
o Scan: returns the data stored in table
o Get: Returns the records matching the row identifier provided in the table
o Help: Get a list of commands
Syntax:
create 'table1', 'colf'
list 'table1'
put 'table1', 'row1', 'colf:a', 'value1'
put 'table1', 'row1', 'colf:b', 'value2'
put 'table1', 'row2', 'colf:a', 'value3'
scan 'table1'
get 'table1', 'row1'
To create a table named test with a single column family named data using defaults for table
and column family attributes, enter:
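create 'test', 'data'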
To prove the new table was created successfully, run the list command. This will output all
tables in user space:
hbase(main):002:0> list
TABLE
test
To insert data into three different rows and columns in the data column family, get the first
row, and then list the table content, do the following:
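put 'test', 'row1', 'data:1', 'value1'
put 'test', 'row2', 'data:2', 'value2'
put 'test', 'row3', 'data:3', 'value3'
get 'test', 'row1'
scan 'test'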
Shut down your HBase instance by running:
% stop-hbase.sh
Clients
There are a number of client options for interacting with an HBase cluster. Java HBase, like
Hadoop, is written in Java. Example:
• Most of the HBase classes are found in the org.apache.hadoop.hbase and
org.apache.hadoop.hbase.client packages.
• In this class, we first ask the HBaseConfiguration class to create a Configuration
object. It will return a Configuration that has read the HBase configuration from the
hbase-site.xml and hbase-default.xml files found on the program’s classpath.
• This Configuration is subsequently used to create instances of HBaseAdmin and
HTable.
• HBaseAdmin is used for administering your HBase cluster, specifically for adding
and dropping tables. HTable is used to access a specific table.
• To create a table, we need to create an instance of HBaseAdmin and then ask it to
create the table named test with a single column family named data.
• To operate on a table, we will need an instance of HTable, which we construct by
passing it our Configuration instance and the name of the table. We then create Put
objects in a loop to insert data into the table.
• Next, we create a Get object to retrieve and print the first row that we added. Then we
use a Scan object to scan over the table, printing out what we find. At the end of the
program, we clean up by first disabling the table and then deleting it (recall that a
table must be disabled before it can be dropped).
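The following is a sketch of such a client, using the older HBaseAdmin/HTable client API described above (newer HBase releases replace these classes with Admin and Table obtained from a Connection); the class name and structure are illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ExampleClient {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml and hbase-default.xml from the classpath
    Configuration config = HBaseConfiguration.create();

    // Create the 'test' table with a single column family called 'data'
    HBaseAdmin admin = new HBaseAdmin(config);
    TableName tableName = TableName.valueOf("test");
    HTableDescriptor htd = new HTableDescriptor(tableName);
    htd.addFamily(new HColumnDescriptor("data"));
    admin.createTable(htd);

    // Insert three rows into the table
    HTable table = new HTable(config, tableName);
    for (int i = 1; i <= 3; i++) {
      Put put = new Put(Bytes.toBytes("row" + i));
      put.add(Bytes.toBytes("data"), Bytes.toBytes(Integer.toString(i)),
          Bytes.toBytes("value" + i));
      table.put(put);
    }

    // Retrieve and print the first row
    Get get = new Get(Bytes.toBytes("row1"));
    Result getResult = table.get(get);
    System.out.println("Get: " + getResult);

    // Scan the whole table, printing every row found
    ResultScanner scanner = table.getScanner(new Scan());
    for (Result result : scanner) {
      System.out.println("Scan: " + result);
    }
    scanner.close();
    table.close();

    // Clean up: a table must be disabled before it can be dropped
    admin.disableTable(tableName);
    admin.deleteTable(tableName);
    admin.close();
  }
}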
% hbase ExampleClient
Get: keyvalues={row1/data:1/1414932826551/Put/vlen=6/mvcc=0}
Scan: keyvalues={row1/data:1/1414932826551/Put/vlen=6/mvcc=0}
Scan: keyvalues={row2/data:2/1414932826564/Put/vlen=6/mvcc=0}
Scan: keyvalues={row3/data:3/1414932826566/Put/vlen=6/mvcc=0}
Each line of output shows an HBase row, rendered using the toString() method from Result. The fields are separated by a slash character and are as follows: the row name, the column name, the cell timestamp, the cell type, the length of the value's byte array (vlen), and an internal HBase field (mvcc). To get a specific cell value from a Result object, use its getValue() method.
➢ HDFS and MapReduce are powerful tools for processing batch operations over large datasets, but they do not provide ways to read or write individual records efficiently. We can overcome these drawbacks using HBase.
➢ To implement the online query application, we will use the HBase Java API
directly. Here it becomes clear how important your choice of schema and
storage format is.
Creating a Table using HBase Shell:
➢ We can create a table using the create command; here we must specify the table name and the column family name. The syntax to create a table in the HBase shell is shown below.
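create '<table name>', '<column family>'
For example, the emp table used below can be created with its two column families as follows:
create 'emp', 'personal data', 'professional data'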
➢ Verification: We can verify whether the table has been created using the list command, as shown below. Here we can observe the created emp table.
hbase(main):002:0> list
TABLE
emp
2 row(s) in 0.0340 seconds
Inserting Data using HBase Shell:
➢ To create data in an HBase table, the following commands and methods are used:
1. the put command,
2. the add() method of the Put class, and
3. the put() method of the HTable class.
➢ Using the put command, we can insert rows into a table. Its syntax is as follows:
put '<table name>','row1','<colfamily:colname>','<value>'
Inserting the First Row:
➢ Let us insert the first row values into the emp table as shown below.
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','professional data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):007:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds
Reading Data using HBase Shell:
➢ The get command and the get() method of HTable class are used to read data
from a table inHBase. Using get command, you can get a single row of data at
a time. Its syntax is as follows:
get ’<table name>’,’row1’
➢ Example: The following example shows how to use the get command. Let us read the first row of the emp table.
hbase(main):012:0> get 'emp', '1'
➢ Output:
COLUMN                          CELL
personal data:city              timestamp = 1417521848375, value = hyderabad
personal data:name              timestamp = 1417521785385, value = raju
professional data:designation   timestamp = 1417521885277, value = manager
professional data:salary        timestamp = 1417521903862, value = 50000
4 row(s) in 0.0270 seconds
Reading a Specific Column:
➢ Given below is the syntax to read a specific column using the get method.
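get '<table name>','<row>',{COLUMN => '<column family:column name>'}
For example: get 'emp', '1', {COLUMN => 'personal data:name'}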
Count:
➢ You can count the number of rows of a table using the count command. Its syntax is as follows:
count '<table name>'
➢ For example, after deleting the first row, the emp table will have two rows. Verify it as shown below.
hbase(main):023:0> count 'emp'
2 row(s) in 0.090 seconds
⇒ 2
Truncate:
➢ This command disables, drops, and recreates a table. The syntax of truncate is as follows:
hbase> truncate '<table name>'
➢ Example: Given below is an example of the truncate command. Here we have truncated the emp table.
hbase(main):011:0> truncate 'emp'
➢ Output: Truncating 'one' table (it may take a while):
- Disabling table...
- Truncating table...
0 row(s) in 1.5950 seconds
➢ After truncating the table, use the scan command to verify; we will get a table with zero rows.
hbase(main):017:0> scan 'emp'
ROW    COLUMN + CELL
0 row(s) in 0.3110 seconds
Updating Data using HBase Shell:
➢ You can update an existing cell value using the put command. To do so, just follow the same syntax and mention your new value, as shown below.
put '<table name>','<row>','<column family:column name>','<new value>'
➢ The newly given value replaces the existing value, updating the row.
➢ Example: Suppose there is a table in HBase called emp with the following data.
hbase(main):003:0> scan 'emp'
ROW     COLUMN + CELL
row1    column = personal:name, timestamp = 1418051555, value = raju
row1    column = personal:city, timestamp = 1418275907, value = Hyderabad
row1    column = professional:designation, timestamp = 14180555, value = manager
row1    column = professional:salary, timestamp = 1418035791555, value = 50000
1 row(s) in 0.0100 seconds
➢ The following command will update the city value of the employee named 'Raju' to Delhi.
hbase(main):002:0> put 'emp','row1','personal:city','Delhi'
0 row(s) in 0.0400 seconds
➢ The updated table looks as follows, where you can observe that the city of Raju has been changed to 'Delhi'.
hbase(main):003:0> scan 'emp'
➢ Output:
ROW     COLUMN + CELL
row1    column = personal:name, timestamp = 1418035791555, value = raju
row1    column = personal:city, timestamp = 1418274645907, value = Delhi
row1    column = professional:designation, timestamp = 141857555, value = manager
row1    column = professional:salary, timestamp = 1418039555, value = 50000
1 row(s) in 0.0100 seconds