AND TOOLS
Structure
7.0 Introduction
7.1 Objectives
7.2 Apache SPARK Framework
7.3 HIVE
7.3.1 Working of HIVE Queries
7.3.2 Installation of HIVE
7.3.3 Writing Queries in HIVE
7.4 HBase
7.4.1 HBase Installation
7.4.2 Working with HBase
7.5 Other Tools
7.6 Summary
7.7 Answers
7.8 References and further readings
7.0 INTRODUCTION
In Units 5 and 6 of this Block, you have gone through the concepts of Hadoop and MapReduce programming, including the various phases of a MapReduce program. This unit introduces you to other popular big data tools and architectures, which are beneficial to both ETL developers and analytics professionals.
This Unit introduces you to the basic software stack of the SPARK architecture, one of the most popular architectures for handling large data. In addition, this Unit introduces two important tools: HIVE, which is a data warehouse system, and HBase, which is an important database system. HBase is an open-source NoSQL database that uses HDFS and Apache Hadoop to function; it provides extendable storage on which very large amounts of data can be stored. HIVE, built on HDFS, is a SQL-like engine that uses MapReduce.
7.1 OBJECTIVES
After going through this Unit, you should be able to:
• explain the software stack and key features of the Apache SPARK framework;
• install HIVE and write basic queries in HiveQL;
• install HBase and work with its shell commands; and
• list other popular big data tools.
7.2 APACHE SPARK FRAMEWORK
SPARK was designed for quick iterative processing, such as machine learning and interactive data analysis, while preserving the scalability and fault tolerance of Hadoop MapReduce. As we have already discussed in Unit 6, the MapReduce programming model, which is the foundation of the Hadoop framework, allows for scalable, adaptable, fault-resilient, and cost-effective solutions. However, for such workloads it becomes imperative to reduce the turnaround time between queries and execution. The Apache Software Foundation released SPARK to speed up the Hadoop computing processes. SPARK has its own cluster management; hence, it is not dependent on Hadoop and is not a revised form of Hadoop. Hadoop is merely one of the ways of implementing SPARK. SPARK's key feature is its in-memory cluster computing, which accelerates application processing. Numerous workloads, including batch processing, iterative algorithms, interactive queries, and streaming, can be handled by SPARK. By accommodating each of these workloads in a single system, it lessens the administrative strain of managing various tools.
Figure 1 depicts the Spark data lake and shows how Apache Spark can work in conjunction with Hadoop components and how information flows through Apache Spark. The Hadoop Distributed File System (HDFS) allows us to form a cluster of computing machines and utilise their combined capacity to store huge data volumes. Further, MapReduce allows us to use the combined power of the cluster to process the enormous data stored in HDFS. The advent of HDFS and MapReduce provided horizontal scalability at a low capital cost as compared to data warehouses. Gradually, cloud infrastructure became more economical and gained wider adoption.
A data lake is a centralised repository in which large amounts of structured, semi-structured, and unstructured data can be stored, processed, and secured. It can process any type of data and store it in its native format, without regard to size restrictions. A data lake has four key capabilities:
i) Ingest: allows data collection and ingestion,
ii) Store: responsible for data storage and management,
iii) Process: leads to transformation and data processing, and
iv) Consume: ensures data access and retrieval.
The store capability of a data lake could be provided by HDFS or by a cloud store such as Amazon S3, Azure Blob or Google Cloud Storage. Cloud storage allows scalable and highly available access at an extremely low cost and can be procured in almost no time. The notion of the data lake recommends bringing data into the data lake in raw format, i.e. ingesting the data and preserving an unmodified, immutable copy of it. The ingest block of the data lake is about identifying, implementing and managing the right tools to bring data from the source systems to the data lake. There is no single ingestion tool; multiple tools such as HVR, Informatica, Talend, etc. may be used.
The next layer is the process layer, where all computation takes place, such as initial data quality checks, transforming and preparing data, correlating, aggregating, analysing, and applying machine learning models. The processing layer is further broken into two parts, which helps to manage it better: i) data processing and ii) orchestration. Data processing is the core development framework that allows the design and development of distributed computing applications; Apache Spark is part of data processing. The orchestration framework is responsible for the formation of the clusters, managing resources, scaling up or down, etc. There are three main competing orchestration tools: Hadoop YARN, Kubernetes and Apache Mesos.
The last and most critical capability of the data lake is to consume the data from the data lake for real-life usage. The data lake is a repository of raw and processed data. Consumption requirements could come from data analysts/scientists, from applications or dashboards that draw insights from the data, from JDBC/ODBC connectors, or from REST interfaces, etc.
Step 2: Check whether Scala is installed or not, since Spark is written in Scala; however, elementary knowledge of Scala is enough to run Spark. The other languages supported by Spark are Java, R and Python.
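Such checks can be performed from a terminal; a minimal sketch is given below (the version numbers reported by your system will differ):

    $ java -version     # verifies that Java is installed and on the PATH
    $ scala -version    # verifies that Scala is installed

If either command reports that it is not found, install the corresponding package before proceeding with the Spark installation.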
7.3 HIVE
A Hadoop utility for processing structured data is called Hive. It sits on top of Hadoop to summarise big data and simplifies querying and analysis. Initially created by Facebook, Hive was later taken up and further developed as an open-source project under the name Apache Hive by the Apache Software Foundation. It is utilised by various businesses. Hive is not a relational database, and it does not support row-level modifications of data, as is the case in SQL-based database management systems. It is also not designed for real-time queries; rather, it is meant for batch-oriented analytical processing. It puts processed data into HDFS and stores the schema in a database. It offers a querying language called HiveQL or HQL that is similar to SQL. It is dependable, quick, extensible, and scalable.
Figure 3: HIVE Architecture
ii) Hive services: Hive provides a range of services for various purposes.
The following are some of the most useful services:
a. Beeline: It is a command shell supported by HiveServer2, through which users can send commands and queries to the system. It is a JDBC client based on the SQLLine command line interface (CLI). SQLLine is a Java console-based utility for running SQL queries and connecting to relational databases. (A sample Beeline connection is shown after this list.)
b. Hive Server 2: The successor to HiveServer1, HiveServer2 enables clients to execute queries against Hive. It allows numerous clients to submit queries to Hive concurrently and retrieve the results.
c. Hive Driver: The user submits a Hive query, written as HiveQL statements, via the command shell; the query is received by the Hive driver. The driver creates a session handle for the query and then sends the query to the compiler.
d. Hive Compiler: The query is parsed by the Hive compiler. Using the metadata of the database, which is stored in the metastore, it performs semantic analysis and data type validation on the parsed query, and then it produces an execution plan. The execution plan generated by the compiler is a DAG (Directed Acyclic Graph), with each stage being a map or reduce job, an HDFS action, or a metadata operation.
e. Optimizer: To increase productivity and scalability, the optimizer
separates the work and performs transformation actions on the
execution plan so as to optimise the query execution time.
f. Execution engine: Following the optimization and compilation
processes, the execution engine uses Hadoop to carry out the
execution plan generated by the compiler as per the dependency
order amongst the execution plan.
g. Metastore: The metadata information on the columns and column
types in tables and partitions is kept in a central location called
the metastore. Additionally, it stores data storage information for
HDFS files as well as serializer and deserializer information
needed for read or write operations. Typically, this metastore is a
relational database. A Thrift interface is made available by
Metastore for querying and modifying Hive metadata.
h. HCatalog: It refers to the storage and table management layer of Hadoop. It is built on top of the metastore and makes the Hive metastore's tabular data accessible to other data processing tools.
i. WebHCat: HCatalog's REST API is referred to as WebHCat. HCatalog, the Hadoop table storage management tool, allows other Hadoop applications to access the tabular data of the Hive metastore. A REST API is a web service architecture used to create online services that communicate via the HTTP protocol. WebHCat is therefore an HTTP interface for working with Hive metadata.
iii) Processing and Resource Management: Internally, the de facto engine for Hive's query execution is the MapReduce framework. MapReduce is a software framework used to create programs that process enormous amounts of data concurrently on vast clusters of commodity hardware. The data is divided into chunks, which are processed by map and reduce tasks as part of a MapReduce job.
iv) Distributed Storage: Since Hadoop is the foundation of Hive, the
distributed storage is handled by the Hadoop Distributed File System.
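As mentioned above, Beeline can be used to connect to HiveServer2 and issue queries. A minimal sketch, assuming HiveServer2 is running on the local machine on its default port 10000 (the host, port and user name are examples only):

    $ beeline -u jdbc:hive2://localhost:10000 -n hduser
    0: jdbc:hive2://localhost:10000> SHOW DATABASES;
    0: jdbc:hive2://localhost:10000> !quit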
7.3.1 Working of HIVE Queries
There are various steps involved in the execution of Hive queries, as follows:
Step 1: executeQuery Command: The Hive UI (either the command line or the web UI) sends the query to be executed to the driver (JDBC/ODBC).
Step 2: getPlan Command: After accepting the query and creating a session handle for it, the driver instructs the compiler to produce an execution plan for the query.
Step 3: getMetadata Command: The compiler contacts the metastore with a metadata request. The Hive metadata contains table information, such as the schema and location, as well as partition information.
Step 4: sendMetadata Command: The metastore transmits the metadata to the
compiler. These metadata are used by the compiler to type-check and analyse
the semantics of the query expressions. The execution plan (in the form of a
Directed Acyclic graph) is then produced by the compiler. This plan can use
MapReduce programming. Therefore, the map and reduce jobs would include
the map and the reduce operator trees.
Step 5: sendPlan Command: Next compiler communicates with the driver by
sending the created execution plan.
Step 6: executePlan Command: The driver transmits the plan to the execution
engine for execution after obtaining it from the compiler.
Step 7: submit job to MapReduce: The execution engine sends the stages of the Directed Acyclic Graph (DAG) to the appropriate map and reduce worker nodes. Each task, whether mapper or reducer, reads the rows from the HDFS files using the deserializer; these rows are then passed through the associated operator tree. As soon as the output is ready, the serializer writes it to a temporary HDFS file. For Data Manipulation Language (DML) operations, the final temporary file is then moved to the table's location.
Step 8-10: sendResult Command: The driver issues a fetch request, and the execution engine reads the contents of the temporary files directly from HDFS and returns them to the driver. The results are then sent to the Hive UI by the driver.
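You can ask Hive to display the execution plan it generates for a query by prefixing the query with the EXPLAIN keyword. A minimal sketch, using a hypothetical table named sales:

    -- 'sales' is a hypothetical table used only for illustration
    EXPLAIN
    SELECT region, SUM(amount)
    FROM sales
    GROUP BY region;

The output lists the stages (map/reduce jobs, HDFS actions and metadata operations) that make up the plan described above.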
7.3.2 Installation of HIVE
The step-by-step installation of Hive (on Ubuntu or any other Linux platform) is as follows:
Step 1: Check whether JAVA is installed or not; if not, you need to install it.
Step 2: Check whether HADOOP is installed or not; if not, you need to install it.
Step 6: Set up environment for Hive by adding the following to ~/.bashrc file:
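A minimal sketch of the environment settings, assuming Hive has been extracted to /usr/local/hive (adjust the path to wherever you have placed Hive on your system):

    # Add to ~/.bashrc; the installation path below is an example only
    export HIVE_HOME=/usr/local/hive
    export PATH=$PATH:$HIVE_HOME/bin

After editing the file, run source ~/.bashrc so that the changes take effect in the current shell.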
7.3.3 Writing Queries in HIVE
In order to write HIVE queries, you must install the HIVE software on your system. HIVE queries are written using the query language supported by Hive (HiveQL). The different data types supported by Hive are:
1. Column types: Integer, String, Union, Timestamp, etc.
2. Literals: Floating point, Decimal
3. Null values: all missing values are treated as NULL
4. Complex types: Arrays, Maps, Struct
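A minimal sketch of a table that uses the complex types; the table and column names are hypothetical and used only for illustration:

    CREATE TABLE employee_details (
      name    STRING,
      salary  FLOAT,
      skills  ARRAY<STRING>,
      phones  MAP<STRING, STRING>,
      address STRUCT<city:STRING, pincode:INT>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';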
Hive Query Commands: The following are some of the basic commands for creating and dropping databases; creating, dropping and altering tables; creating partitions, etc. This section also lists some of the commands used to query these databases, including the join command. A consolidated sketch of these commands is given after the list.
1. Create Database
2. Drop database:
3. Creating Table:
4. Altering Table:
5. Dropping Table:
6. Add Partition:
7. Operators Used:
Relational Operators: =, !=, <, <=, >, >=, Is NULL, Is Not NULL,
LIKE (to compare strings)
Arithmetic Operators: +, -, *, /, %, & (Bitwise AND), | (Bitwise OR), ^
(Bitwise XOR), ~(Bitwise NOT)
Logical Operators: &&, || , !
Complex Operators: A[n] (nth element of array A), M[key] (returns the value of the key in map M), S.x (returns the x field of struct S)
8. Functions: round(), floor(), ceil(), rand(), concat(string M, string N),
upper(), lower(), to_date(), cast()
Aggregate functions: count(), sum(), avg(), min(), max()
9. Views:
11. Select Order By Clause: To get information from a single column and
sort the result set in either ascending or descending order, use the
ORDER BY clause.
12. Select Group By Clause: The GROUP BY clause is used to group the records of a result set using a particular collection column. It is used to query a group of records.
13. Join: The JOIN clause is used to combine specified fields from two
tables using values that are shared by both. It is employed to merge data
from two or more database tables.
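A consolidated, hedged sketch of the commands listed above. All database, table and column names (userdb, employee, department, sales, etc.) are hypothetical and used only for illustration:

    CREATE DATABASE IF NOT EXISTS userdb;
    USE userdb;

    -- Creating tables
    CREATE TABLE IF NOT EXISTS employee (
      id INT, name STRING, salary FLOAT, dept STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    CREATE TABLE IF NOT EXISTS department (
      dept_name STRING, location STRING
    );

    -- Altering a table; creating a partitioned table and adding a partition
    ALTER TABLE employee ADD COLUMNS (city STRING);
    CREATE TABLE IF NOT EXISTS sales (id INT, amount FLOAT)
    PARTITIONED BY (yr INT);
    ALTER TABLE sales ADD PARTITION (yr=2023);

    -- Queries using relational operators, functions, ORDER BY and GROUP BY
    SELECT upper(name), round(salary)
    FROM employee
    WHERE salary >= 25000 AND dept IS NOT NULL
    ORDER BY salary DESC;

    SELECT dept, count(*), avg(salary)
    FROM employee
    GROUP BY dept;

    -- Join and view
    SELECT e.name, d.location
    FROM employee e JOIN department d ON (e.dept = d.dept_name);
    CREATE VIEW IF NOT EXISTS well_paid AS
      SELECT * FROM employee WHERE salary > 50000;

    -- Dropping a table and a database
    DROP TABLE IF EXISTS sales;
    DROP DATABASE IF EXISTS userdb CASCADE;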
Thus, as can be seen, most of the commands are very close to SQL syntax. If you know SQL, you will be able to write Hive queries too.
7.4 HBASE
RDBMS has been the answer to issues of data maintenance and storage since the 1970s. After big data became prevalent, businesses began to realise the advantages of processing large data and began choosing solutions like Hadoop. Hadoop processes huge data using MapReduce and stores it in distributed file systems. Hadoop excels at processing and storing vast amounts of data in arbitrary, semi-structured, and even unstructured formats. However, Hadoop is only capable of batch processing, and data can only be accessed sequentially. That implies that even for the most straightforward tasks, one must scan the entire dataset. Moreover, when a large dataset is processed, it produces another equally large data collection, which should also be processed in a timely manner. At this stage, a fresh approach is needed to access any item of data in one go, i.e. random access.
HBase, a distributed column-oriented data store, was created on top of Hadoop. It is a horizontally scalable open-source project. Similar to Google's Bigtable, HBase is a data model created to offer speedy random access to enormous amounts of structured data. It makes use of the fault tolerance of the Hadoop Distributed File System (HDFS). It offers real-time read and write operations to access data randomly in the Hadoop File System. Data can be stored in HDFS either directly or through HBase. Using HBase, data consumers randomly read from and write to the data stored in HDFS. In a nutshell, as compared to HDFS, HBase enables faster lookup and low-latency random access due to its internal use of hash tables.
HBase is a column-oriented database whose tables are sorted by row (as depicted in Figure 4). The table structure defines column families, which are made up of key-value pairs. A table is a grouping of rows; a row is made up of different column families; a column family is a collection of columns; and each column holds a set of key-value pairs.
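A minimal HBase shell sketch that illustrates this data model; the table name ('employee'), the column families ('personal', 'professional') and the values are hypothetical:

    # create a table with two column families
    create 'employee', 'personal', 'professional'
    # each cell is addressed by row key and 'columnfamily:column', and holds a value
    put 'employee', 'row1', 'personal:name', 'Asha'
    put 'employee', 'row1', 'professional:role', 'Analyst'
    # retrieve all cells of the row
    get 'employee', 'row1'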
Figure 5: HBase components having HBase master and several region servers.
Tables in HBase are divided into regions, which are handled by region servers. Regions are vertically organised into "Stores" by column families. In HDFS, stores are saved as files (as shown in Figure 5). The master server assigns the regions to region servers with Apache ZooKeeper's assistance. It handles the load balancing of regions across the region servers: it unloads the busy servers, moves regions to less occupied servers, and maintains the state of the cluster by negotiating the load balancing. Regions are nothing but tables that have been split up and distributed across the region servers. The region servers deal with client interactions and data-related tasks; they handle all read and write requests for the regions they host and decide the size of a region using the region size thresholds.
A store consists of a memory store (MemStore) and HFiles. The MemStore is similar to a cache memory: everything that is entered into HBase is initially saved here. The data is then transferred and stored as blocks in HFiles, after which the MemStore is cleared. All modifications to the data in HBase's file-based storage are tracked by the Write Ahead Log (WAL). The WAL makes sure that the data changes can be replayed in the event that a region server fails or becomes unavailable before the MemStore is flushed.
ZooKeeper is an open-source project that offers services such as naming, maintaining configuration data, distributed synchronisation, etc. Different region servers are represented by ephemeral nodes in ZooKeeper. Master servers use these nodes to discover available servers, and the nodes also track server outages and network partitions. Clients communicate with the region servers via ZooKeeper. In standalone and pseudo-distributed modes, HBase itself manages ZooKeeper. HBase operates in standalone mode by default; standalone and pseudo-distributed modes are offered for small-scale testing, while distributed mode is suited for a production setting, in which HBase daemon instances run across a number of server machines in the cluster. The HBase architecture is shown in Figure 6.
7.4.1 HBase Installation
Step 1: Check whether JAVA is installed or not; if not, you need to install it.
Step 2: Check whether HADOOP is installed or not; if not, you need to install it.
Edit hbase-site.xml
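For standalone (or pseudo-distributed) operation, a minimal sketch of hbase-site.xml; the directory paths are examples only and should be adjusted for your system:

    <configuration>
      <!-- where HBase stores its data on the local filesystem -->
      <property>
        <name>hbase.rootdir</name>
        <value>file:///home/hadoop/HBase/hbase</value>
      </property>
      <!-- where the bundled ZooKeeper keeps its data -->
      <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/hadoop/HBase/zookeeper</value>
      </property>
    </configuration>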
Distributed Mode:
Step 5: Edit hbase-site.xml
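For distributed mode, a minimal sketch of hbase-site.xml; the HDFS URL is an example and should point to your NameNode:

    <configuration>
      <!-- store HBase data in HDFS instead of the local filesystem -->
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
      </property>
      <!-- run HBase in (fully) distributed mode -->
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
    </configuration>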
After you have completed the installation of HBase, you can use commands to work with it. The next section discusses these commands.
7.4.2 Working with HBase
In order to interact with HBase, you should first use the shell commands given below:
HBase shell Commands: One can communicate with HBase by utilising the shell that is included with it. HBase uses the Hadoop File System to store its data. It contains a master server and region servers. The data (tables) is stored in regions; these regions are split and stored in the respective region servers. The region servers are managed by the master server, and HDFS is used for all of these functions. The following commands are supported in the HBase shell.
a) Generic Commands:
status: It gives information about HBase's status, such as how many
servers are there.
version: It gives the HBase current version that is in use.
table_help: It contains instructions for table related commands.
whoami: It will provide details of the user.
b) Data definition language:
create: Creates a table.
list: Lists all the tables in HBase.
disable: Disables a table.
is_disabled: Checks whether a table is disabled.
enable: Enables a table.
is_enabled: Checks whether a table is enabled.
describe: Gives the description of a table.
alter: Alters a table.
exists: Checks whether a table exists.
drop: Drops a table.
drop_all: Drops all the tables matching a given pattern.
Java Admin API: Java offers an Admin API for programmers to
implement DDL functionality.
c) Data manipulation language:
put: Puts a cell value at a specified column of a specified row in a particular table.
get: Retrieves the contents of a row or a cell.
delete: Deletes a cell value in a table.
deleteall: Deletes all the cells in a given row.
scan: Scans the table and returns the table data.
count: Counts the rows in a table and returns that number.
truncate: Disables, drops and recreates a specified table.
Java client API: As with the DDL commands, Java provides a client API, under the org.apache.hadoop.hbase.client package, that enables programmers to perform DML operations programmatically.
2. List table:
This command, when used at the HBase prompt, lists all the tables in HBase. (A consolidated sketch of the shell commands in this list is given after item 9.)
3. Disable table
4. Enable table
6. Exists table:
7. Drop table:
8. Exit shell:
9. Insert data:
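A consolidated, hedged sketch of the shell commands listed above; the table name ('emp'), the column family ('personal') and the values are hypothetical:

    # create a table and list the existing tables
    create 'emp', 'personal'
    list
    # insert data and read it back
    put 'emp', '1', 'personal:name', 'Ravi'
    scan 'emp'
    get 'emp', '1'
    count 'emp'
    # disable/enable the table and check its state
    disable 'emp'
    is_disabled 'emp'
    enable 'emp'
    exists 'emp'
    # a table must be disabled before it can be dropped
    disable 'emp'
    drop 'emp'
    # leave the shell
    exit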
7.5 OTHER TOOLS
A wide variety of big data tools and technologies are available on the market today. They improve time management and cost effectiveness for tasks involving data analysis. Some of these include Atlas.ti, HPCC, Apache Storm, Cassandra, Stats iQ, CouchDB, Pentaho, Flink, Cloudera, OpenRefine, RapidMiner, etc.
Atlas.ti allows you to access all available platforms from one place. It can be utilised for qualitative data analysis and mixed-methods research. High Performance Computing Cluster (HPCC) Systems provides its services using a single platform, a single architecture, and a single programming language for data processing. Apache Storm is a free and open-source distributed, fault-tolerant big data computation system. The Apache Cassandra database is widely used today to manage massive amounts of data effectively. Stats iQ, the statistical tool by Qualtrics, is simple to use. CouchDB stores data in JSON documents that can be accessed over the web or queried using JavaScript. Pentaho provides big data technologies to extract, prepare, and combine data; it offers analytics and visualisations that transform how a business is run. Apache Flink is an open-source data analytics tool for processing massive data streams. Cloudera is among the most efficient, user-friendly, and secure big data platforms available today; it lets anyone access any data from any environment using a single, scalable platform. OpenRefine is a big data analytics tool that helps in working with unclean data, cleaning it up, and converting it between different formats. RapidMiner is utilised for model deployment, machine learning, and data preparation; it provides a range of products to set up predictive analytics and create new data mining processes.
Check Your Progress 3
7.6 SUMMARY
In this unit, we have learnt about various tools and cutting-edge big data technologies being used by data analytics professionals worldwide. In particular, we studied in detail the usage, installation, components and working of three main big data tools: Spark, Hive and HBase. Furthermore, we also discussed how to perform query processing in Hive and HBase.
7.7 SOLUTIONS/ANSWERS
iii) Advanced Analytics support: Spark offers more than just "Map" and "Reduce". Additionally, it supports graph algorithms, streaming data, machine learning (ML), and SQL queries.
2. The main components of Apache Spark framework are Spark core, Spark SQL
for interactive queries, Spark Streaming for real time streaming analytics,
Machine Learning Library, and GraphX for graph processing.
ii) Hive services: Hive offers a number of services, like the Hive
server2, Beeline, etc., to handle all queries. Hive provides a range of
services, including: a) Beeline, b) Hive Server 2, c) Hive Driver, d)
Hive Compiler, e) Optimizer, f) Execution engine, g) Metastore, h)
HCatalog i) WebHCat.
iii) Processing and Resource Management: Internally, the de facto
engine for Hive's query execution is the MapReduce framework. A
software framework called MapReduce is used to create programmes
that process enormous amounts of data concurrently on vast clusters
of commodity hardware. Data is divided into pieces and processed
by map-reduce tasks as part of a map-reduce job.
iv) Distributed Storage: Since Hadoop is the foundation of Hive, the
distributed storage is handled by the Hadoop Distributed File System.
2. Hive has 4 main data types:
1. Column Types: Integer, String, Union, Timestamp
2. Literals: Floating, Decimal
3. Null Values: All missing values as NULL
4. Complex Types: Arrays, Maps, Struct
3. There are 4 main types:
1. JOIN: JOIN is very similar to the SQL JOIN; it combines records that have matching values in both tables (an inner join).
2. FULL OUTER JOIN: The records from the left and right outer tables are
combined in a FULL OUTER JOIN.
3. LEFT OUTER JOIN: All rows from the left table are retrieved using the
LEFT OUTER JOIN even if there are no matches in the right table.
4. RIGHT OUTER JOIN: In this case as well, even if there are no matches in
the left table, all the rows from the right table are retrieved.
7.8 REFERENCES AND FURTHER READINGS
[1] https://www.infoworld.com/article/3236869/what-is-apache-spark-the-big-data-platform-that-crushed-hadoop.html
[2] https://aws.amazon.com/big-data/what-is-spark/
[3] https://spark.apache.org/
[4] https://www.tutorialspoint.com/hive/hive_views_and_indexes.htm
[5] https://hive.apache.org/downloads.html
[6] https://data-flair.training/blogs/apache-hive-architecture/
[7] https://halvadeforspark.readthedocs.io/en/latest/
[8] https://www.tutorialspoint.com/hbase/index.htm
[9] Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive: Data Warehouse and Query Language for Hadoop. O'Reilly Media, Inc., 2012.
[10] Du, Dayong. Apache Hive Essentials. Packt Publishing Ltd, 2015.
[11] Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.
[12] George, Lars. HBase: The Definitive Guide: Random Access to Your Planet-Size Data. O'Reilly Media, Inc., 2011.