Big Data Analytics – Unit 4
UNIT IV
Frameworks and Applications: Frameworks: Applications on Big Data Using Pig and Hive, Data
processing operators in Pig, Hive services, HiveQL, Querying Data in Hive, fundamentals of
HBase and ZooKeeper.
3. HBase :
HBase is a column-oriented non-relational database management system that runs on top of the
Hadoop Distributed File System (HDFS).
HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big
data use cases.
HBase also supports writing applications in Apache Avro, REST, and Thrift.
Application :
1. Medical
2. Sports
3. Web
4. Oil and petroleum
5. e-commerce
Apache Pig Vs SQL :
· The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
· Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for query optimization in SQL.
Grunt :
The Grunt shell is the command-line shell of Apache Pig.
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
Pig scripts can be executed from the Grunt shell, which is the native shell provided by Apache Pig to
execute Pig queries.
We can invoke shell commands using sh and fs.
Syntax of the sh command :
grunt> sh ls
Syntax of the fs command :
grunt> fs -ls
Pig Latin :
Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop.
It is a textual language that abstracts the programming from the Java MapReduce idiom into a
higher-level notation.
Pig Latin statements are used to process the data.
Each statement is an operator that accepts a relation as input and generates another relation as output.
· It can span multiple lines.
· Each statement must end with a semi-colon.
· It may include expression and schemas.
· By default, these statements are processed using multi-query execution.
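A minimal sketch of a few Pig Latin statements is shown below; the file name, field names, and types are hypothetical and used only for illustration:
-- load a comma-separated file into a relation (file name and schema are assumed)
student = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
-- run the statements above and print the relation on the screen
DUMP student;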
User-Defined Functions :
Apache Pig provides extensive support for User Defined Functions (UDFs).
Using these UDFs, we can define our own functions and use them.
The UDF support is provided in six programming languages:
· Java
· Jython
· Python
· JavaScript
· Ruby
· Groovy
For writing UDFs, complete support is provided in Java, and limited support is provided in all the
remaining languages.
Using Java, you can write UDFs involving all parts of the processing, such as data load/store, column
transformation, and aggregation.
Since Apache Pig itself is written in Java, UDFs written in Java work more efficiently than those
written in other languages.
Types of UDFs in Java :
Filter Functions :
• The filter functions are used as conditions in filter statements.
• These functions accept a Pig value as input and return a Boolean value.
Eval Functions :
• The Eval functions are used in FOREACH-GENERATE statements.
• These functions accept a Pig value as input and return a Pig result.
Algebraic Functions :
• The Algebraic functions act on inner bags in a FOREACH-GENERATE statement.
• These functions are used to perform full MapReduce operations on an inner bag.
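As an illustration of how a UDF written in Java is used from Pig Latin, a minimal sketch follows; the jar file and the class com.example.pig.IsGoodRecord are hypothetical:
-- register the jar containing the compiled Java UDF classes (hypothetical jar)
REGISTER my_udfs.jar;
-- give the hypothetical filter UDF class a short alias
DEFINE IsGoodRecord com.example.pig.IsGoodRecord();
records = LOAD 'records.txt' AS (year:chararray, temperature:int, quality:int);
-- use the filter UDF as a condition in a FILTER statement
good_records = FILTER records BY IsGoodRecord(quality);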
Relational Operators :
Relational operators are the main tools Pig Latin provides to operate on the data.
Some of the Relational Operators are :
LOAD: The LOAD operator is used to load data from the file system or HDFS storage into a
Pig relation.
FOREACH: This operator generates data transformations based on columns of data. It is used to
add or remove fields from a relation.
FILTER: This operator selects tuples from a relation based on a condition.
JOIN: The JOIN operator is used to perform an inner equijoin of two or more relations
based on common field values.
ORDER BY: Order By is used to sort a relation based on one or more fields in either ascending
or descending order using ASC and DESC keywords.
GROUP: The GROUP operator groups together the tuples with the same group key (key field).
COGROUP: COGROUP is the same as the GROUP operator. For readability, programmers
usually use GROUP when only one relation is involved and COGROUP when multiple relations
are involved.
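A short sketch combining several of these relational operators is given below; the file, field, and relation names are hypothetical:
orders = LOAD 'orders.csv' USING PigStorage(',') AS (id:int, cust:chararray, amount:double);
-- keep only the large orders
big_orders = FILTER orders BY amount > 1000.0;
-- group the filtered tuples by customer
grouped = GROUP big_orders BY cust;
-- compute the total amount per customer
totals = FOREACH grouped GENERATE group AS cust, SUM(big_orders.amount) AS total;
-- sort customers by total in descending order
sorted = ORDER totals BY total DESC;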
Diagnostic Operators :
The load statement will simply load the data into the specified relation in Apache Pig.
To verify the execution of the Load statement, you have to use the Diagnostic Operators.
Some Diagnostic Operators are :
DUMP: The DUMP operator is used to run Pig Latin statements and display the results on the
screen.
DESCRIBE: Use the DESCRIBE operator to review the schema of a particular relation. The
DESCRIBE operator is best used for debugging a script.
ILLUSTRATE: ILLUSTRATE operator is used to review how data is transformed through a
sequence of Pig Latin statements. ILLUSTRATE command is your best friend when it comes to
debugging a script.
EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and MapReduce
execution plans of a relation.
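Continuing the hypothetical sketch above, the diagnostic operators can be applied to its relations as follows:
-- print the schema of the relation totals
DESCRIBE totals;
-- display the logical, physical, and MapReduce execution plans
EXPLAIN sorted;
-- execute the pipeline and print the results on the screen
DUMP sorted;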
Hive Architecture :
The architecture of Apache Hive consists of the following major components :
1. Hive Client
2. Hive Services
3. Processing and Resource Management
4. Distributed Storage
HIVE CLIENT :
Hive supports applications written in languages like Python, Java, C++, Ruby, etc., using JDBC,
ODBC, and Thrift drivers, for performing queries on Hive.
Hence, one can easily write a Hive client application in a language of one's own choice.
Hive clients are categorized into three types :
1. Thrift Clients : The Hive server is based on Apache Thrift, so it can serve requests
from Thrift clients.
2. JDBC client : Hive allows Java applications to connect to it using the JDBC driver.
The JDBC driver uses Thrift to communicate with the Hive server.
3. ODBC client : Hive ODBC driver allows applications based on the ODBC protocol to
connect to Hive. Similar to the JDBC driver, the ODBC driver uses Thrift to communicate with
the Hive Server.
HIVE SERVICES :
To perform all queries, Hive provides various services like HiveServer2, Beeline, etc.
The various services offered by Hive are :
1. Beeline
2. Hive Server 2
3. Hive Driver
4. Hive Compiler
5. Optimizer
6. Execution Engine
7. Metastore
8. HCatalog
9. WebHCat
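For example, the Beeline client connects to HiveServer2 over JDBC and can run a query directly; the host, port, and user below are assumptions for illustration:
$ beeline -u "jdbc:hive2://localhost:10000/default" -n hiveuser -e "SHOW DATABASES;"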
PROCESSING AND RESOURCE MANAGEMENT :
Hive internally uses the MapReduce framework as its de facto engine for executing queries.
MapReduce is a software framework for writing applications that process massive amounts
of data in parallel on large clusters of commodity hardware.
A MapReduce job works by splitting data into chunks, which are processed by map and reduce tasks.
DISTRIBUTED STORAGE :
Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File System for the
distributed storage.
4.5 Hive Shell :
The Hive shell is the primary way to interact with Hive.
It is the default service in Hive.
· Hive Driver: It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
· Hive Compiler: The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
· Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of map-
reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the
order of their dependencies.
MetaStore :
Hive Metastore (HMS) is a service that stores metadata for Apache Hive and other services in a
backend RDBMS, such as MySQL or PostgreSQL.
Impala, Spark, Hive, and other services share the metastore.
The connections to and from HMS include HiveServer, Ranger, and the NameNode, which
represents HDFS.
Beeline, Hue, JDBC, and Impala shell clients make requests through Thrift or JDBC to HiveServer.
The HiveServer instance reads/writes data to HMS.
By default, redundant HMS operate in active/active mode.
The physical data resides in a backend RDBMS, one for HMS.
All connections are routed to a single RDBMS service at any given time.
HMS talks to the NameNode over Thrift and functions as a client to HDFS.
HMS connects directly to Ranger and the NameNode (HDFS), and so does HiveServer.
One or more HMS instances on the backend can talk to other services, such as Ranger.
Comparison with Traditional Database :
· An RDBMS uses SQL (Structured Query Language), whereas Hive uses HQL (Hive Query Language).
4.7 HiveQL :
Even though it is based on SQL, HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT.
HiveQL initially lacked support for transactions and materialized views and offered only limited
subquery support.
Support for INSERT, UPDATE, and DELETE with full ACID functionality was made available with
release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce,
Tez, or Spark jobs, which are submitted to Hadoop for execution.
Example :
The example below checks whether the table docs exists and drops it if it does, then creates a new
table called docs with a single column of type STRING called line.
It then loads the specified file or directory (in this case "input_file") into the table.
OVERWRITE specifies that the target table to which the data is being loaded is to be overwritten;
otherwise, the data would be appended.
The query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a table
called word_counts with two columns: word and count.
This query draws its input from the inner
query (SELECT explode(split(line, '\s')) AS word FROM docs) temp.
The inner query serves to split the input words into different rows of a temporary table aliased as temp.
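Putting the steps described above together, the complete word-count example looks roughly like this:
-- drop the table if it already exists, then create it with a single STRING column
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
-- load the input file, overwriting any existing contents of the table
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
-- count the occurrences of each word produced by the inner query
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
The GROUP BY word clause is required by the count(1) aggregate; ORDER BY word simply sorts the output.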
External Tables :
An external table is one where only the table schema is controlled by Hive.
In most cases, the user will set up the folder location within HDFS and copy the data file(s) there.
This location is included as part of the table definition statement.
When an external table is deleted, Hive will only delete the schema associated with the table.
The data files are not affected.
Syntax for External Tables :
-- the table name "stocks" is assumed here from the LOCATION path below
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
Querying Data :
A query is a request for data or information from a database table or a combination of tables.
This data may be generated as results returned by Structured Query Language (SQL) or as
pictorials, graphs or complex results, e.g., trend analyses from data-mining tools.
One of several different query languages may be used to perform a range of simple to complex
database queries.
SQL, the most well-known and widely used query language, is familiar to most database
administrators (DBAs).
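As a simple illustration, a query against the stocks external table defined earlier might look like this (a sketch):
-- highest recorded price per symbol
SELECT symbol, MAX(price_high) AS highest_price
FROM stocks
GROUP BY symbol;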
User-Defined Functions :
In Hive, the users can define their own functions to meet certain client requirements.
These are known as UDFs in Hive.
User-defined functions are written in Java for specific modules.
Some UDFs are specifically designed for the reusability of code in application frameworks.
The developer will develop these functions in Java and integrate those UDFs with the Hive.
During query execution, the developer can directly use the code, and the UDFs will return outputs
according to the user-defined tasks.
This provides high performance in terms of coding and execution.
The general type of UDF will accept a single input value and produce a single output value.
We can use two different interfaces for writing Apache Hive User-Defined Functions :
1. Simple API
2. Complex API
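A typical sketch of registering and invoking a Java UDF from HiveQL is shown below; the jar path, the class name, and the function name my_lower are hypothetical, and docs is the table from the word-count example:
-- make the jar containing the compiled UDF class available to Hive (hypothetical path)
ADD JAR /path/to/my_udfs.jar;
-- bind the hypothetical Java class to a function name usable in queries
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';
-- use the UDF like any built-in function
SELECT my_lower(line) FROM docs;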
· Sample output (year and temperature values) :
1949 111
1949 78
1950 22
1950 0
1950 -11
4.8 MapReduce Scripts in Hive / Hive Scripts :
Similar to any other scripting language, Hive scripts are used to execute a set of Hive commands
collectively.
Hive scripting helps us to reduce the time and effort invested in writing and executing the
individual commands manually.
Hive scripting is supported in Hive 0.10.0 or higher versions of Hive.
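A minimal sketch of a Hive script and how it is run with the hive command's -f option; the file name sample.hql and the database name are hypothetical:
-- contents of sample.hql
CREATE DATABASE IF NOT EXISTS demo;
USE demo;
SHOW TABLES;
Running the script from the command line:
$ hive -f sample.hql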
Joins and SubQueries :
JOINS :
Join queries can be performed on two tables present in Hive.
Joins are of 4 types, these are :
· Inner join: The Records common to both tables will be retrieved by this Inner Join.
· Left outer Join: Returns all the rows from the left table even though there are no matches in
the right table.
· Right Outer Join: Returns all the rows from the Right table even though there are no matches
in the left table.
· Full Outer Join: It combines the records of both tables based on the JOIN condition given in
the query. It returns all the records from both tables and fills in NULL values for the columns
that have no matching value on either side.
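For example, an inner join between two hypothetical tables customers and orders can be written as follows (a sketch):
SELECT c.id, c.name, o.amount
FROM customers c
JOIN orders o ON (c.id = o.customer_id);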
SUBQUERIES :
A Query present within a Query is known as a subquery.
The main query will depend on the values returned by the subqueries.
Subqueries can be classified into two types :
· Subqueries in FROM clause
· Subqueries in WHERE clause
When to use :
· To get a particular value combined from two column values from different tables.
· When the values of one table depend on values in other tables.
· For comparative checking of one column's values against columns of other tables.
Syntax :
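A sketch of both subquery forms, using hypothetical tables orders and customers:
-- Subquery in the FROM clause: aggregate per customer, then filter the result
SELECT t.cust, t.total
FROM (SELECT customer_id AS cust, SUM(amount) AS total FROM orders GROUP BY customer_id) t
WHERE t.total > 1000;
-- Subquery in the WHERE clause (IN/EXISTS subqueries are supported from Hive 0.13)
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE city = 'Pune');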
4.9 HBASE
HBase Concepts :
HBase is a distributed column-oriented database built on top of the Hadoop file system.
It is an open-source project and is horizontally scalable.
HBase is a data model similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data.
It leverages the fault tolerance provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in
the Hadoop File System.
One can store the data in HDFS either directly or through HBase.
Data consumers read/access the data in HDFS randomly using HBase.
HBase sits on top of the Hadoop File System and provides read and write access.
HBase Vs RDBMS :
· An RDBMS is row-oriented, whereas HBase is column-oriented.
Schema Design :
HBase table can scale to billions of rows and any number of columns based on your requirements.
This table allows you to store terabytes of data in it.
The HBase table supports high read and write throughput at low latency.
A single value in each row is indexed; this value is known as the row key.
The HBase schema design is very different compared to the relational database schema design.
Some of the general concepts that should be followed while designing a schema in HBase:
· Row key: Each table in HBase is indexed on the row key. There are no secondary
indices available on an HBase table.
· Atomicity: Avoid designing a table that requires atomicity across all rows. All operations
on HBase rows are atomic at the row level.
· Even distribution: Reads and writes should be uniformly distributed across all nodes available
in the cluster. Design the row key in such a way that related entities are stored in adjacent rows
to increase read efficiency.
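A small sketch in the HBase shell illustrating the row key and column-family model; the table, column family, row key, and values are hypothetical:
create 'users', 'info'                        # table 'users' with one column family 'info'
put 'users', 'user1001', 'info:name', 'Asha'  # row key 'user1001', column 'info:name'
put 'users', 'user1001', 'info:city', 'Pune'
get 'users', 'user1001'                       # random read of a single row by row key
scan 'users'                                  # scan all rows of the table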
4.10 Zookeeper :
ZooKeeper is a distributed coordination service that helps to manage a large set of hosts.
Managing and coordinating a service, especially in a distributed environment, is a complicated
process; ZooKeeper solves this problem through its simple architecture and API.
ZooKeeper allows developers to focus on core application logic.
For instance, Apache HBase uses ZooKeeper to track the status of distributed data.
ZooKeeper can also easily support a large Hadoop cluster.
To retrieve information, each client machine communicates with one of the ZooKeeper servers.
ZooKeeper keeps an eye on synchronization as well as coordination across the cluster.
Some of the best Apache ZooKeeper features are :
· Simplicity: It coordinates with the help of a shared hierarchical namespace.
· Reliability: The system keeps performing even if more than one node fails.
· Speed: It is particularly fast in read-dominant workloads, where reads outnumber writes by roughly 10:1.
· Scalability: Performance can be enhanced by deploying more machines.
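For illustration, the ZooKeeper command-line client (zkCli.sh) exposes the shared hierarchical namespace of znodes; the server address, paths, and data below are hypothetical:
$ zkCli.sh -server localhost:2181
create /myapp "config-v1"
ls /
get /myapp
Here create stores a small piece of data at the znode /myapp, ls / lists the children of the root znode, and get /myapp reads the data back.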
Intro to Big SQL :
IBM Big SQL is a high-performance massively parallel processing (MPP) SQL engine for Hadoop
that makes querying enterprise data from across the organization an easy and secure experience.
A Big SQL query can quickly access a variety of data sources, including HDFS, RDBMS, NoSQL
databases, object stores, and WebHDFS, by using a single database connection or a single query, for
best-in-class analytic capabilities.
Big SQL provides tools to help you manage your system and your databases, and you can use
popular analytic tools to visualize your data.
Big SQL's robust engine executes complex queries for relational data and Hadoop data.
Big SQL provides an advanced SQL compiler and a cost-based optimizer for efficient query
execution.
Combining these with a massively parallel processing (MPP) engine helps distribute query execution
across nodes in a cluster.