
Dr. R.K. Shanmugam College of Arts and Science, Indili, Kallakurichi


PG and Research Department of Computer Science
III B.Sc., Computer Science
UNIT V

SYLLABUS:

Frameworks: Applications on Big Data Using Pig and Hive – Data processing operators in Pig –
Hive services – HiveQL – Querying Data in Hive – fundamentals of HBase and ZooKeeper– IBM
InfoSphere BigInsights and Streams.

Applications of Big Data using Pig and Hive:

Pig :
 Pig is a high-level platform or tool which is used to process large datasets.
 It provides a high level of abstraction for processing over MapReduce.
 It provides a high-level scripting language, known as Pig Latin, which is used to develop data
analysis code.
Applications :
1. Pig scripting is used for exploring large datasets.
2. It provides support for ad-hoc queries across large datasets.
3. It is used in prototyping algorithms for processing large datasets.
4. It is used where time-sensitive data loads must be processed.
5. It is used for collecting large amounts of data in the form of search logs and web crawls.
6. It is used where analytical insights are needed using sampling.

Hive :
 Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
 It resides on top of Hadoop to summarize Big Data and makes querying and analyzing
easy.
 It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Benefits :
1. Ease of use
2. Accelerated initial insertion of data
3. Superior scalability, flexibility, and cost-efficiency
4. Streamlined security
5. Low overhead
6. Exceptional working capacity
Data processing operators in Pig:
Apache Pig provides a large set of operators for performing several types of operations.
Let's discuss the types of Apache Pig operators:
1. Diagnostic Operators
2. Grouping & Joining
3. Combining & Splitting
4. Filtering
5. Sorting

So, let’s discuss each type of Apache Pig Operators in detail.


Types of Pig Operators

1. Diagnostic Operators: Apache Pig Operators


Basically, we use Diagnostic Operators to verify the execution of the Load statement.
There are four different types of diagnostic operators −

a. Dump operator
b. Describe operator
c. Explanation operator
d. Illustration operator

Further, we will discuss each operator of Pig Latin in depth.

a. Dump Operator
In order to run Pig Latin statements and display the results on the screen, we use the Dump operator.
Generally, we use it for debugging purposes.
 Syntax
So the syntax of the Dump operator is:
grunt> Dump Relation_Name
 Example
Here is an example in which a dump is performed after each statement.

A = LOAD 'Employee' AS (name:chararray, age:int, gpa:float);


DUMP A;
(Shubham,18,4.0F)
(Pulkit,19,3.7F)
(Shreyash,20,3.9F)
(Mehul,22,3.8F)
(Rishabh,20,4.0F)
B = FILTER A BY age < 19 OR age > 21;
DUMP B;
(Shubham,18,4.0F)
(Mehul,22,3.8F)
b. Describe operator
To view the schema of a relation, we use the describe operator.
 Syntax
So, the syntax of the describe operator is −

grunt> Describe Relation_name

 Example
Let’s suppose we have a file Employee_data.txt in HDFS. Its content is.

001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai

Also, using the LOAD operator, we have read it into a relation Employee:

grunt> Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee_data.txt'
       USING PigStorage(',')
       AS (id:int, firstname:chararray, lastname:chararray,
           phone:chararray, city:chararray);

Further, let's describe the relation named Employee and verify its schema.

grunt> describe Employee;

 Output

It will produce the following output, after execution of the above Pig Latin statement.

Employee: { id: int, firstname: chararray, lastname: chararray,
            phone: chararray, city: chararray }

c. Explanation operator
To display the logical, physical, and MapReduce execution plans of a relation, we use the explain
operator.
 Syntax
So, the syntax of the explain operator is-
grunt> explain Relation_name;

 Example
Let's suppose we have a file Employee_data.txt in HDFS. Its content is:

001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai

Also, using the LOAD operator, we have read it into a relation Employee:

grunt> Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee_data.txt'
       USING PigStorage(',')
       AS (id:int, firstname:chararray, lastname:chararray,
           phone:chararray, city:chararray);

Further, using the explain operator, let's explain the relation named Employee:

grunt> explain Employee;

d. Illustration operator
This operator gives you the step-by-step execution of a sequence of statements.
 Syntax
So, the syntax of the illustrate operator is:

grunt> illustrate Relation_name;
 Example
Let’s suppose we have a file Employee_data.txt in HDFS. Its content is:

001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai

Also, using the LOAD operator, we have read it into a relation Employee
grunt> Employee = LOAD
'hdfs://localhost:9000/pig_data/Employee_data.txt' USING
PigStorage(',')

as ( id:int, firstname:chararray, lastname:chararray,


phone:chararray, city:chararray );

Further, we illustrate the relation named Employee as follows:

grunt> illustrate Employee;

Output
We will get the following output, on executing the above statement.

grunt> illustrate Employee;

INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map -
Aliases being processed per job phase (AliasName[line,offset]): M: Employee[1,10] C:  R:

2. Grouping & Joining: Apache Pig Operators


There are 4 types of Grouping and Joining Operators. Such as:
a. Group Operator
b. Cogroup Operator
c. Join Operator
d. Cross operator

Let's discuss them in depth:
a. Group Operator
To group the data in one or more relations, we use the GROUP operator.
 Syntax
So, the syntax of the group operator is:

grunt> Group_data = GROUP Relation_name BY age;

 Group All
We can group a relation by all the columns.

grunt> group_all = GROUP Employee_details All;

b. Co-group Operator
It works more or less in the same way as the GROUP operator. The only difference is that we normally
use the GROUP operator with one relation, whereas we use the COGROUP operator in statements
involving two or more relations.

 Grouping Two Relations using Cogroup


Let’s suppose we have two files namely Employee_details.txt and Clients_details.txt in the HDFS
directory /pig_data/.

grunt> cogroup_data = COGROUP Employee_details BY age, Clients_details BY age;

 Verification
Using the DUMP operator, verify the relation cogroup_data.

grunt> Dump cogroup_data;
c. Join Operator
Basically, to combine records from two or more relations, we use the JOIN operator. A join is
performed by declaring one field (or a group of fields) from each relation as a key; when these keys
match, the two tuples are combined, otherwise the records are dropped.
There are several types of joins, such as:
Self-join
Inner-join
Outer-join: left join, right join, and full join
A sketch of an inner join is given below.
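As a small, hedged sketch (the relations Employee_details and Clients_details, their schemas and the
common id field are assumptions carried over from the COGROUP example), an inner join in Pig Latin
could be written as:

grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt'
       USING PigStorage(',') AS (id:int, firstname:chararray, age:int, city:chararray);
grunt> Clients_details = LOAD 'hdfs://localhost:9000/pig_data/Clients_details.txt'
       USING PigStorage(',') AS (id:int, name:chararray, age:int, city:chararray);
-- Inner join on the common id field; only tuples whose keys match are kept
grunt> joined_data = JOIN Employee_details BY id, Clients_details BY id;
grunt> Dump joined_data;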
d. Cross Operator
It computes the cross-product of two or more relations.
 Syntax
So, the syntax of the CROSS operator is:

grunt> Relation3_name = CROSS Relation1_name, Relation2_name;

Using the CROSS operator on the two relations Users and orders, let's get their cross-product.

grunt> cross_data = CROSS Users, orders;

 Verification
Now, using the DUMP operator, verify the relation cross_data.

grunt> Dump cross_data;
3. Combining & Splitting: Apache Pig Operators
These are of two types-

a. Union
b. Split

a. Union Operator
To merge the contents of two relations, we use the UNION operator of Pig Latin. Also, note that to
perform a UNION operation on two relations, their columns and domains must be identical.
 Syntax
So, the syntax of the UNION operator is:

grunt> Relation_name3 = UNION Relation_name1, Relation_name2;

Using the UNION operator, let's now merge the contents of the two relations Employee1 and Employee2.

grunt> Employee = UNION Employee1, Employee2;

 Verification
Now, using the DUMP operator, verify the relation Employee.

grunt> Dump Employee;

b. Split Operator
To split a relation into two or more relations, we use the SPLIT operator.
 Syntax
So, the syntax of the SPLIT operator is:

grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1),
       Relation3_name IF (condition2);

Now, let's split the relation into two:
First, listing the employees whose age is less than 23;
Second, listing the employees whose age is between 22 and 25.

SPLIT Employee_details INTO Employee_details1 IF age < 23,
      Employee_details2 IF (age > 22 AND age < 25);

 Verification
Using the DUMP operator, verify the relations Employee_details1 and Employee_details2.

grunt> Dump Employee_details1;

grunt> Dump Employee_details2;

4. Filtering: Apache Pig Operators

These are of 3 types:

a. Filter
b. Distinct
c. For Each

Now, let's discuss each in detail:

a. Filter Operator
To select the required tuples from a relation based on a condition, we use the FILTER operator.
 Syntax
So the syntax of the FILTER operator is

grunt> Relation2_name = FILTER Relation1_name BY (condition);


Now, to get the details of the employees who belong to the city Chennai, let's use the FILTER operator.

grunt> filter_data = FILTER Employee_details BY city == 'Chennai';

 Verification
Using the DUMP operator, verify the relation filter_data.

grunt> Dump filter_data;


 Output
By displaying the contents of the relation filter_data, it will produce the following output.

(6,Monika,Sharma,23,9848022335,Chennai)
(8,Roshan,Shaikh,24,9848022333,Chennai)
b. DISTINCT Operator

To remove redundant (duplicate) tuples from a relation, we use the DISTINCT operator.
 Syntax
So, the syntax of the DISTINCT operator is:

grunt> Relation_name2 = DISTINCT Relation_name1;

Now, using the DISTINCT operator, remove the redundant (duplicate) tuples from the relation named
Employee_details. Also, store it in another relation named distinct_data.

grunt> distinct_data = DISTINCT Employee_details;

 Verification
Using the DUMP operator, verify the relation distinct_data.

grunt> Dump distinct_data;

c. FOREACH Operator

To generate specified data transformations based on the column data, we use the FOREACH operator.
 Syntax
So, the syntax of the FOREACH operator is:

grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

Now, using the FOREACH operator, let us get the id, age, and city values of each Employee from the
relation Employee_details and store them in another relation named foreach_data.

grunt> foreach_data = FOREACH Employee_details GENERATE id, age, city;

 Verification
Also, using the DUMP operator, verify the relation foreach_data.

grunt> Dump foreach_data;

 Output
By displaying the contents of the relation foreach_data, it will produce the following output.

(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhubaneswar)
(6,23,Chennai)
(7,24,trivandrum)
(8,24,Chennai)

5. Sorting: Apache Pig Operators


These are of two types,

a. Order By
b. Limit

Let’s discuss both in detail:


a. ORDER BY operator
To display the contents of a relation in a sorted order based on one or more fields, we use the ORDER
BY operator.
 Syntax
So, the syntax of the ORDER BY operator is:

grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);

Now, on the basis of the age of the Employee, let's sort the relation in descending order. Then, using
the ORDER BY operator, store it into another relation named order_by_data.

grunt> order_by_data = ORDER Employee_details BY age DESC;

 Verification

grunt> Dump order_by_data;

Further, using the DUMP operator verify the relation order_by_data.


 Output
By displaying the contents of the relation order_by_data, it will produce the following output.
(8,Roshan,Shaikh,24,9848022333,Chennai)
(7,pulkit,pawar,24,9848022334,trivandrum)
(6,Monika,sharma,23,9848022335,Chennai)
(5,Sagar,Joshi,23,9848022336,Bhubaneswar)
(3,Shubham,Sengar,22,9848022339,Delhi)
(2,Ankur,Dutta,22,9848022338,Kolkata)
b. LIMIT operator
In order to get a limited number of tuples from a relation, we use the LIMIT operator.
 Syntax
So, the syntax of the LIMIT operator is:

grunt> Result = LIMIT Relation_name number_of_tuples;

Now, let's get only the first four tuples of the relation Employee_details. Then, using the LIMIT
operator, store them in another relation named limit_data.

grunt> limit_data = LIMIT Employee_details 4;

 Verification
Further, using the DUMP operator, verify the relation limit_data.

grunt> Dump limit_data;

 Output
By displaying the contents of the relation limit_data, it will produce the following output.
(1,mehul,chourey,21,9848022337,Hyderabad)
(2,Ankur,Dutta,22,9848022338,Kolkata)
(3,Shubham,Sengar,22,9848022339,Delhi)
(4,Prerna,Tripathi,21,9848022330,Pune)
HIVE SERVICES
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later, the Apache Software Foundation took it up and developed
it further as open source under the name Apache Hive. It is used by different companies. For example,
Amazon uses it in Amazon Elastic MapReduce.
Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive

These are the following features of Hive:

o Hive is fast and scalable.


o It provides SQL-like queries (i.e., HQL) that are implicitly transformed into MapReduce or
Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs), through which users can plug in their own functionality.

Architecture of Hive
The following component diagram depicts the architecture of Hive.

Hive vs Pig:
 Hive is commonly used by data analysts, whereas Pig is commonly used by programmers.
 Hive follows SQL-like queries, whereas Pig follows a data-flow language.
 Hive can handle structured data, whereas Pig can handle semi-structured data.
 Hive works on the server side of an HDFS cluster, whereas Pig works on the client side of an HDFS cluster.
 Hive is slower than Pig; Pig is comparatively faster than Hive.

This component diagram contains different units. The following table describes each unit:

Unit Name: User Interface
Operation: Hive is a data warehouse infrastructure software that can create interaction between the user
and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD
Insight (on Windows Server).

Unit Name: Meta Store
Operation: Hive chooses respective database servers to store the schema or metadata of tables, databases,
columns in a table, their data types, and the HDFS mapping.

Unit Name: HiveQL Process Engine
Operation: HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the
replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce
program in Java, we can write a query for the MapReduce job and process it.

Unit Name: Execution Engine
Operation: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution
Engine. The execution engine processes the query and generates results the same as MapReduce results.
It uses the flavor of MapReduce.

Unit Name: HDFS or HBASE
Operation: The Hadoop Distributed File System (HDFS) or HBASE are the data storage techniques used
to store data in the file system.
Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with the Hadoop framework:

Step 1: Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database
driver such as JDBC, ODBC, etc.) to execute.

Step 2: Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and the
query plan or the requirement of the query.

Step 3: Get Metadata
The compiler sends a metadata request to the Metastore (any database).

Step 4: Send Metadata
The Metastore sends the metadata as a response to the compiler.

Step 5: Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and
compiling of the query is complete.

Step 6: Execute Plan
The driver sends the execute plan to the execution engine.

Step 7: Execute Job
Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the
JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data
node. Here, the query executes the MapReduce job.

Step 7.1: Metadata Ops
Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.

Step 8: Fetch Result
The execution engine receives the results from the Data nodes.

Step 9: Send Results
The execution engine sends those resultant values to the driver.

Step 10: Send Results
The driver sends the results to the Hive interfaces.
HIVE SERVICES :
The following are the services provided by Hive :
· Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries
and commands (see the sketch after this list).
· Hive Web User Interface: The Hive Web UI is just an alternative to the Hive CLI. It provides a web-
based GUI for executing Hive queries and commands.
· Hive Metastore: It is a central repository that stores all the structural information of the various
tables and partitions in the warehouse. It also includes the metadata of each column and its type
information, the serializers and deserializers used to read and write data, and the corresponding HDFS
files where the data is stored.
· Hive Server: It is referred to as the Apache Thrift Server. It accepts requests from different clients
and provides them to the Hive Driver.
· Hive Driver: It receives queries from different sources like the Web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
· Hive Compiler: The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce
jobs.
· Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of map-
reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order
of their dependencies.
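As a hedged usage sketch of the CLI and HiveServer services above (the host, port and user name shown
are assumptions, not values from this unit), the clients can be started from the operating-system shell:

$ hive                                                   # starts the classic Hive CLI shell
hive> SHOW DATABASES;

$ beeline -u jdbc:hive2://localhost:10000 -n hiveuser    # Beeline connects to HiveServer2 over JDBC
0: jdbc:hive2://localhost:10000> SHOW TABLES;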

MetaStore :
Hive metastore (HMS) is a service that stores Apache Hive and other metadata in a backend
RDBMS, such as MySQL or PostgreSQL.
Impala, Spark, Hive, and other services share the metastore.
The connections to and from HMS include HiveServer, Ranger, and the NameNode, which represents
HDFS.
Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or JDBC to HiveServer.
The HiveServer instance reads/writes data to HMS.
By default, redundant HMS operate in active/active mode.
The physical data resides in a backend RDBMS, one for HMS.
All connections are routed to a single RDBMS service at any given time. HMS talks to the NameNode
over thrift and functions as a client to HDFS. HMS connects directly to Ranger and the NameNode
(HDFS), and so does HiveServer.
One or more HMS instances on the backend can talk to other services, such as Ranger.
Comparison with Traditional Database :

RDBMS                                          HIVE
It is used to maintain a database.             It is used to maintain a data warehouse.
It uses SQL (Structured Query Language).       It uses HQL (Hive Query Language).
Schema is fixed in an RDBMS.                   Schema varies in it.
Normalized data is stored.                     Both normalized and de-normalized data are stored.
Tables in an RDBMS are sparse.                 Tables in Hive are dense.
It doesn't support partitioning.               It supports automatic partitioning.
No sharding method is used.                    The sharding method is used for partitioning.

Hive QL
What is Hive Query Language (HiveQL)?
Hive Query Language (HiveQL) is a query language in Apache Hive for processing and analyzing
structured data. It separates users from the complexity of Map Reduce programming. It reuses common
concepts from relational databases, such as tables, rows, columns, and schema, to ease learning. Hive
provides a CLI for writing Hive queries using the Hive Query Language (HiveQL).
Most interactions tend to take place over a command line interface (CLI). Generally, HiveQL syntax is
similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats which
are: TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).

Hive uses the Derby database for single-user metadata storage; for the multi-user or shared metadata
case, Hive uses MySQL.
HiveQL Built-in Operators
Hive provides Built-in operators for Data operations to be implemented on the tables present inside
Hive warehouse.

These operators are used for mathematical operations on operands, and they return a specific value as
per the logic applied.

Below are the main types of Built-in Operators in HiveQL:

Relational Operators
Arithmetic Operators
Logical Operators
Operators on Complex types
Complex type Constructors

Relational Operators in Hive SQL


We use relational operators for comparisons between two operands, such as equals, not equals,
less than, greater than, etc.
The operands can be of any primitive type.

The following table gives details of the relational operators and their usage in HiveQL:

X = Y         : TRUE if expression X is equivalent to expression Y, otherwise FALSE. (All primitive types)
X != Y        : TRUE if expression X is not equivalent to expression Y, otherwise FALSE. (All primitive types)
X < Y         : TRUE if expression X is less than expression Y, otherwise FALSE. (All primitive types)
X <= Y        : TRUE if expression X is less than or equal to expression Y, otherwise FALSE. (All primitive types)
X > Y         : TRUE if expression X is greater than expression Y, otherwise FALSE. (All primitive types)
X >= Y        : TRUE if expression X is greater than or equal to expression Y, otherwise FALSE. (All primitive types)
X IS NULL     : TRUE if expression X evaluates to NULL, otherwise FALSE. (All types)
X IS NOT NULL : FALSE if expression X evaluates to NULL, otherwise TRUE. (All types)
X LIKE Y      : TRUE if string pattern X matches Y, otherwise FALSE. (Strings only)
X RLIKE Y     : NULL if X or Y is NULL; TRUE if any substring of X matches the Java regular
                expression Y, otherwise FALSE. (Strings only)
X REGEXP Y    : Same as RLIKE. (Strings only)

HiveQL Arithmetic Operators


We use arithmetic operators for performing arithmetic operations on operands.

We use these operators for arithmetic operations such as addition, subtraction, multiplication and
division between operands.
The operand types are all number types for these operators.

Sample example: 2 + 3 gives the result 5.

In this example, '+' is the operator and 2 and 3 are the operands. The return value is 5.

The following table gives details of the arithmetic operators in Hive Query Language:
X + Y : Returns the result of adding X and Y. (All number types)
X - Y : Returns the result of subtracting Y from X. (All number types)
X * Y : Returns the result of multiplying X and Y. (All number types)
X / Y : Returns the result of dividing X by Y. (All number types)
X % Y : Returns the remainder resulting from dividing X by Y. (All number types)
X & Y : Returns the result of bitwise AND of X and Y. (All number types)
X | Y : Returns the result of bitwise OR of X and Y. (All number types)
X ^ Y : Returns the result of bitwise XOR of X and Y. (All number types)
~X    : Returns the result of bitwise NOT of X. (All number types)
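As a brief sketch (the marks table and its columns sub1 and sub2 are hypothetical), these operators can
be combined in a single SELECT:

-- total, average and remainder computed with arithmetic operators
SELECT roll_number,
       sub1 + sub2       AS total,
       (sub1 + sub2) / 2 AS average,
       sub1 % 10         AS last_digit
FROM marks;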

Hive QL Logical Operators


We use logical operators for performing logical operations on operands.

We use these operators for logical operations such as AND, OR and NOT between operands.
The operand types are all BOOLEAN for these operators.

The following table gives details of the logical operators in Hive SQL:

X AND Y : TRUE if both X and Y are TRUE, otherwise FALSE. (Boolean types only)
X && Y  : Same as X AND Y, but using the && symbol. (Boolean types only)
X OR Y  : TRUE if either X or Y or both are TRUE, otherwise FALSE. (Boolean types only)
X || Y  : Same as X OR Y, but using the || symbol. (Boolean types only)
NOT X   : TRUE if X is FALSE, otherwise FALSE. (Boolean types only)
!X      : Same as NOT X, but using the ! symbol. (Boolean types only)
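A short, hedged sketch combining the relational and logical operators above in a WHERE clause (the
Employee_details table and its columns are assumptions carried over from the earlier examples):

-- selects employees aged 22 to 24, or living in Chennai, or whose first name starts with 'S'
SELECT id, firstname, city
FROM Employee_details
WHERE (age >= 22 AND age < 25)
   OR city = 'Chennai'
   OR firstname LIKE 'S%';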
Operators on Complex Types
The following table gives details of the operators on complex types. These operators provide a
mechanism to access elements stored in complex types.

A[n]   : A is an Array and n is an int. Returns the nth element of the array A; the first element has index 0.
M[key] : M is a Map<K, V> and key has type K. Returns the value belonging to the key in the map M.
Complex Type Constructors

The following table gives details of the complex type constructors. They construct instances of complex
data types, such as the Array, Map and Struct types in Hive.

In this section, we are going to see the operations performed with complex type constructors.

array(val1, val2, ...)                      : Creates an array with the given elements, e.g. val1, val2.
create_union(tag, val1, val2, ...)          : Creates a union type with the value indicated by the tag parameter.
map(key1, value1, key2, value2, ...)        : Creates a map with the given key/value pairs mentioned in
                                              the operands.
named_struct(name1, val1, name2, val2, ...) : Creates a struct with the given field names and values
                                              mentioned in the operands.
struct(val1, val2, val3, ...)               : Creates a struct with the given field values; the struct field
                                              names will be col1, col2, ...
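A minimal, hedged sketch of these constructors inside a SELECT (the students table and the literal
values are hypothetical):

SELECT
  array('maths', 'physics')                            AS subjects,   -- ARRAY constructor
  map('maths', 80, 'physics', 75)                      AS marks,      -- MAP constructor
  named_struct('street', 'MG Road', 'city', 'Chennai') AS address     -- STRUCT with named fields
FROM students
LIMIT 1;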
QUERYING DATA IN HIVE:

HiveQL is a SQL-like dialect used for summarizing and querying large chunks of data through the
Hadoop environment. Hive is widely used by big data professionals to alter, create and drop tables,
databases, views or user-defined functions. Some of the data definition language (DDL) statements used
to load data and modify it in the database are CREATE, ALTER, SHOW, DESCRIBE, DESCRIBE
FORMATTED, DROP and TRUNCATE.
Data Types in Hive
As in relational databases, Hive supports most of the primitive data types and also three collection
data types.
Primitive data types are Integer, Boolean, Float, Double, String, Timestamp and Binary.
Within Integer, Hive supports varying sizes like tinyint, smallint, int and bigint.
The collection data types are structs, maps and arrays.
A struct is analogous to a C struct; "dot" notation is used to access its fields.
A map is a collection of key-value tuples; bracket notation (e.g. M['key']) is used to access its values.
An array is a collection of elements of the same data type, which can be accessed using zero-based
integer indexes. A sketch of a table using these types is shown below.
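As a hedged sketch of these data types (the student table, its columns and the delimiters are
assumptions), a Hive table can combine all three collection types, and the fields can then be read with
the access notation described above:

CREATE TABLE student (
  id       INT,
  name     STRING,
  subjects ARRAY<STRING>,
  marks    MAP<STRING, INT>,
  address  STRUCT<street:STRING, city:STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '$'
MAP KEYS TERMINATED BY '#';

-- array index, map key and struct "dot" access
SELECT name, subjects[0], marks['maths'], address.city FROM student;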
Types of HiveQL Queries
Given below are the types of HiveQL queries that are widely used:

1. HiveQL query for information_schema database


Hive queries can be written to get information about Hive privileges, tables, views or columns.
Information_schema data is a read-only and user-friendly way to know the state of the system similar to
sys database data.
Example:

Code:

Select * from information_schema.columns where table_schema = 'database_name';

This will retrieve all the columns of the tables in the database specified.

2. Creation and loading of data into a table


The bulk load operation is used to insert data into managed tables as Hive does not support row-level
insert, delete or update.

Code:

LOAD DATA LOCAL INPATH '$Home/students_address' OVERWRITE INTO TABLE students
PARTITION (class = "12", section = "science");

With the above command, a directory is first created for the partition, and then all the files are copied
into the directory. The keyword "LOCAL" is used to specify that the data is present in the local file
system. The "PARTITION" keyword can be omitted if the table does not have a partition key. The Hive
query will not check whether the data being loaded matches the schema of the table.
The "INSERT" command is used to load data from a query into a table. The "OVERWRITE" keyword is
used to replace the data in a table. In Hive v0.8.0 or later, data will get appended into a table if the
OVERWRITE keyword is omitted.
Code:

INSERT OVERWRITE TABLE students
PARTITION (class = "12", section = "science")
Select * from students_data where class = "12" and section = "science";

All the partitions of the table students_data can be dynamically inserted by setting the properties
below:

Set hive.exec.dynamic.partition = true;
Set hive.exec.dynamic.partition.mode = nonstrict;
Set hive.exec.max.dynamic.partitions.pernode = 1000;
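With those properties set, a hedged sketch of a dynamic-partition insert looks as follows (the column
names are assumptions; the partition columns must come last in the SELECT list):

INSERT OVERWRITE TABLE students
PARTITION (class, section)
SELECT roll_number, name, class, section FROM students_data;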

A CREATE TABLE ... AS SELECT (CTAS) clause will also create a table, with the schema taken from the
SELECT clause; a brief sketch is given below.
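A minimal CTAS sketch, reusing the assumed students_data table:

CREATE TABLE science_students
AS
SELECT roll_number, name
FROM students_data
WHERE section = "science";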

3. Merge data in tables


Data can be merged from tables using classic SQL joins like inner, full outer, left, right join.
Code:
Select a.roll_number, class, section from students as a
inner join pass_table as b
on a.roll_number = b.roll_number;
This will return class and section of all the roll numbers who have passed. Using a left join to this will
return the “grade” for only pass students and “NULL” for the failed ones.

Code:

Select a.roll_number, class, section, b.grade from students as a
left join pass_table as b
on a.roll_number = b.roll_number;

UNION ALL and UNION are also used to append data present in two tables. However, a few things need
to be taken care of when doing so; for example, the schemas of both tables should be the same. UNION
is used to append the tables and return unique records, while UNION ALL returns all the records,
including duplicates.
4. Ordering a table
The ORDER BY clause enables total ordering of the data set by passing all the data through one reducer.
This may take a long time for large data tables, so the SORT BY clause can be used to achieve partial
sorting, by sorting the output of each reducer.

Code:

Select customer_id, spends from customer order by spends DESC limit 100;

This will return the top 100 customers with the highest spends.
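By contrast, a hedged SORT BY sketch on the same assumed customer table sorts only within each
reducer, and DISTRIBUTE BY can be added to control which reducer each customer_id goes to:

-- partial (per-reducer) sort
Select customer_id, spends from customer SORT BY spends DESC;

-- route rows by customer_id, then sort within each reducer
Select customer_id, spends from customer
DISTRIBUTE BY customer_id
SORT BY spends DESC;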

5. Aggregation of data in a table


Aggregation is done using aggregate functions that return a single value after doing a computation on
many rows. These include count(col), sum(col), avg(col), min(col), max(col), stddev_pop(col),
percentile_approx(int_expr, P, NB) (where NB is the number of histogram bins used for estimation),
and collect_set(col), which returns a set of the column's elements with duplicates removed.
The set property which helps in improving the performance of aggregation is hive.map.aggr = true.

“GROUP BY” clause is used with an aggregate function.

Example:

Code:

Select year(date_yy), avg(spends) from customer_spends
where merchant = "Retail"
group by year(date_yy);

The HAVING clause is used to restrict the output of a GROUP BY; without it, the same restriction would
require a subquery. A brief sketch is given below.
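A hedged HAVING sketch on the assumed customer_spends table used above (the threshold value is
arbitrary):

Select year(date_yy) as yr, avg(spends) as avg_spend
from customer_spends
where merchant = "Retail"
group by year(date_yy)
having avg(spends) > 1000;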
6. Conditional statements
The CASE ... WHEN ... THEN clause is similar to an if-else statement and performs a conditional operation
on any column in a query.

For example:
Code:

Select customer,
Case when percentage < 40 then "Fail"
     when percentage >= 40 and percentage < 80 then "Average"
     else "Excellent"
End as rank
From students;

7. Filtering of data
The WHERE clause is used to filter data in HiveQL. LIKE is used along with the WHERE clause as a
predicate operator to match a pattern in a record.

8. Way to escape an illegal identifier


There is a way to use special characters, keywords or spaces in column or partition names, by
enclosing them in backticks ( ` ).

Comments in Hive Scripts:

There is a way to add comment lines to a Hive script by starting the line with the string --.
Below is the code to display students data.

Code:

-- display students data
Select * from student_table;

Such comments only work in scripts; if pasted into the CLI, error messages will be displayed.

HiveQL supports a number of file formats like TEXTFILE, Parquet, etc. The ORC (Optimized Row
Columnar) format can support tables up to 300 PB, and Hive supports ANSI SQL and ACID (Atomic,
Consistent, Isolated and Durable) transactions.

FUNDAMENTALS OF HBASE AND ZOOKEEPER


What is HBase

HBase is an open-source, sorted-map data store built on top of Hadoop. It is column-oriented and
horizontally scalable.

It is based on Google's Bigtable. It has a set of tables which keep data in key-value format. HBase is well
suited for sparse data sets, which are very common in big data use cases. HBase provides APIs enabling
development in practically any programming language. It is a part of the Hadoop ecosystem that
provides random real-time read/write access to data in the Hadoop File System.

Why HBase

 An RDBMS gets exponentially slower as the data becomes large
 It expects data to be highly structured, i.e. able to fit into a well-defined schema
 Any change in schema might require downtime
 For sparse datasets, there is too much overhead in maintaining NULL values

Features of HBase

 Horizontally scalable: the cluster can be scaled by adding nodes, and any number of columns can be
added at any time.
 Automatic failover: automatic failover is a facility that allows a system administrator to automatically
switch data handling to a standby system in the event of a system failure.
 Integration with the MapReduce framework: all the commands and Java code internally use MapReduce
to do their work, and HBase is built over the Hadoop Distributed File System.
 It is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key, column
key, and timestamp.
 It is often referred to as a key-value store, a column-family-oriented database, or as storing versioned
maps of maps.
 Fundamentally, it is a platform for storing and retrieving data with random access.
 It doesn't care about data types (you can store an integer in one row and a string in another for the same
column).
 It doesn't enforce relationships within your data.
 It is designed to run on a cluster of computers built using commodity hardware.
HBase Data Model

HBase Read

A read against HBase must be reconciled between the HFiles, the MemStore and the BlockCache. The
BlockCache is designed to keep frequently accessed data from the HFiles in memory so as to avoid disk
reads. Each column family has its own BlockCache. The BlockCache holds data in the form of 'blocks',
the unit of data that HBase reads from disk in a single pass.
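As a hedged sketch of the key-value model described above (the table name Employee and the column
family personal are hypothetical), the basic operations from the HBase shell look like this:

hbase(main):001:0> create 'Employee', 'personal'
hbase(main):002:0> put 'Employee', 'row1', 'personal:name', 'Mehul'
hbase(main):003:0> put 'Employee', 'row1', 'personal:city', 'Hyderabad'
hbase(main):004:0> get 'Employee', 'row1'
hbase(main):005:0> scan 'Employee'

Each put stores a value under (row key, column family:column qualifier, timestamp), and get/scan read
values back by row key.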

ZooKeeper is a distributed co-ordination service to manage large set of hosts. Co-ordinating


and managing a service in a distributed environment is a complicated process. ZooKeeper solves this
issue with its simple architecture and API. ZooKeeper allows developers to focus on core application
logic without worrying about the distributed nature of the application.

The ZooKeeper framework was originally built at “Yahoo!” for accessing their applications in an
easy and robust manner. Later, Apache ZooKeeper became a standard for organized service used by
Hadoop, HBase, and other distributed frameworks. For example, Apache HBase uses ZooKeeper to
track the status of distributed data. This section explains the basics of ZooKeeper and its fundamental
concepts.

What is Apache ZooKeeper Meant For?

Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate between


themselves and maintain shared data with robust synchronization techniques. ZooKeeper is itself a
distributed application providing services for writing a distributed application.

The common services provided by ZooKeeper are as follows −


 Naming service − Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.
 Configuration management − Latest and up-to-date configuration information of the system for
a joining node.
 Cluster management − Joining / leaving of a node in a cluster and node status at real time.
 Leader election − Electing a node as leader for coordination purpose.
 Locking and synchronization service − Locking the data while modifying it. This mechanism
helps you in automatic fail recovery while connecting other distributed applications like Apache
HBase.
 Highly reliable data registry − Availability of data even when one or a few nodes are down.
Distributed applications offer a lot of benefits, but they also pose a few complex and hard-to-crack
challenges. The ZooKeeper framework provides a complete mechanism to overcome all these challenges.
Race conditions and deadlock are handled using a fail-safe synchronization approach. Another main
drawback is inconsistency of data, which ZooKeeper resolves with atomicity.

Benefits of ZooKeeper

Here are the benefits of using ZooKeeper −

 Simple distributed coordination process


 Synchronization − Mutual exclusion and co-operation between server processes. This process
helps in Apache HBase for configuration management.
 Ordered Messages
 Serialization − Encode the data according to specific rules. Ensure your application runs
consistently. This approach can be used in MapReduce to coordinate queue to execute running
threads.
 Reliability
 Atomicity − Data transfer either succeed or fail completely, but no transaction is partial.

Before going deep into the working of ZooKeeper, let us take a look at the fundamental concepts of
ZooKeeper. We will discuss the following topics in this chapter −

 Architecture
 Hierarchical namespace
 Session
 Watches

Architecture of ZooKeeper

Take a look at the following diagram. It depicts the “Client-Server Architecture” of ZooKeeper.

Each one of the components that is a part of the ZooKeeper architecture has been explained in the
following table.
Part: Client
Clients, one of the nodes in our distributed application cluster, access information from the server. For a
particular time interval, every client sends a message to the server to let the server know that the client is
alive. Similarly, the server sends an acknowledgement when a client connects. If there is no response from
the connected server, the client automatically redirects the message to another server.

Part: Server
Server, one of the nodes in our ZooKeeper ensemble, provides all the services to clients. Gives an
acknowledgement to the client to inform it that the server is alive.

Part: Ensemble
Group of ZooKeeper servers. The minimum number of nodes required to form an ensemble is 3.

Part: Leader
Server node which performs automatic recovery if any of the connected nodes fails. Leaders are elected on
service startup.

Part: Follower
Server node which follows the leader's instructions.

Hierarchical Namespace

The following diagram depicts the tree structure of the ZooKeeper file system used for its in-memory
representation. A ZooKeeper node is referred to as a znode. Every znode is identified by a name and
separated by a sequence of path separators (/).

 In the diagram, first you have a root znode separated by “/”. Under root, you have two logical
namespaces config and workers.
 The config namespace is used for centralized configuration management and the workers
namespace is used for naming.
 Under config namespace, each znode can store upto 1MB of data. This is similar to UNIX file
system except that the parent znode can store data as well. The main purpose of this structure is to
store synchronized data and describe the metadata of the znode. This structure is called as
ZooKeeper Data Model.

Every znode in the ZooKeeper data model maintains a stat structure. A stat simply provides the metadata
of a znode. It consists of Version number, Action control list (ACL), Timestamp, and Data length.

 Version number − Every znode has a version number, which means that every time the data associated
with the znode changes, its corresponding version number also increases. The version number is
important when multiple ZooKeeper clients are trying to perform operations on the same znode.
 Action Control List (ACL) − An ACL is basically an authentication mechanism for accessing the
znode. It governs all znode read and write operations.
 Timestamp − The timestamp represents the time elapsed since znode creation and modification. It is
usually represented in milliseconds. ZooKeeper identifies every change to a znode with a transaction ID
(zxid). The zxid is unique and maintains the time of each transaction, so that you can easily identify the
time elapsed from one request to another.
 Data length − The total amount of data stored in a znode is the data length. You can store a
maximum of 1 MB of data (a usage sketch follows this list).
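As a hedged sketch of working with znodes (the server address, paths and data shown are hypothetical),
the ZooKeeper command-line client can create and inspect znodes under the config namespace described
above:

$ zkCli.sh -server localhost:2181
[zk: localhost:2181(CONNECTED) 0] create /config/app1 "timeout=30"
[zk: localhost:2181(CONNECTED) 1] get /config/app1
[zk: localhost:2181(CONNECTED) 2] set /config/app1 "timeout=60"
[zk: localhost:2181(CONNECTED) 3] ls /config

Each change to /config/app1 increments the znode's version number in its stat structure.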

IBM INFOSPHERE BIGINSIGHTS AND STREAMS.

The name of the base system (platform): Apache Hadoop

Developer: IBM
Technology: BI, Big Data, Data Mining

At the end of 2011, IBM released the InfoSphere BigInsights and InfoSphere Streams software, which
allows clients to quickly gain insight into the information streams in the areas of interest to their business.

The BigInsights approach

BigInsights is a data analysis platform that allows companies to turn complex, Internet-scale data sets
into knowledge. The platform includes an easily installed Apache Hadoop distribution, together with a
set of related tools needed for application development, data transfer and cluster management. Thanks to
its simplicity and scalability, Hadoop, an open-source implementation of the MapReduce infrastructure,
has earned wide recognition in different industries and sciences. In addition to Hadoop, the following
open-source technologies are part of BigInsights (all of them, except for Jaql, are Apache Software
Foundation projects):

 Pig is a platform that includes a high-level language for describing programs that analyze large data
sets. Pig includes a compiler that transforms Pig applications into sequences of MapReduce tasks executed
in the Hadoop environment.
 Hive is a data warehousing solution built on top of the Hadoop environment. It implements the familiar
principles of relational databases, such as tables, columns and partitions, and provides a set of SQL
statements (HiveQL) for working in the unstructured Hadoop environment. Hive queries are compiled into
MapReduce tasks executed in the Hadoop environment.
 Jaql is a query language with an SQL-like interface, developed by IBM and intended for JavaScript
Object Notation (JSON). Jaql supports nesting well, is highly function-oriented and extremely flexible. The
language is well suited to working with loosely structured data; it also serves as an interface to HBase
column storage and is used for text analysis.
 HBase is a column-oriented, non-SQL data storage environment intended to support large, sparsely
populated tables in Hadoop.
 Flume is a distributed, reliable and available service intended for efficiently moving large volumes of
generated data. Flume is well suited to collecting event logs from several systems and moving them into
the Hadoop Distributed File System (HDFS) as they are generated.
 Lucene is a search-engine library providing high-performance, full-text search.
 Avro is a data serialization technology that uses JSON to define data types and protocols, and stores
data in a compact binary format.
 ZooKeeper is a centralized service intended for maintaining configuration information and naming; it
provides distributed synchronization and group services.
 Oozie is a workflow scheduling system intended for organizing and managing the execution of Apache
Hadoop jobs.

In addition to the above products, the BigInsights distribution includes the following IBM technologies:

 BigSheets is a spreadsheet-style browser interface intended for search and data analysis that uses the full
power of Hadoop; it allows users to collect and analyze data easily. It contains built-in data viewers able to
work with several widespread formats, including JSON, CSV (comma-separated values) and TSV
(tab-separated values).
 Text Analytics is a pre-built library of text annotators for common business entities. It contains a rich
language and tools for creating custom annotators, for example for locations.
 Adaptive MapReduce is a solution developed by IBM Research and intended to accelerate the execution
of small MapReduce tasks by changing the way they are processed.

The InfoSphere platform

InfoSphere is a comprehensive information integration platform that includes facilities for data storage
and analysis, information integration tools, master data management tools, lifecycle management tools,
and means of protecting data and ensuring its confidentiality. InfoSphere makes the application
development process more effective, allowing organizations to save time, reduce integration costs and
increase the quality of information.

BigInsights, being a part of the IBM Big Data platform, contains integration points with its other
components, including storage systems, data integration and management mechanisms, and third-party
tools for data analysis. BigInsights can be integrated with the InfoSphere Streams platform.

A new computing paradigm

Stream computing is a new paradigm whose need arises from new data-generation scenarios: the
ubiquitous use of mobile devices, location-based services and the widespread use of various sensors. All
of this has created an acute need for scalable computing platforms and parallel architectures capable of
processing huge volumes of generated stream data.

BigInsights technologies are not suitable for processing stream data in real time, as they are focused
mainly on batch processing of static data. When processing static data, the reply to the query "select all
users connected to the network" is a single resulting set of values. When processing stream data in real
time, it is possible to execute a continuous query, for example "select all users connected to the network
in the last 10 minutes"; such a query continuously updates its results. In the world of static data the user
looks for the proverbial needle in a haystack, whereas in the world of stream data he looks for the needle
while the wind is blowing the hay off the stack.

The IBM InfoSphere Streams platform supports processing of stream data in real time, providing periodic
updating of the results of continuous queries. The necessary knowledge can be extracted from data
streams while they are still in motion.
