Big Data Unit-5
2. Hive :
Hive is often used to store and manage structured data in data warehouses built on Hadoop.
It makes querying and analyzing easy.
It allows querying and managing large datasets using a SQL-like language called HiveQL.
It translates HiveQL into MapReduce, Tez, or Spark jobs under the hood.
It supports structured and semi-structured data.
It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Applications :
1. Ease of use
2. Streamlined security
3. Low overhead
4. Ideal for batch processing and aggregated data analysis.
5. Performs aggregations such as COUNT, MAX, MIN, and AVG over large datasets and
generates summary statistics for decision making.
6. BI tools like Tableau, Power BI, or QlikView can connect to Hive for visualization
and reporting.
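The aggregations in item 5 correspond to SQL-style aggregate functions. The sketch below uses Python's built-in sqlite3 module purely as a local stand-in for Hive, an assumption made so that the example runs without a cluster. The `sales` table, its columns, and the sample rows are hypothetical. A HiveQL query for the same summary would look very similar.

```python
import sqlite3

# sqlite3 is only a local stand-in for Hive here; the table and data are made up.
def summarize(rows):
    """Return (region, count, max, min, avg) per region, sorted by region."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    query = """
        SELECT region, COUNT(*), MAX(amount), MIN(amount), AVG(amount)
        FROM sales
        GROUP BY region
        ORDER BY region
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

sample = [("east", 10.0), ("east", 30.0), ("west", 5.0)]
summary = summarize(sample)
for row in summary:
    print(row)
```

Each output row is one line of the kind of summary report a BI tool would chart.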
3. HBase :
HBase is a column-oriented non-relational database management system that runs on
top of the Hadoop Distributed File System (HDFS).
HBase provides a fault-tolerant way of storing sparse data sets, which are common in
many big data use cases.
HBase supports writing applications using Apache Avro, REST, and Thrift.
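To illustrate what "sparse" and "column-oriented" mean here, the following is a toy in-memory model in Python. It is not the HBase API: the class `SparseTable`, its methods, and the sample rows are all hypothetical. It only shows the idea that each row stores just the columns it actually has, and that a cell can hold several timestamped versions.

```python
from collections import defaultdict

class SparseTable:
    """Toy model of a sparse, column-oriented table (not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        """Store a new version of a cell."""
        self._rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        if row_key not in self._rows or column not in self._rows[row_key]:
            return None
        versions = self._rows[row_key][column]
        return max(versions, key=lambda tv: tv[0])[1]

    def columns(self, row_key):
        """List only the columns that this row actually stores."""
        if row_key not in self._rows:
            return []
        return sorted(self._rows[row_key])

table = SparseTable()
table.put("user1", "info:name", "Asha", timestamp=1)
table.put("user1", "info:name", "Asha K", timestamp=2)
table.put("user2", "stats:likes", 42, timestamp=1)
print(table.get("user1", "info:name"))
print(table.columns("user2"))
```

Note that `user2` takes up no space for `info:name`: absent cells are simply not stored.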
Application :
1. Used when you need real-time updates, unlike Hive, which is batch-oriented.
2. Well suited to storing sensor data, logs, or metrics with timestamps.
3. Stores user profiles, posts, likes, shares, and comments.
4. Handles fast reads/writes for high-traffic platforms like Facebook or Twitter.
5. Stores chat history, delivery receipts, and notification logs.
PIG
Introduction to PIG :
o Pig is a high-level platform or tool which is used to process large datasets.
It provides a high level of abstraction for processing over MapReduce.
(High abstraction in Pig means you don’t write the logic for low-level
execution (like in MapReduce). Instead, you write simple, SQL-like
commands and Pig does the rest for you — translating them into efficient
parallel jobs.)
o It provides a high-level scripting language, known as Pig Latin which is
used to develop the data analysis codes.
o Pig Latin and the Pig Engine are the two main components of the Apache Pig
tool. The results of a Pig job are typically stored in HDFS.
o One limitation of MapReduce is that the development cycle is very
long. Writing the mapper and reducer, compiling and packaging the
code, submitting the job, and retrieving the output is a
time-consuming task.
o Apache Pig reduces the time of development using the multi-query
approach.
o Pig is beneficial for programmers who do not come from a Java background.
Roughly 200 lines of Java code can often be expressed in about 10 lines of
Pig Latin.
o Programmers who have SQL knowledge need less effort to learn Pig
Latin.
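To show what Pig abstracts away, here is a minimal sketch of a word count with the map, shuffle, and reduce phases spelled out by hand. It is plain Python, not the Hadoop API, and the function names and sample input are assumptions made for illustration. The comment at the top shows a commonly used Pig Latin form of the same job (the file name is hypothetical).

```python
from collections import defaultdict

# Roughly the same job in Pig Latin:
#   lines   = LOAD 'input.txt' AS (line:chararray);
#   words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
#   grouped = GROUP words BY word;
#   counts  = FOREACH grouped GENERATE group, COUNT(words);
#   DUMP counts;

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the collected values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))

counts = word_count(["big data", "big tools for big data"])
print(counts)
```

With Pig, the three phase functions are never written by the programmer; the Pig Engine derives them from the script.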
Grunt :
Grunt is Pig's interactive shell. This is where you can type commands like
LOAD, DUMP, DESCRIBE, ILLUSTRATE, etc.
Syntax of sh command (runs a shell command from within Grunt) :
grunt> sh ls
Syntax of fs command (runs a Hadoop file system command from within Grunt) :
grunt> fs -ls
Pig Latin :
Pig Latin is a data flow language used by Apache Pig to analyze data in
Hadoop.
It is a textual language that abstracts the Java MapReduce programming
idiom into a higher-level notation.
Pig Latin statements are used to process the data.
Each statement is an operator that accepts a relation as input and generates
another relation as output. A statement has the following properties:
· It can span multiple lines.
· Each statement must end with a semi-colon.
· It may include expressions and schemas.
· By default, these statements are processed using multi-query execution.
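The "relation in, relation out" idea can be sketched in Python by modelling a relation as a list of tuples and each operator as a function. This is an illustration only, not how the Pig Engine is implemented: the helper names, the field names `user` and `hits`, and the sample data are all hypothetical. The trailing comments show the kind of Pig Latin statement each call stands for.

```python
from collections import defaultdict

def load(lines):
    """LOAD: turn 'user,hits' text lines into a relation of (user, hits) tuples."""
    relation = []
    for line in lines:
        user, hits = line.split(",")
        relation.append((user, int(hits)))
    return relation

def filter_by(relation, predicate):
    """FILTER: keep only the tuples for which predicate is true."""
    return [t for t in relation if predicate(t)]

def group_by(relation, key_index):
    """GROUP: return (key, bag) pairs; each bag lists the tuples sharing that key."""
    bags = defaultdict(list)
    for t in relation:
        bags[t[key_index]].append(t)
    return sorted(bags.items())

def foreach_sum(grouped, value_index):
    """FOREACH ... GENERATE group, SUM(...): one output tuple per group."""
    return [(key, sum(t[value_index] for t in bag)) for key, bag in grouped]

raw = ["ann,120", "bob,80", "ann,300", "bob,150"]
records = load(raw)                                # records = LOAD ... AS (user, hits);
large = filter_by(records, lambda t: t[1] > 100)   # large = FILTER records BY hits > 100;
grouped = group_by(large, 0)                       # grouped = GROUP large BY user;
totals = foreach_sum(grouped, 1)                   # totals = FOREACH grouped GENERATE group, SUM(large.hits);
print(totals)
```

One difference worth noting: this sketch runs each step eagerly, whereas Pig builds up a plan and only executes it when a statement such as DUMP or STORE asks for output.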
User-Defined Functions :
Apache Pig provides extensive support for User Defined
Functions (UDFs).
Using these UDFs, we can define our own functions and use them. UDF
support is provided in six programming languages:
· Java
· Jython
· Python
· JavaScript
· Ruby
· Groovy
For writing UDFs, complete support is provided in Java and limited
support is provided in the remaining languages.
Using Java, you can write UDFs involving all parts of the processing, such as
data load/store, column transformation, and aggregation.
Since Apache Pig itself is written in Java, UDFs written in Java
run more efficiently than those written in the other languages.
Types of UDFs in Java :
Filter Functions :
• The filter functions are used as conditions in filter statements.
• These functions accept a Pig value as input and return a Boolean value.
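The contract of a filter function (one value in, a Boolean out) can be sketched in plain Python. The function `is_adult`, the threshold, and the sample values are hypothetical, and the code below is not registered with Pig; it only mirrors the behaviour described above.

```python
def is_adult(age):
    """Filter-style function: takes one value and returns True or False."""
    # Treating a missing value as "no match" is a choice made for this sketch.
    if age is None:
        return False
    return age >= 18

ages = [25, None, 17, 18, 40]

# Used as the condition of a filter step: only values that return True survive.
adults = [age for age in ages if is_adult(age)]
print(adults)
```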
Eval Functions :
Hive
Apache Hive Architecture :
The above figure shows the architecture of Apache Hive and its major components.
The major components of Apache Hive are :
1. Hive Client
2. Hive Services
3. Processing and Resource Management
4. Distributed Storage
HIVE CLIENT :
Hive supports applications written in languages such as Python, Java, C++, and Ruby,
which use JDBC, ODBC, and Thrift drivers to run queries on Hive. Hence, one
can write a Hive client application in the language of one's choice.
Hive clients are categorized into three types :
1. Thrift Clients : The Hive server is based on Apache Thrift, so it can serve
requests from a Thrift client.
2. JDBC client : Hive allows Java applications to connect to it using the JDBC
driver. The JDBC driver uses Thrift to communicate with the Hive Server.
3. ODBC client : The Hive ODBC driver allows applications based on the ODBC protocol
to connect to Hive. Similar to the JDBC driver, the ODBC driver uses Thrift to
communicate with the Hive Server.
HIVE SERVICE :
To execute queries, Hive provides various services such as HiveServer2, Beeline,
etc.
The various services offered by Hive are :
1. Beeline
2. Hive Server 2
3. Hive Driver
4. Hive Compiler
5. Optimizer
6. Metastore
DISTRIBUTED STORAGE :
Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File
System for the distributed storage.
Hive Shell :
Hive shell is the primary way to interact with Hive.
It is a default service in Hive.
It is also called the CLI (command line interface).
Hive shell is similar to the MySQL shell.
Hive users can run HQL queries in the Hive shell.
In the Hive shell, the up and down arrow keys scroll through previous
commands.
HiveQL is case-insensitive (except for string comparisons).
The tab key autocompletes Hive keywords and functions (it offers suggestions
while you type).
Hive Shell can run in two modes :
Non-Interactive mode :
Non-interactive mode means running the commands stored in a script file instead of
typing them at the prompt.
Hive Shell runs in non-interactive mode with the -f option.
Example:
$ hive -f script.q
where script.q is a file containing the commands to run.
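As a sketch of how non-interactive mode might be driven from a program, the Python below writes statements to a script file and builds the `hive -f` command line. The helper names and the temporary directory are assumptions for illustration; the command is only executed if a `hive` executable is found on the PATH, and otherwise it is just printed.

```python
import os
import shutil
import subprocess
import tempfile

def write_script(statements, directory):
    """Write the statements to script.q, one per line, and return the file path."""
    path = os.path.join(directory, "script.q")
    with open(path, "w", encoding="utf-8") as handle:
        for statement in statements:
            # Make sure every statement ends with exactly one semicolon.
            handle.write(statement.rstrip(";") + ";\n")
    return path

def build_command(script_path):
    """Build the argument list for: hive -f <script_path>"""
    return ["hive", "-f", script_path]

workdir = tempfile.mkdtemp()
script_path = write_script(["show databases"], workdir)
command = build_command(script_path)

# Only attempt to run when a hive executable is actually available.
if shutil.which("hive"):
    subprocess.run(command, check=False)
else:
    print("hive not found; would run:", " ".join(command))
```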
Interactive mode :
Hive starts in interactive mode when you type the command “hive” directly in the
terminal.
Example:
$ hive
hive> show databases;
Hive Services :
The following are the services provided by Hive :
• Hive CLI (Beeline): The Hive CLI (Command Line Interface) is a shell where we
can execute Hive queries and commands.
• Hive Web User Interface: The Hive Web UI is an alternative to the Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
• Hive metastore: It is a central repository that stores the structure information of
the tables and partitions in the warehouse. It also includes metadata about each
column and its type, the serializers and deserializers used to read and write data,
and the corresponding HDFS files where the data is stored.
• Hive Server: It is referred to as the Apache Thrift Server. It accepts requests from
different clients and passes them to the Hive Driver.
• Hive Driver: It receives queries from different sources such as the web UI, CLI,
Thrift, and JDBC/ODBC drivers. It transfers the queries to the compiler.
• Hive Compiler: The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts HiveQL
statements into MapReduce jobs.
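The hand-off between driver, compiler, and metastore described above can be sketched as a toy in Python. Nothing here is Hive's real code or API: the `METASTORE` dictionary, the function names, and the plan format are hypothetical, and real query parsing is skipped by passing the table and column names in directly. The point is only that the compiler checks the query against the metastore before producing a plan.

```python
# Toy metastore: table name -> {column name: type}. The contents are made up.
METASTORE = {"employees": {"name": "string", "salary": "int"}}

def compile_query(table, columns, metastore):
    """Stand-in for the compiler: validate names against the metastore, return a plan."""
    if table not in metastore:
        raise ValueError(f"unknown table: {table}")
    schema = metastore[table]
    missing = [column for column in columns if column not in schema]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    # A real compiler would emit jobs; this sketch returns a small description.
    return {"scan": table, "project": list(columns)}

def driver(table, columns):
    """Stand-in for the driver: receive a request and hand it to the compiler."""
    return compile_query(table, columns, METASTORE)

plan = driver("employees", ["name"])
print(plan)
```

A request for a column that the metastore does not list, such as `driver("employees", ["age"])`, fails during compilation, before any job would be launched.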