Bda 06
Theory
Apache Hive
Apache Hive is a data warehouse and ETL tool that provides an SQL-like interface between
the user and the Hadoop Distributed File System (HDFS). It is built on top of Hadoop and
facilitates reading, writing, and managing large datasets stored in distributed storage,
queried using SQL-like syntax. It is not built for Online Transactional Processing (OLTP)
workloads; rather, it is frequently used for data warehousing tasks such as data
encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed for scalability,
extensibility, performance, fault tolerance, and loose coupling with its input formats.
Hive was initially developed by Facebook and was later adopted by companies such as Amazon
and Netflix. It delivers standard SQL functionality for analytics. Without Hive, such queries
would have to be written directly against the MapReduce Java API to execute over distributed
data. Hive also provides portability, since most data warehousing applications work with
SQL-based query languages.
Apache Hive is a data warehouse software project that is built on top of the Hadoop ecosystem.
It provides an SQL-like interface to query and analyze large datasets stored in Hadoop’s
distributed file system (HDFS) or other compatible storage systems.
Hive uses a language called HiveQL, which is similar to SQL, to allow users to express data
queries, transformations, and analyses in a familiar syntax. HiveQL statements are compiled into
MapReduce jobs, which are then executed on the Hadoop cluster to process the data.
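As an illustration, a simple HiveQL query over a hypothetical `employees` table might look like the following sketch (the table and column names are assumptions for the example, not part of any real setup):

```sql
-- Hypothetical table; names are illustrative only.
CREATE TABLE IF NOT EXISTS employees (
  id INT,
  name STRING,
  dept STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- An aggregate query like this is compiled by Hive
-- into one or more MapReduce jobs on the cluster.
SELECT dept, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept
ORDER BY avg_salary DESC;
```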
Hive includes many features that make it a useful tool for big data analysis, including support for
partitioning, indexing, and user-defined functions (UDFs). It also provides a number of
optimization techniques to improve query performance, such as predicate pushdown, column
pruning, and query parallelization.
Hive can be used for a variety of data processing tasks, such as data warehousing, ETL (extract,
transform, load) pipelines, and ad-hoc data analysis. It is widely used in the big data industry,
especially in companies that have adopted the Hadoop ecosystem as their primary data
processing platform.
Components of Hive -
1. HCatalog –
HCatalog is a table and storage management layer for Hadoop. It enables users of
different data processing tools, such as Pig and MapReduce, to read and write data
on the grid easily.
2. WebHCat –
WebHCat provides a service that users can call to run Hadoop MapReduce (or YARN),
Pig, and Hive jobs, or to perform Hive metadata operations, through an HTTP interface.
Modes of Hive -
1. Local Mode –
It is used when Hadoop is installed in pseudo-distributed mode with only one data node,
when the data size is small and restricted to a single local machine, and when
processing of smaller datasets on the local machine will be faster.
2. MapReduce Mode –
It is used when Hadoop is built with multiple data nodes and the data is divided across
various nodes. It works on huge datasets, executes queries in parallel, and achieves
enhanced performance when processing large datasets.
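Within a Hive session, the execution mode can be switched with session-level settings. A minimal sketch (these property names follow standard Hadoop/Hive configuration):

```sql
-- Run jobs locally (suitable for small datasets on a single machine):
SET mapreduce.framework.name=local;

-- Or let Hive decide automatically based on input size:
SET hive.exec.mode.local.auto=true;

-- Run jobs on the cluster (the usual default in a multi-node setup):
SET mapreduce.framework.name=yarn;
```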
Features of Hive -
1. It provides indexes, including bitmap indexes, to accelerate queries. Compaction and
bitmap index types are available as of version 0.10.
2. Metadata is stored in an RDBMS, which reduces the time needed to perform semantic
checks during query execution.
3. It has built-in user-defined functions (UDFs) for manipulating strings, dates, and other
data types. Hive can be extended with custom UDFs to handle use cases not covered by
the predefined functions.
4. It can operate on compressed data stored in the Hadoop ecosystem using algorithms
such as DEFLATE, BWT, and Snappy.
5. It stores schemas in a database and processes the data in the Hadoop Distributed
File System (HDFS).
6. It is built for Online Analytical Processing (OLAP).
7. It delivers a querying language commonly known as Hive Query Language (HQL or HiveQL).
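For example, the built-in UDFs mentioned above can be called directly in queries; the table and column names below are hypothetical:

```sql
-- String and date manipulation with built-in Hive UDFs
SELECT upper(name),                              -- convert to upper case
       length(name),                             -- string length
       year(hire_date),                          -- extract the year from a date
       datediff(current_date, hire_date) AS days_employed
FROM employees;                                  -- hypothetical table
```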
Advantages -
1. Scalability -
Apache Hive is designed to handle large volumes of data, making it a scalable solution for
big data processing.
2. Ease of use -
Hive uses a SQL-like language called HiveQL, which makes it easy for SQL users to learn
and use.
3. Hadoop integration -
Hive integrates well with the Hadoop ecosystem, enabling users to process data using
other Hadoop tools like Pig, MapReduce, and Spark.
4. Partitioning and bucketing -
Hive supports partitioning and bucketing, which can improve query performance by
limiting the amount of data scanned.
5. User-defined functions -
Hive allows users to define their own functions, which can be used in HiveQL queries.
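A sketch of how the partitioning and bucketing mentioned above are declared (table and column names are assumptions for the example):

```sql
-- Partition by date so queries filtering on dt scan only matching directories;
-- bucket by user_id to spread rows across a fixed number of files.
CREATE TABLE logs (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Only the dt='2023-01-01' partition is scanned:
SELECT count(*) FROM logs WHERE dt = '2023-01-01';
```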
Disadvantages -
1. Limited real-time processing -
Hive is designed for batch processing, which means it may not be the best tool for
real-time data processing.
2. Slow performance -
Hive can be slower than traditional relational databases because it is built on top of
Hadoop, which is optimized for batch processing rather than interactive querying.
3. Steep learning curve -
While Hive uses a SQL-like language, it still requires users to have knowledge of Hadoop
and distributed computing, which can make it difficult for beginners to use.
4. Limited transaction support -
Hive has only limited support for transactions, which can make it difficult to maintain
data consistency.
5. Limited flexibility -
Hive is not as flexible as other data warehousing tools because it is designed to work
specifically with Hadoop, which can limit its usability in other environments.
Pig
Pig is a high-level platform, or tool, used to process large datasets. It provides a high
level of abstraction over MapReduce and a high-level scripting language, known as Pig Latin,
which is used to develop data analysis code. To process data stored in HDFS, programmers
write scripts in the Pig Latin language. Internally, the Pig Engine (a component of Apache
Pig) converts these scripts into specific map and reduce tasks; this conversion is not
visible to programmers, which is what provides the high level of abstraction. Pig Latin and
the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is
always stored in HDFS.
● Apache Pig provides a rich set of operators for performing operations such as
filtering, joining, sorting, and aggregation.
● It is easy to learn, read, and write; especially for SQL programmers, Apache Pig is a boon.
● Apache Pig is extensible, so you can write your own processing logic and user-defined
functions (UDFs) in Python, Java, or other programming languages.
● Join operations are easy in Apache Pig.
● It requires fewer lines of code.
● Apache Pig allows splits in the pipeline.
● By integrating with other components of the Apache Hadoop ecosystem, such as Apache
Hive, Apache Spark, and Apache ZooKeeper, Apache Pig enables users to take
advantage of these components' capabilities while transforming data.
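The data-flow style described above can be sketched in a short Pig Latin script; the file paths and field names here are assumptions for the example:

```pig
-- Load comma-separated records from HDFS (path is hypothetical)
logs = LOAD '/data/logs.csv' USING PigStorage(',')
       AS (user:chararray, action:chararray, bytes:int);

-- Filter, group, and aggregate: each relation is one step in the pipeline
downloads = FILTER logs BY action == 'download';
by_user   = GROUP downloads BY user;
totals    = FOREACH by_user GENERATE group AS user,
                                     SUM(downloads.bytes) AS total_bytes;

-- Results are written back to HDFS
STORE totals INTO '/output/download_totals';
```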
Implementation
Hive setup
Download and extract hive
In hive-env.sh
hive-site.xml
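The key hive-site.xml properties for a MySQL-backed metastore typically look like the following sketch (the database name `metastore` and the connector class are assumptions; the user and password match those created later in this setup):

```xml
<configuration>
  <!-- JDBC connection to the MySQL metastore database (database name assumed) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Hive111!!!</value>
  </property>
</configuration>
```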
MySQL Connection URL
Fix the tmpdir "Relative path in absolute URI" error
hive-config.sh
Change the JAVA_HOME path in hadoop-env.sh (Hive is based on JDK 8; it does not support JDK 11 yet)
Create user and password as mentioned in hive-site.xml
CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'Hive111!!!';
GRANT ALL PRIVILEGES ON *.* TO 'hiveuser'@'localhost';
-----------------------------------------------------------------------------------------------------------
Download MySQL Connector and Copy Jar to Hive folder
https://downloads.mysql.com/archives/c-j/
Extract deb package for Ubuntu 20.04 and it will be installed under /usr/share/java
Run Hiveserver
hive --service metastore
hive --service hiveserver2
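Once HiveServer2 is up, clients can connect to it over JDBC, for example with the bundled Beeline shell (the default port 10000 is an assumption; change it if hive-site.xml overrides it):

```shell
# Connect to the local HiveServer2 instance with Beeline
beeline -u jdbc:hive2://localhost:10000 -n hiveuser
```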
Pig Setup
Verify Hadoop and Java Installation.
Download Pig zip
Extract the tar file by using the below command and rename the folder to pig to make it
meaningful.
$tar -xzf pig-0.17.0.tar.gz
$mv pig-0.17.0 pig
Edit the .bashrc file to update the environment variable of Apache Pig so that it can be accessed
from any directory.
$nano .bashrc
Add below lines.
export PIG_HOME=/home/jigsaw/pig
export PATH=$PATH:/home/jigsaw/pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop
Run the source command to apply the changes in the same terminal.
$source .bashrc
Now run the pig version command to make sure that Pig is installed properly.
$pig -version
Run the pig help command to see all Pig command options.
$pig -help
Start the Pig Grunt shell (the Grunt shell is used to execute Pig Latin scripts).
$pig
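A short Grunt session might look like the following sketch (the input path is hypothetical):

```pig
-- Entered interactively at the grunt> prompt
lines  = LOAD '/data/sample.txt' AS (line:chararray);
first5 = LIMIT lines 5;
DUMP first5;   -- triggers execution and prints the first five records
```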
Conclusion
In this experiment we have successfully set up Hive and Pig in our existing Hadoop cluster.