Experiment 06

Big Data Analysis

Harsh Suryanath Nag


10th March, 2024
Aim
Set up and install Apache Hive and Pig. Write your observations.

Theory
Apache Hive

Apache Hive is a data warehouse and ETL tool that provides an SQL-like interface between
the user and the Hadoop Distributed File System (HDFS). It is built on top of Hadoop as a
software project for data query and analysis. It facilitates reading, writing, and managing
large datasets stored in distributed storage, queried using Structured Query Language (SQL)
syntax. It is not built for Online Transactional Processing (OLTP) workloads; rather, it is
frequently used for data warehousing tasks such as data encapsulation, ad-hoc queries, and
analysis of huge datasets. It is designed for scalability, extensibility, performance,
fault tolerance, and loose coupling with its input formats.

Hive was initially developed by Facebook and is now used by companies such as Amazon and
Netflix. It delivers standard SQL functionality for analytics. Traditionally, queries over
distributed data had to be written directly against the MapReduce Java API; Hive provides
an SQL abstraction over this, and offers portability since most data warehousing
applications already work with SQL-based query languages.

Apache Hive is a data warehouse software project that is built on top of the Hadoop ecosystem.
It provides an SQL-like interface to query and analyze large datasets stored in Hadoop’s
distributed file system (HDFS) or other compatible storage systems.

Hive uses a language called HiveQL, which is similar to SQL, to allow users to express data
queries, transformations, and analyses in a familiar syntax. HiveQL statements are compiled into
MapReduce jobs, which are then executed on the Hadoop cluster to process the data.
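For instance, a simple HiveQL aggregation (a sketch; the employees table and its columns are hypothetical) reads just like SQL but is compiled into one or more MapReduce jobs behind the scenes:

SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department;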

Hive includes many features that make it a useful tool for big data analysis, including support for
partitioning, indexing, and user-defined functions (UDFs). It also provides a number of
optimization techniques to improve query performance, such as predicate pushdown, column
pruning, and query parallelization.
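As an illustration of partitioning (a sketch with hypothetical table and column names), declaring a partition column lets queries that filter on it scan only the matching directories:

CREATE TABLE logs (ip STRING, url STRING)
PARTITIONED BY (log_date STRING)
STORED AS ORC;

-- Only the log_date='2024-03-10' partition directory is scanned:
SELECT COUNT(*) FROM logs WHERE log_date = '2024-03-10';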

Hive can be used for a variety of data processing tasks, such as data warehousing, ETL (extract,
transform, load) pipelines, and ad-hoc data analysis. It is widely used in the big data industry,
especially in companies that have adopted the Hadoop ecosystem as their primary data
processing platform.

Components of Hive -

1. HCatalog –

HCatalog is a Hive component that serves as the table and storage management layer for
Hadoop. It enables users with different data processing tools, such as Pig and MapReduce,
to read and write data on the grid more easily.

2. WebHCat –

WebHCat provides a service that users can call to run Hadoop MapReduce (or YARN), Pig, and
Hive jobs, or to perform Hive metadata operations, through an HTTP interface.

Modes of Hive -

1. Local Mode –

Local mode is used when Hadoop is installed in pseudo-distributed mode with a single data
node, when the data is small enough to be restricted to a single local machine, and when
processing smaller datasets on the local machine will be faster.

2. Map Reduce Mode –

MapReduce mode is used when Hadoop is built with multiple data nodes and the data is
distributed across them. It operates on huge datasets, executes queries in parallel, and
achieves better performance when processing large volumes of data.

Characteristics of Hive -

1. Databases and tables are built before loading the data.


2. Hive as a data warehouse is built to manage and query only structured data residing in
tables.
3. When handling structured data, raw MapReduce lacks optimization and usability features
such as UDFs, whereas the Hive framework provides both.
4. Programming in Hadoop deals directly with files, so Hive can partition the data using
directory structures to improve performance on certain queries.
5. Hive is compatible with various file formats such as TEXTFILE, SEQUENCEFILE, ORC,
RCFILE, etc.
6. Hive uses the embedded Derby database for single-user metadata storage and MySQL for
multi-user or shared metadata.

Features of Hive -

1. It provides indexes, including bitmap indexes, to accelerate queries; index types
including compaction and bitmap indexes are available as of Hive 0.10.
2. Metadata is stored in an RDBMS, which reduces the time needed to perform semantic
checks during query execution.
3. Built-in functions are available for manipulating strings, dates, and other data; the
set of user-defined functions (UDFs) can be extended to handle use cases not covered by
the predefined functions.
4. It can operate on compressed data stored in the Hadoop ecosystem using algorithms such
as DEFLATE, BWT, and Snappy.
5. It stores schemas in a database and processes the data in the Hadoop Distributed File
System (HDFS).
6. It is built for Online Analytical Processing (OLAP).
7. It provides an SQL-like query language known as Hive Query Language (HQL or HiveQL).

Advantages -

1. Scalability -

Apache Hive is designed to handle large volumes of data, making it a scalable solution for
big data processing.

2. Familiar SQL-like interface -

Hive uses a SQL-like language called HiveQL, which makes it easy for SQL users to learn
and use.

3. Integration with Hadoop ecosystem -

Hive integrates well with the Hadoop ecosystem, enabling users to process data using
other Hadoop tools like Pig, MapReduce, and Spark.

4. Supports partitioning and bucketing -

Hive supports partitioning and bucketing, which can improve query performance by
limiting the amount of data scanned.

5. User-defined functions -

Hive allows users to define their own functions, which can be used in HiveQL queries.

Disadvantages -

1. Limited real-time processing -

Hive is designed for batch processing, which means it may not be the best tool for
real-time data processing.

2. Slow performance -

Hive can be slower than traditional relational databases because it is built on top of
Hadoop, which is optimized for batch processing rather than interactive querying.

3. Steep learning curve -

While Hive uses a SQL-like language, it still requires users to have knowledge of Hadoop
and distributed computing, which can make it difficult for beginners to use.

4. Limited transaction support -

Hive historically did not support transactions, and the ACID support added in later
versions remains limited, which can make it difficult to maintain data consistency.

5. Limited flexibility -

Hive is not as flexible as other data warehousing tools because it is designed to work
specifically with Hadoop, which can limit its usability in other environments.

Pig

Pig is a high-level platform or tool used to process large datasets. It provides a high
level of abstraction over MapReduce, along with a high-level scripting language, known as
Pig Latin, used to develop data analysis code. To process data stored in HDFS, programmers
write scripts in the Pig Latin language; internally, the Pig Engine (a component of Apache
Pig) converts these scripts into a series of map and reduce tasks. These tasks are not
visible to the programmers, which is what provides the high level of abstraction. Pig Latin
and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is
always stored in HDFS.
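A minimal Pig Latin sketch (the input file, schema, and field names are hypothetical) showing the load-transform-store flow described above:

-- Load comma-separated records from HDFS, keep large requests,
-- count them per IP, and store the result back in HDFS.
logs = LOAD '/user/hadoop/logs.txt' USING PigStorage(',') AS (ip:chararray, bytes:int);
big = FILTER logs BY bytes > 1024;
grouped = GROUP big BY ip;
counts = FOREACH grouped GENERATE group, COUNT(big);
STORE counts INTO '/user/hadoop/output';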

Features of Apache Pig -

● Apache Pig provides a rich set of operators for performing operations such as
filtering, joining, sorting, and aggregation.
● It is easy to learn, read, and write; for SQL programmers especially, Apache Pig is a boon.
● Apache Pig is extensible, so you can write your own processing logic and user-defined
functions (UDFs) in Python, Java, or other programming languages.
● Join operations are easy in Apache Pig.
● It requires fewer lines of code.
● Apache Pig allows splits in the pipeline.
● By integrating with other components of the Apache Hadoop ecosystem, such as Apache
Hive, Apache Spark, and Apache ZooKeeper, Apache Pig enables users to take
advantage of these components' capabilities while transforming data.

Implementation

Hive setup
Download and extract Hive.
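The commands below are a sketch, assuming Hive 3.1.2 from the Apache archive installed under /opt/hive (version and paths may differ):

$wget https://archive.apache.org/dist/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
$tar -xzf apache-hive-3.1.2-bin.tar.gz
$sudo mv apache-hive-3.1.2-bin /opt/hive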

Edit Hive conf files.

In hive-env.sh
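A representative hive-env.sh entry, assuming Hadoop under /usr/local/hadoop and Hive under /opt/hive (adjust to your installation):

export HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/opt/hive/conf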

In hive-site.xml, set the properties described in the next four steps (a consolidated sketch follows them).

MySQL Connection URL

Set username and password of MySQL user.

Scratch dir location

MySQL JDBC Connector
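Taken together, a sketch of the hive-site.xml properties from the steps above, assuming a local MySQL metastore database named metastore, the hiveuser credentials created in the MySQL step below, /tmp/hive as the scratch dir, and MySQL Connector/J 8 (adjust all values to your setup):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>Hive111!!!</value>
</property>
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>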

Remove the illegal character sequence (&#8;) from hive-site.xml so the file parses as valid XML (in Hive 3.1.2 this stray character appears in a property description, typically reported around line 3215).

Fix the tmpdir "Relative path in absolute URI" error by replacing the ${system:java.io.tmpdir} placeholders in hive-site.xml with a concrete local path.
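One way to do the substitution (a sketch using GNU sed; /tmp and the hive subdirectory are arbitrary writable choices):

$sed -i 's|\${system:java.io.tmpdir}|/tmp|g; s|\${system:user.name}|hive|g' hive-site.xml

so that occurrences of ${system:java.io.tmpdir}/${system:user.name} become /tmp/hive.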

In hive-config.sh (under /opt/hive/bin), set HADOOP_HOME.
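A representative hive-config.sh addition, again assuming Hadoop under /usr/local/hadoop:

export HADOOP_HOME=/usr/local/hadoop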

Change the JAVA_HOME path in hadoop-env.sh (Hive is based on JDK 8, not JDK 11 yet).
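A representative hadoop-env.sh entry, assuming the Ubuntu OpenJDK 8 package path (adjust to wherever your JDK 8 lives):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64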

Add the Hive environment variables to .bashrc and source it.
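Representative .bashrc additions, assuming Hive is installed under /opt/hive:

export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin

$source ~/.bashrc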

Setup MySQL Server
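On Ubuntu, installing and starting MySQL Server looks roughly like this (a sketch; package names are the Ubuntu 20.04 ones, and sudo mysql opens the root shell via socket authentication):

$sudo apt update
$sudo apt install mysql-server
$sudo systemctl start mysql
$sudo mysql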

Create user and password as mentioned in hive-site.xml
CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'Hive111!!!';
GRANT ALL PRIVILEGES ON *.* TO 'hiveuser'@'localhost';
Download the MySQL Connector and copy its jar to the Hive folder.
https://downloads.mysql.com/archives/c-j/
Extract/install the deb package for Ubuntu 20.04; the connector jar will be placed under /usr/share/java.
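Assuming the deb package placed the jar under /usr/share/java and Hive lives in /opt/hive (the exact jar name depends on the connector version):

$cp /usr/share/java/mysql-connector-java-*.jar /opt/hive/lib/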

Hive schematool (schema initialization)


In /opt/hive/bin, run the schematool command to create the metastore schema.
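The invocation, assuming a MySQL-backed metastore as configured above:

$./schematool -dbType mysql -initSchema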

Work on HDFS - create Hive's default warehouse location.
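A sketch of the standard warehouse and tmp directory setup (these paths are the Hive defaults):

$hdfs dfs -mkdir -p /tmp
$hdfs dfs -mkdir -p /user/hive/warehouse
$hdfs dfs -chmod g+w /tmp
$hdfs dfs -chmod g+w /user/hive/warehouse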

Access Hive CLI
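Once the metastore is initialized, the CLI can be launched from $HIVE_HOME/bin and checked with a simple statement:

$hive
hive> show databases;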

Run the Hive services (metastore and HiveServer2):
hive --service metastore
hive --service hiveserver2
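With HiveServer2 up, a client can connect through Beeline (a sketch; 10000 is the default HiveServer2 port and hiveuser is the account created above):

$beeline -u jdbc:hive2://localhost:10000 -n hiveuser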

Pig Setup
Verify Hadoop and Java Installation.
Download the Pig tarball.
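A sketch of downloading Pig 0.17.0 from the Apache archive:

$wget https://archive.apache.org/dist/pig/pig-0.17.0/pig-0.17.0.tar.gz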

Extract the tar file by using the below command and rename the folder to pig to make it
meaningful.
$tar -xzf pig-0.17.0.tar.gz
$mv pig-0.17.0 pig
Edit the .bashrc file to update the environment variable of Apache Pig so that it can be accessed
from any directory.
$nano .bashrc
Add below lines.
export PIG_HOME=/home/jigsaw/pig
export PATH=$PATH:/home/jigsaw/pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop
Run the source command to update the changes in the same terminal.
$source .bashrc
Now run the pig version command to make sure that Pig is installed properly, and the pig
help command to see all Pig command options.
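The two checks:

$pig -version
$pig -help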

Start the Pig Grunt shell (the Grunt shell is used to execute Pig Latin statements interactively).
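By default pig launches the Grunt shell in MapReduce mode against the cluster; -x local runs against the local filesystem instead (a sketch):

$pig
grunt>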

Conclusion
In this experiment, we successfully set up Apache Hive and Apache Pig on our existing Hadoop cluster.
