Bda Exp-6


Vivekanand Education Society’s Institute of Technology

Department of Computer Engineering

Subject: Big Data Analytics

Class: D17A
ROLL NO: 16, 39
NAME: Divya Makhija, Mohit Gangwani

EXPERIMENT NO: 06

EXPERIMENT TITLE: Create HIVE Database and Descriptive analytics-based statistics, visualization using Hive

DOP: DOS:

GRADES: LOs MAPPED: SIGNATURE:


Aim: Create HIVE Database and Descriptive analytics-based statistics, visualization using Hive

THEORY:

Introduction to HIVE:

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It sits on top of Hadoop to
summarize Big Data and makes querying and analysis easy. Hive was initially developed by Facebook; the
Apache Software Foundation later took it up and developed it further as open-source software under the name
Apache Hive.

Hive is not

● A relational database

● A design for OnLine Transaction Processing (OLTP)

● A language for real-time queries and row-level updates

Features of Hive

● It stores the schema in a database (the metastore) and the processed data in HDFS.

● It is designed for OnLine Analytical Processing (OLAP).

● It provides an SQL-like query language called HiveQL or HQL.

Hive is an open-source data warehousing and SQL-like query language system built on top of the Hadoop
ecosystem. It provides a way to query, analyze, and manage large datasets stored in Hadoop's distributed storage
systems, such as HDFS, by using a SQL-like language called Hive Query Language (HQL).

Key features and concepts of Hive include:

1. Data Warehousing: Hive is designed for data warehousing and analytics. It allows users to structure and
manage their data using tables and columns similar to a relational database, making it accessible for analysis
using SQL-like queries.

2. Schema on Read: Hive follows a "schema on read" approach, which means that data is stored in its raw form,
and the schema is applied during query processing. This flexibility enables handling semi-structured and
unstructured data.

3. Hive Query Language (HQL): HQL is a SQL-like language that allows users to write queries to extract,
transform, and analyze data stored in Hadoop clusters. It supports various SQL operations, including filtering,
aggregation, and joins.

4. Tables and Metastore: Hive defines tables, columns, and partitions for data organization. It stores metadata
about these tables and their schemas in a metastore, which can be backed by a traditional relational database or
a compatible storage solution.

5. Optimization and Execution: Hive optimizes queries by generating a plan that is executed by Hadoop
MapReduce or other execution engines (like Tez or Spark). Hive also provides mechanisms for optimizing
query performance through techniques like predicate pushdown and map-side joins.

6. User-Defined Functions (UDFs): Hive supports custom user-defined functions (UDFs), which are written in
Java or another JVM language such as Scala. Scripts in other languages, such as Python, can be plugged into a
query through the TRANSFORM clause. UDFs allow users to extend Hive's capabilities and perform custom
transformations.

7. Integration with Hadoop Ecosystem: Hive seamlessly integrates with other Hadoop ecosystem components
like HBase, HDFS, and YARN. It can interact with data stored in these systems and leverage their
capabilities.

8. Partitioning and Bucketing: Hive supports partitioning, where data is divided into partitions based on certain
criteria, such as date or location. This improves query performance by reducing the amount of data scanned.
Bucketing is another mechanism that helps organize data within partitions for further optimization.

9. Data Storage Formats: Hive supports various storage formats, such as ORC (Optimized Row Columnar) and
Parquet, which are designed to improve data compression and query performance.

10. Data Transformation: While not as powerful as traditional ETL tools, Hive supports basic data
transformation and cleansing through its HQL queries.
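
To make the partitioning and bucketing idea in point 8 concrete, here is a minimal Python sketch, not Hive itself. It assumes a hypothetical set of rows with columns tree_id, city, and height_m (none of these come from the experiment's dataset), uses city as the partition column, and uses tree_id modulo the bucket count as a stand-in for Hive's hash-based bucket assignment.

```python
from collections import defaultdict

# Hypothetical rows (tree_id, city, height_m); not the experiment's real data.
ROWS = [
    (1, "Mumbai", 12.0),
    (2, "Pune", 8.5),
    (3, "Mumbai", 15.2),
    (4, "Nagpur", 9.1),
    (5, "Pune", 11.0),
]

NUM_BUCKETS = 2  # assumed bucket count, chosen only for illustration


def partition_and_bucket(rows, num_buckets):
    """Group rows by partition key (city), then by bucket within each partition."""
    layout = defaultdict(lambda: defaultdict(list))
    for tree_id, city, height_m in rows:
        # Stand-in for Hive's hash-based bucket assignment on the bucketing column.
        bucket = tree_id % num_buckets
        layout[city][bucket].append((tree_id, height_m))
    return layout


def scan_partition(layout, city):
    """Read only one partition, as a filter on the partition column would allow."""
    buckets = layout.get(city, {})
    return sorted(row for bucket_rows in buckets.values() for row in bucket_rows)


if __name__ == "__main__":
    layout = partition_and_bucket(ROWS, NUM_BUCKETS)
    # Only the "Mumbai" partition is touched; Pune and Nagpur rows are never read.
    print(scan_partition(layout, "Mumbai"))
```

A query that filters on the partition column only needs to read the matching group, which is the reduction in scanned data that point 8 describes.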

Hive is particularly useful for analysts and data engineers who are familiar with SQL and want to leverage their
existing skills for querying and analyzing large datasets stored in Hadoop. It abstracts the complexity of writing
MapReduce jobs while providing a familiar SQL-like interface, making big data processing more accessible to
a wider range of users.

Architecture of HIVE:

Hive architecture consists of various components that work together to provide a data warehousing and
querying solution for large datasets stored in the Hadoop ecosystem. The architecture of Hive can be divided
into the following key components:

1. Client Interface:
- Hive CLI (Command-Line Interface): The Hive CLI is a command-line tool that allows users to interact with
Hive by submitting Hive Query Language (HQL) commands and queries.
- HiveServer2: HiveServer2 is a more advanced version of the Hive server that provides a Thrift-based
interface for clients to submit queries and retrieve results. It supports multiple concurrent connections and offers
improved performance and security features.
- Hive Web UI: Hive provides a web-based user interface (UI) that allows users to submit and monitor
queries using a web browser.

2. Hive Metastore:
- The Hive Metastore is a critical component that stores metadata about Hive tables, columns, partitions, and
storage formats.
- It is backed by a relational database such as MySQL or PostgreSQL. Apache Derby is the embedded default
and is suitable for single-user testing.

3. Hive Execution Engine:
- The execution engine is responsible for processing Hive queries and transforming them into a series of
MapReduce, Tez, or Spark jobs that run on the Hadoop cluster.
- MapReduce is the traditional default execution engine (deprecated since Hive 2); Hive also supports Tez and
Spark for faster query execution.

4. Hive Compiler and Optimizer:
- The Hive Query Compiler parses and compiles HQL queries into an execution plan.
- The Optimizer analyzes the execution plan and applies various optimization techniques to improve query
performance. This includes predicate pushdown, join optimization, and map-side aggregation.

5. Hive Query Language (HQL) Parser:
- The HQL Parser parses HQL queries and converts them into an abstract syntax tree (AST) representation
that the compiler then turns into an execution plan.

6. Hive Metastore Service:
- The Hive Metastore Service provides APIs for interacting with the Hive Metastore. It manages metadata
storage, retrieval, and updates for Hive tables, schemas, and partitions.

7. Driver and Session Management:
- The Hive Driver is responsible for executing Hive queries. It coordinates query compilation, optimization,
and execution through interactions with the compiler, optimizer, and execution engine.
- Session management handles the lifecycle of user sessions, allowing multiple users to interact with Hive
simultaneously.

8. Hive Storage Handlers:
- Hive Storage Handlers allow Hive to interact with various data storage systems beyond HDFS. They enable
Hive to query data stored in other data stores like HBase, Cassandra, and more.

CODE SNAPSHOTS WITH COMMENTS:

1. The CSV dataset is dragged and dropped from Windows onto the local machine (the Cloudera VM). It is then
copied from the local file system to HDFS:
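
The original snapshot for this step is not available, so the following is only a sketch of how the copy to HDFS could be scripted. It builds the standard `hdfs dfs` commands as argument lists; the local path `/home/cloudera/Desktop/trees.csv` and the HDFS directory `/user/cloudera/trees_data` are assumptions, not paths taken from the experiment.

```python
import shlex


def hdfs_upload_commands(local_csv, hdfs_dir):
    """Return the hdfs dfs commands that create a directory, upload a file, and list it."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],     # create the target directory
        ["hdfs", "dfs", "-put", local_csv, hdfs_dir],  # copy the local file into HDFS
        ["hdfs", "dfs", "-ls", hdfs_dir],              # confirm the file arrived
    ]


if __name__ == "__main__":
    # Both paths are hypothetical placeholders.
    commands = hdfs_upload_commands(
        "/home/cloudera/Desktop/trees.csv", "/user/cloudera/trees_data"
    )
    for command in commands:
        # Printed for review; on a Hadoop node each list could instead be
        # passed to subprocess.run(command, check=True).
        print(shlex.join(command))
```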

2. Create an external table trees with attributes as present in the dataset:
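
Since the snapshot is missing, here is a hedged sketch of what the table definition might look like. The Python helper only assembles the HiveQL text; the column names and types (tree_id, species, city, height_m) are hypothetical, and the `skip.header.line.count` property assumes the CSV file has a single header row.

```python
def create_external_table_ddl(table, columns, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for a delimited text file."""
    column_lines = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        # Assumption: the CSV starts with one header row that should be skipped.
        "TBLPROPERTIES ('skip.header.line.count'='1');"
    )


# Hypothetical schema; the real attributes come from the dataset itself.
TREE_COLUMNS = [
    ("tree_id", "INT"),
    ("species", "STRING"),
    ("city", "STRING"),
    ("height_m", "DOUBLE"),
]

if __name__ == "__main__":
    print(create_external_table_ddl("trees", TREE_COLUMNS))
```

The printed statement can be pasted into the Hive CLI or the Hue query editor after replacing the columns with those of the actual dataset.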


3. Load data in Hive from HDFS:
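
As a sketch of this step, the helper below assembles a `LOAD DATA INPATH` statement. The HDFS path is an assumed placeholder. Note that without the `LOCAL` keyword Hive moves the file from its HDFS location into the table's directory rather than copying it.

```python
def load_data_statement(hdfs_path, table, overwrite=False):
    """Build a HiveQL LOAD DATA statement for a file that is already in HDFS."""
    overwrite_clause = "OVERWRITE " if overwrite else ""
    return f"LOAD DATA INPATH '{hdfs_path}' {overwrite_clause}INTO TABLE {table};"


if __name__ == "__main__":
    # Hypothetical HDFS path; use the location chosen in step 1.
    print(load_data_statement("/user/cloudera/trees_data/trees.csv", "trees"))
    # With overwrite=True the existing contents of the table are replaced.
    print(load_data_statement("/user/cloudera/trees_data/trees.csv", "trees", overwrite=True))
```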

4. Select statement to check contents of table:
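
The check in this step is a plain `SELECT` with a `LIMIT`. The sketch below runs such a query against an in-memory SQLite table as a stand-in for Hive, so it can be tried without a cluster. The table contents and column names are hypothetical; the query text itself uses only syntax that HiveQL also accepts.

```python
import sqlite3

# Hypothetical rows (tree_id, species, city, height_m); not the real dataset.
SAMPLE_ROWS = [
    (1, "Neem", "Mumbai", 12.0),
    (2, "Banyan", "Pune", 8.0),
    (3, "Neem", "Mumbai", 16.0),
    (4, "Peepal", "Nagpur", 9.0),
]

PREVIEW_QUERY = "SELECT * FROM trees LIMIT 3"


def run_query(query):
    """Run a query against an in-memory SQLite copy of the sample table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE trees (tree_id INTEGER, species TEXT, city TEXT, height_m REAL)"
        )
        conn.executemany("INSERT INTO trees VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_query(PREVIEW_QUERY):
        print(row)
```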


5. Query execution and visualization:

6. Query execution and visualization (using aggregate functions):
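
A descriptive-statistics query of the kind this step refers to could look like the sketch below. It again uses an in-memory SQLite table as a stand-in for Hive, with hypothetical columns and rows; `COUNT`, `MIN`, `MAX`, `AVG`, and `GROUP BY` behave the same way in HiveQL for this simple case.

```python
import sqlite3

# Hypothetical rows (tree_id, species, city, height_m); not the real dataset.
SAMPLE_ROWS = [
    (1, "Neem", "Mumbai", 12.0),
    (2, "Banyan", "Pune", 8.0),
    (3, "Neem", "Mumbai", 16.0),
    (4, "Peepal", "Nagpur", 9.0),
    (5, "Banyan", "Pune", 11.0),
    (6, "Neem", "Pune", 14.0),
]

STATS_QUERY = """
SELECT species,
       COUNT(*)      AS tree_count,
       MIN(height_m) AS min_height,
       MAX(height_m) AS max_height,
       AVG(height_m) AS avg_height
FROM trees
GROUP BY species
ORDER BY tree_count DESC, species
"""


def run_query(query):
    """Run a query against an in-memory SQLite copy of the sample table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE trees (tree_id INTEGER, species TEXT, city TEXT, height_m REAL)"
        )
        conn.executemany("INSERT INTO trees VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # One summary row per species: count, minimum, maximum, and mean height.
    for row in run_query(STATS_QUERY):
        print(row)
```

Each result row is one category with its summary values, which is the shape a bar chart in Hue expects: one column for the axis labels and one for the measure.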

7. Query execution and visualization (using aggregate functions and a pie chart):

8. Use of where clause:
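
As an illustration of filtering with `WHERE`, the sketch below selects rows that satisfy two conditions. As in the other sketches, SQLite stands in for Hive, and the column names, city value, and height threshold are hypothetical.

```python
import sqlite3

# Hypothetical rows (tree_id, species, city, height_m); not the real dataset.
SAMPLE_ROWS = [
    (1, "Neem", "Mumbai", 12.0),
    (2, "Banyan", "Pune", 8.0),
    (3, "Neem", "Mumbai", 16.0),
    (4, "Peepal", "Nagpur", 9.0),
    (5, "Banyan", "Pune", 11.0),
    (6, "Neem", "Pune", 14.0),
]

WHERE_QUERY = """
SELECT tree_id, species, height_m
FROM trees
WHERE city = 'Pune' AND height_m > 10
ORDER BY tree_id
"""


def run_query(query):
    """Run a query against an in-memory SQLite copy of the sample table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE trees (tree_id INTEGER, species TEXT, city TEXT, height_m REAL)"
        )
        conn.executemany("INSERT INTO trees VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Keeps only Pune trees taller than 10 m.
    for row in run_query(WHERE_QUERY):
        print(row)
```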

CONCLUSION:

Running Hive on the Cloudera platform showed that it can manage large datasets and, together with Hue,
produce visualizations. By structuring the dataset into a table, we could explore its attributes efficiently using
SQL-like queries. Hive's integration with visualization tools such as Hue let us turn query results into easily
understandable visuals such as bar charts and pie charts. This experiment highlighted Hive's role in simplifying
data analysis and visualization for better insights.
