BDA Exp-6
Class: D17A
ROLL NO: 16, 39
NAME: Divya Makhija, Mohit Gangwani
DOP: DOS:
THEORY:
Introduction to HIVE:
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analysis easy. Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open-source software under the name Apache Hive.
Hive is not:
● A relational database
● A design for OnLine Transaction Processing (OLTP)
● A language for real-time queries and row-level updates
Features of Hive
Hive is an open-source data warehousing and SQL-like query language system built on top of the Hadoop
ecosystem. It provides a way to query, analyze, and manage large datasets stored in Hadoop's distributed storage
systems, such as HDFS, by using a SQL-like language called Hive Query Language (HQL).
1. Data Warehousing: Hive is designed for data warehousing and analytics. It allows users to structure and
manage their data using tables and columns similar to a relational database, making it accessible for analysis
using SQL-like queries.
2. Schema on Read: Hive follows a "schema on read" approach, which means that data is stored in its raw form,
and the schema is applied during query processing. This flexibility enables handling semi-structured and
unstructured data.
3. Hive Query Language (HQL): HQL is a SQL-like language that allows users to write queries to extract,
transform, and analyze data stored in Hadoop clusters. It supports various SQL operations, including filtering,
aggregation, and joins (a short HiveQL sketch follows this list).
4. Tables and Metastore: Hive defines tables, columns, and partitions for data organization. It stores metadata
about these tables and their schemas in a metastore, which can be backed by a traditional relational database or
a compatible storage solution.
5. Optimization and Execution: Hive optimizes queries by generating a plan that is executed by Hadoop
MapReduce or other execution engines (like Tez or Spark). Hive also provides mechanisms for optimizing
query performance through techniques like predicate pushdown and map-side joins.
6. User-Defined Functions (UDFs): Hive supports custom user-defined functions (UDFs) that can be written in
various programming languages like Java, Python, or Scala. UDFs allow users to extend Hive's capabilities and
perform custom transformations.
7. Integration with Hadoop Ecosystem: Hive seamlessly integrates with other Hadoop ecosystem components
like HBase, HDFS, and YARN. It can interact with data stored in these systems and leverage their
capabilities.
8. Partitioning and Bucketing: Hive supports partitioning, where data is divided into partitions based on certain
criteria, such as date or location. This improves query performance by reducing the amount of data scanned.
Bucketing is another mechanism that helps organize data within partitions for further optimization.
9. Data Storage Formats: Hive supports various storage formats, such as ORC (Optimized Row Columnar) and
Parquet, which are designed to improve data compression and query performance.
10. Data Transformation: While not as powerful as traditional ETL tools, Hive supports basic data
transformation and cleansing through its HQL queries.
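As a rough sketch of how points 3, 8, and 9 fit together (the table names, column names, and values here are hypothetical), a partitioned, bucketed ORC table and a typical analytical query could be written in HiveQL as:

    -- Partitioned, bucketed table stored as ORC (hypothetical schema)
    CREATE TABLE sales (
        order_id    INT,
        customer_id INT,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 8 BUCKETS
    STORED AS ORC;

    -- Filtering, aggregation, and a join over that table
    SELECT c.city, SUM(s.amount) AS total_sales
    FROM sales s
    JOIN customers c ON s.customer_id = c.customer_id
    WHERE s.order_date = '2023-01-01'
    GROUP BY c.city;

Similarly, a user-defined function from point 6, packaged in a JAR, could be registered and called as follows (the JAR path and class name are placeholders):

    ADD JAR /home/cloudera/my-udfs.jar;
    CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpperUDF';
    SELECT to_upper(c.city) FROM customers c;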
Hive is particularly useful for analysts and data engineers who are familiar with SQL and want to leverage their
existing skills for querying and analyzing large datasets stored in Hadoop. It abstracts the complexity of writing
MapReduce jobs while providing a familiar SQL-like interface, making big data processing more accessible to
a wider range of users.
Architecture of HIVE:
Hive architecture consists of various components that work together to provide a data warehousing and
querying solution for large datasets stored in the Hadoop ecosystem. The architecture of Hive can be divided
into the following key components:
1. Client Interface:
- Hive CLI (Command-Line Interface): The Hive CLI is a command-line tool that allows users to interact with
Hive by submitting Hive Query Language (HQL) commands and queries.
- HiveServer2: HiveServer2 is a more advanced version of the Hive server that provides a Thrift-based
interface for clients to submit queries and retrieve results. It supports multiple concurrent connections and offers
improved performance and security features.
- Hive Web UI: Hive provides a web-based user interface (UI) that allows users to submit and monitor
queries using a web browser.
2. Hive Metastore:
- The Hive Metastore is a critical component that stores metadata about Hive tables, columns, partitions, and
storage formats.
- It can be backed by a traditional relational database such as MySQL or PostgreSQL; by default, Hive ships with
an embedded Apache Derby database suited to single-user setups.
3. Hive Execution Engine:
- The execution engine is responsible for processing Hive queries and transforming them into a series of
MapReduce, Tez, or Spark jobs that run on the Hadoop cluster.
- The default execution engine is MapReduce, but Hive also supports Tez and Spark for faster query
execution.
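As a small illustration of components 2 and 3, the execution engine can be chosen per session, and the metadata kept in the metastore can be inspected directly from HiveQL (the table name is a placeholder):

    -- Choose the execution engine for the current session (MapReduce is the default)
    SET hive.execution.engine=mr;    -- or tez / spark where available

    -- Inspect metadata stored in the metastore
    SHOW TABLES;
    DESCRIBE FORMATTED sales;
    SHOW PARTITIONS sales;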
IMPLEMENTATION:
1. The CSV dataset is dragged and dropped from Windows onto the local machine (Cloudera VM), and is then
copied from the local machine into HDFS.
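A minimal HiveQL sketch of this step, assuming a hypothetical file /user/cloudera/sales.csv has already been copied into HDFS and that its columns match the schema below:

    -- Table whose layout matches the CSV columns (hypothetical schema)
    CREATE TABLE sales_csv (
        order_id    INT,
        customer_id INT,
        amount      DOUBLE,
        order_date  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Load the file already present in HDFS into the table
    LOAD DATA INPATH '/user/cloudera/sales.csv' INTO TABLE sales_csv;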
7. Query execution and visualization (using aggregate functions & pie chart):
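For example, an aggregate query of the kind whose result can be rendered as a pie chart in Hue (column names follow the hypothetical sales_csv table above):

    SELECT order_date,
           COUNT(*)    AS num_orders,
           SUM(amount) AS total_amount
    FROM sales_csv
    GROUP BY order_date;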
8. Use of the WHERE clause:
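A simple filtered query on the same hypothetical table:

    SELECT order_id, amount
    FROM sales_csv
    WHERE amount > 1000 AND order_date = '2023-01-01';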
CONCLUSION:
Hive on the Cloudera platform shows strong capabilities for managing data and creating visualizations.
By structuring the dataset into organized tables, we could efficiently explore its attributes using SQL-like
queries. Hive's integration with visualization tools such as Hue allowed us to translate query results into easily
understandable visuals, like bar charts. This experiment highlighted Hive's role in simplifying data analysis and
visualization for better insights.