BDA Exp-6
Class: D17A
ROLL NO: 16, 39
NAME: Divya Makhija, Mohit Gangwani
DOP: DOS:
THEORY:
Introduction to HIVE:
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analysis easy. Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open-source software under the name Apache Hive.
Hive is not:
● A relational database
● A design for OnLine Transaction Processing (OLTP)
● A language for real-time queries and row-level updates
Features of Hive
Hive is an open-source data warehousing and SQL-like query language system built on top of the Hadoop
ecosystem. It provides a way to query, analyze, and manage large datasets stored in Hadoop's distributed storage
systems, such as HDFS, by using a SQL-like language called Hive Query Language (HQL).
1. Data Warehousing: Hive is designed for data warehousing and analytics. It allows users to structure and
manage their data using tables and columns similar to a relational database, making it accessible for analysis
using SQL-like queries.
2. Schema on Read: Hive follows a "schema on read" approach, which means that data is stored in its raw form,
and the schema is applied during query processing. This flexibility enables handling semi-structured and
unstructured data.
3. Hive Query Language (HQL): HQL is a SQL-like language that allows users to write queries to extract,
transform, and analyze data stored in Hadoop clusters. It supports various SQL operations, including filtering,
aggregation, and joins (a short HiveQL sketch follows this list).
4. Tables and Metastore: Hive defines tables, columns, and partitions for data organization. It stores metadata
about these tables and their schemas in a metastore, which can be backed by a traditional relational database or
a compatible storage solution.
5. Optimization and Execution: Hive optimizes queries by generating a plan that is executed by Hadoop
MapReduce or other execution engines (like Tez or Spark). Hive also provides mechanisms for optimizing
query performance through techniques like predicate pushdown and map-side joins.
6. User-Defined Functions (UDFs): Hive supports custom user-defined functions (UDFs) that can be written in
various programming languages like Java, Python, or Scala. UDFs allow users to extend Hive's capabilities and
perform custom transformations.
7. Integration with Hadoop Ecosystem: Hive seamlessly integrates with other Hadoop ecosystem components
like HBase, HDFS, and YARN. It can interact with data stored in these systems and leverage their
capabilities.
8. Partitioning and Bucketing: Hive supports partitioning, where data is divided into partitions based on certain
criteria, such as date or location. This improves query performance by reducing the amount of data scanned.
Bucketing is another mechanism that helps organize data within partitions for further optimization.
9. Data Storage Formats: Hive supports various storage formats, such as ORC (Optimized Row Columnar) and
Parquet, which are designed to improve data compression and query performance.
10. Data Transformation: While not as powerful as traditional ETL tools, Hive supports basic data
transformation and cleansing through its HQL queries.
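As a rough sketch of how points 3, 8, and 9 fit together (the table names, column names, and values here are hypothetical), a partitioned, bucketed ORC table and a typical analytical query could be written in HiveQL as:

    -- Partitioned, bucketed table stored as ORC (hypothetical schema)
    CREATE TABLE sales (
        order_id    INT,
        customer_id INT,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 8 BUCKETS
    STORED AS ORC;

    -- Filtering, aggregation, and a join over that table
    SELECT c.city, SUM(s.amount) AS total_sales
    FROM sales s
    JOIN customers c ON s.customer_id = c.customer_id
    WHERE s.order_date = '2023-01-01'
    GROUP BY c.city;

Similarly, a user-defined function from point 6, packaged in a JAR, could be registered and called as follows (the JAR path and class name are placeholders):

    ADD JAR /home/cloudera/my-udfs.jar;
    CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpperUDF';
    SELECT to_upper(c.city) FROM customers c;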
Hive is particularly useful for analysts and data engineers who are familiar with SQL and want to leverage their
existing skills for querying and analyzing large datasets stored in Hadoop. It abstracts the complexity of writing
MapReduce jobs while providing a familiar SQL-like interface, making big data processing more accessible to
a wider range of users.
Architecture of HIVE:
Hive architecture consists of various components that work together to provide a data warehousing and
querying solution for large datasets stored in the Hadoop ecosystem. The architecture of Hive can be divided
into the following key components:
1. Client Interface:
- Hive CLI (Command-Line Interface): The Hive CLI is a command-line tool that allows users to interact with
Hive by submitting Hive Query Language (HQL) commands and queries.
- HiveServer2: HiveServer2 is a more advanced version of the Hive server that provides a Thrift-based
interface for clients to submit queries and retrieve results. It supports multiple concurrent connections and offers
improved performance and security features.
- Hive Web UI: Hive provides a web-based user interface (UI) that allows users to submit and monitor
queries using a web browser.
2. Hive Metastore:
- The Hive Metastore is a critical component that stores metadata about Hive tables, columns, partitions, and
storage formats.
- It can be backed by a traditional relational database such as MySQL or PostgreSQL; by default, Hive ships with
an embedded Apache Derby database suited to single-user setups.
3. Hive Execution Engine:
- The execution engine is responsible for processing Hive queries and transforming them into a series of
MapReduce, Tez, or Spark jobs that run on the Hadoop cluster.
- The default execution engine is MapReduce, but Hive also supports Tez and Spark for faster query
execution.
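As a small illustration of components 2 and 3, the execution engine can be chosen per session, and the metadata kept in the metastore can be inspected directly from HiveQL (the table name is a placeholder):

    -- Choose the execution engine for the current session (MapReduce is the default)
    SET hive.execution.engine=mr;    -- or tez / spark where available

    -- Inspect metadata stored in the metastore
    SHOW TABLES;
    DESCRIBE FORMATTED sales;
    SHOW PARTITIONS sales;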
IMPLEMENTATION:
1. The CSV dataset is dragged and dropped from Windows onto the local machine (Cloudera VM), and is then
copied from the local machine into HDFS.
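A minimal HiveQL sketch of this step, assuming a hypothetical file /user/cloudera/sales.csv has already been copied into HDFS and that its columns match the schema below:

    -- Table whose layout matches the CSV columns (hypothetical schema)
    CREATE TABLE sales_csv (
        order_id    INT,
        customer_id INT,
        amount      DOUBLE,
        order_date  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Load the file already present in HDFS into the table
    LOAD DATA INPATH '/user/cloudera/sales.csv' INTO TABLE sales_csv;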
7. Query execution and visualization (using aggregate functions & pie chart):
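For example, an aggregate query of the kind whose result can be rendered as a pie chart in Hue (column names follow the hypothetical sales_csv table above):

    SELECT order_date,
           COUNT(*)    AS num_orders,
           SUM(amount) AS total_amount
    FROM sales_csv
    GROUP BY order_date;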
8. Use of the WHERE clause:
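A simple filtered query on the same hypothetical table:

    SELECT order_id, amount
    FROM sales_csv
    WHERE amount > 1000 AND order_date = '2023-01-01';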
CONCLUSION:
Hive on the Cloudera platform shows strong capabilities for managing data and creating visualizations.
By structuring the dataset into organized tables, we could efficiently explore its attributes using SQL-like
queries. Hive's integration with visualization tools such as Hue allowed us to translate query results into easily
understandable visuals, like bar charts. This experiment highlighted Hive's role in simplifying data analysis and
visualization for better insights.