Introduction To HIVE
Hive has a fascinating history tied to the world's largest social networking site: Facebook. Facebook adopted the Hadoop framework to manage
its big data. Big data refers to massive amounts of data that cannot be stored, processed, or analyzed by traditional systems.
As we know, Hadoop uses MapReduce to process data. With MapReduce, users were required to write long and extensive Java code. Not all users
were well-versed in Java and other programming languages. Users were comfortable writing queries in SQL (Structured Query Language), and they
wanted a language similar to SQL. Enter HiveQL. The idea was to incorporate the concepts of tables and columns, just like SQL.
Hive is a data warehouse system used to query and analyze large datasets stored in HDFS. Hive uses a query language called HiveQL, which
is similar to SQL.
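As a small sketch of that similarity, the HiveQL below defines a table over files in HDFS and queries it with familiar SQL syntax (the table and column names are hypothetical, used only for illustration):

```sql
-- Define a table over comma-delimited files stored in HDFS
-- (hypothetical schema for illustration)
CREATE TABLE IF NOT EXISTS employees (
  id INT,
  name STRING,
  department STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Query it just like a SQL table; Hive translates this
-- statement into MapReduce jobs behind the scenes
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
```

A user who knows SQL can read and write this without knowing any Java or MapReduce.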
As seen in the image below, the user first submits Hive queries. These queries are converted into MapReduce jobs, which run on the
Hadoop MapReduce system.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:
o Thrift Server - A cross-language service provider platform that serves requests from all programming languages that support
Thrift.
o JDBC Driver - Used to establish a connection between Hive and Java applications. The JDBC driver is provided by the class
org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - Allows applications that support the ODBC protocol to connect to Hive.
Hive Services
The following are the services provided by Hive:
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and
commands.
o Hive MetaStore - A central repository that stores all the structural information of the various tables and partitions in the warehouse. It also
includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the
corresponding HDFS files where the data is stored.
o Hive Server - Also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and
expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the execution plan in the form of a DAG of MapReduce tasks and HDFS tasks. In the end, the
execution engine executes the tasks in the order of their dependencies.
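To observe the compiler and execution engine at work, Hive's EXPLAIN statement prints the plan of stages it generates for a query. A minimal sketch, assuming a hypothetical table named employees already exists:

```sql
-- EXPLAIN shows the DAG of stages the compiler produces for a query
-- without actually running it; 'employees' is a hypothetical table.
EXPLAIN
SELECT department, COUNT(*)
FROM employees
GROUP BY department;
```

The output lists the map and reduce stages and their dependencies, which is exactly the DAG the execution engine then runs in order.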
Working of Apache Hive
Now, let’s have a look at how Hive works over the Hadoop framework.
Below is a short description of a few of its components, in question-and-answer form.
Q. What is Hive Metastore?
The Hive metastore is a database that stores metadata about your Hive tables (e.g., table name, column names and types, table location, storage handler
being used, number of buckets in the table, sorting columns if any, partition columns if any, etc.).
When you create a table, this metastore is updated with information about the new table, and that information is consulted when you issue queries on that
table.
The metastore is the central repository of Hive metadata. It has two parts: a service and the backing data. By default, it uses Derby DB on the local disk; this is referred to as the embedded
metastore configuration. It has the limitation that only one session can be served at any given point in time.
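As a sketch, much of the metadata the metastore holds for a table can be inspected directly from HiveQL (the table name here is hypothetical):

```sql
-- Prints column names and types, the table's HDFS location, its
-- SerDe (serializer/deserializer), bucketing and partition
-- information -- all of it served from the metastore.
DESCRIBE FORMATTED employees;
```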
Q. Which classes are used by Hive to read and write HDFS files?
The following classes are used by Hive to read and write HDFS files: