Introduction to Hive
The Hadoop ecosystem contains different sub-projects (tools), such as Sqoop, Pig, and Hive, that support the core Hadoop modules.
Sqoop: used to import and export data between HDFS and relational databases (RDBMS).
There are three main approaches to processing data on Hadoop:
The traditional approach: writing Java MapReduce programs to process structured, semi-structured, and unstructured data.
The scripting approach: using Pig to process structured and semi-structured data with MapReduce.
The Hive Query Language (HiveQL or HQL): using Hive to process structured data with MapReduce.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive:
It stores the schema in a database (the metastore) and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
It is capable of analyzing large datasets stored in HDFS.
It allows different storage types such as plain text, RCFile, and HBase.
It uses indexing to accelerate queries.
It can operate on compressed data stored in the Hadoop ecosystem.
It supports user-defined functions (UDFs), through which users can plug in their own functionality.
Limitations of Hive:
Hive Architecture:
Working of Hive:
Hive Physical Architecture:
1. Cloud Infrastructure
32 GB memory (RAM)
4-core processor (CPU)
200–320 GB hard disk (storage)
2. Virtualization in Hadoop
3. Rack Awareness
Hadoop knows where (which rack or switch) each machine (node) is located. This is called
location awareness.
Why is this important? HDFS places block replicas on different racks, so even if an entire rack fails (for example, due to a power issue), the data is not lost.
All the data types in Hive are classified into four types, given as follows:
Column Types
Literals
Null Values
Complex Types
Column Types:
Column types are used as the column data types of Hive tables. They are as follows:
1. Integral Types
Integer data is specified using the integral data types; the default is INT. When the data range exceeds
the range of INT, use BIGINT, and when the data range is smaller than INT, use SMALLINT.
TINYINT is smaller still than SMALLINT.
2. String Types
String data can be specified using single quotes (' ') or double quotes (" "). Hive provides two
string data types, VARCHAR and CHAR, and follows C-style escape characters.
3. Timestamp
It supports traditional UNIX timestamps with optional nanosecond precision, in the
java.sql.Timestamp format yyyy-mm-dd hh:mm:ss.fffffffff.
4. Dates
DATE values are described in year/month/day format, in the form YYYY-MM-DD.
5. Decimals
The DECIMAL type in Hive is the same as Java's BigDecimal format. It is used for representing
immutable arbitrary-precision decimal numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
6. Union Types
Union is a collection of heterogeneous data types. You can create an instance using the
create_union function. The syntax and an example are as follows:
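UNIONTYPE<data_type, data_type, ...>
As an illustration, here is a sketch of a table that combines the column types described above (the table and column names are invented for the example):
CREATE TABLE IF NOT EXISTS types_demo (
  tiny_val TINYINT,
  id INT,
  big_id BIGINT,
  name VARCHAR(50),
  created_at TIMESTAMP,
  birth_date DATE,
  salary DECIMAL(10,2),
  extra UNIONTYPE<INT, DOUBLE, STRING>
);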
Literals
The following literals are used in Hive:
1. Floating Point Types
Floating point types are numbers with decimal points. This data is generally represented using the DOUBLE data type.
2. Decimal Type
Decimal type data is a floating point value with a higher range than the DOUBLE data type. The
range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The complex types in Hive are arrays, maps, and structs.
Working with Databases and Tables:
In Hive, a table is a collection of data that is organized according to a schema.
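CREATE DATABASE IF NOT EXISTS mydatabase;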
This statement creates a database named "mydatabase"; the IF NOT EXISTS clause ensures it is
only created if it doesn't already exist.
USE mydatabase;
This line switches to the "mydatabase" database so that further operations are carried out in it.
CREATE TABLE IF NOT EXISTS employees ( id INT, name STRING, age INT);
The "employees" table is created with this statement. It has three columns: "id" (integer),
"name" (string), and "age" (integer). The table is only generated if it doesn't already exist
thanks to the IF NOT EXISTS condition.
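Next, a sketch of an external table definition (the column list, row format, and HDFS path are assumptions for illustration):
CREATE EXTERNAL TABLE IF NOT EXISTS ext_employees (
  id INT,
  name STRING,
  age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/external/employees';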
This HiveQL command creates a new external table called "ext_employees." External tables
point to data that is kept in a location managed outside of Hive, preserving the original
location of the data. The LOCATION clause specifies the HDFS path where the data resides.
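For example (the HDFS path and the row values are assumptions):
LOAD DATA INPATH '/user/hive/data/employees.csv' INTO TABLE employees;
INSERT INTO TABLE employees VALUES (101, 'Asha', 29);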
The LOAD DATA statement loads data into the designated table from an HDFS path, and the
INSERT INTO TABLE statement adds a specific row of data to the "employees" table.
One of the core functions of Apache Hive is querying data with HiveQL. Using HiveQL, a
language comparable to SQL, you can retrieve, filter, transform, and analyse data stored in
Hive tables.
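SELECT * FROM employees;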
This HiveQL query retrieves all records from the "employees" table.
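SELECT * FROM employees WHERE age > 25;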
This selects only those records from the "employees" table whose "age" is greater than 25.
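SELECT COUNT(*) FROM employees;
SELECT AVG(age) FROM employees;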
These HiveQL queries use aggregation operations on the "employees" table to count the
number of employees and determine their average age.
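For example (the department_name column is an assumption):
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;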
This query joins the "employees" and "departments" tables on the "department_id" column
to retrieve employee names and their related departments.
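For example, assuming departments are identified by a department_id column:
SELECT department_id, COUNT(*) AS employee_count
FROM employees
GROUP BY department_id;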
This query groups employees by department and counts the number of employees in each
department.
Data filtering and sorting are crucial data analysis activities, and HiveQL offers the means to
carry them out on data contained in Hive tables. To filter and sort data using HiveQL, follow
these steps:
1. Data Filtering: You can use the WHERE clause to filter rows based on specific conditions.
The "marks" table's field must be greater than 60 in order for this query to return all items
with that value.
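A sketch, assuming the column is named score:
SELECT * FROM marks WHERE score > 60;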
2. Sorting Data: You can use the ORDER BY clause to order the result set according to one or
more columns.
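For example, using the same assumed score column:
SELECT * FROM marks ORDER BY score DESC;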
3. Combining Filtering and Sorting: To obtain particular subsets of data in a specified order,
you can combine filtering and sorting.
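For example:
SELECT * FROM marks
WHERE score > 60
ORDER BY score DESC;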
Here are some examples of data transformations and aggregations you can perform with HiveQL:
1. Data Transformations: HiveQL provides a number of built-in functions for changing the data
in your query.
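SELECT UPPER(name) FROM employees;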
This query selects the "name" column from the "employees" table and uses the UPPER
function to convert the names to uppercase.
2. Aggregations: Using functions like COUNT, SUM, AVG, and others, aggregations let you
summarize data.
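SELECT AVG(age) FROM employees;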
Using the AVG function, this query determines the average age of the employees in the
"employees" table.
3. Grouping and Aggregating: To group data into categories, the GROUP BY clause is used with
aggregate functions.
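SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;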
This query groups the employees by the "department" column and uses the COUNT function
to count the number of employees in each department.
4. Filtering Before Aggregating: Data can be filtered and transformed before aggregations
are applied.
Example: Calculate the average age of the employees who are over 35.
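SELECT AVG(age) FROM employees WHERE age > 35;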
This HiveQL query first filters the employees to those over the age of 35, then determines the
average age of that subset.
HiveQL's advanced features, such as joins and subqueries, let you combine data from
multiple tables and run sophisticated queries.
1. Joins: Joins let you merge rows from multiple tables based on a shared column. Common
join types include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
Example: Retrieve employee names and their corresponding departments using an inner join.
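A sketch (the department_name column is an assumption):
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;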
Based on the "department_id" column, this query combines information from the
"employees" and "departments" tables to retrieve employee names and their related
departments.
2. Subqueries: A subquery is a query nested inside another query. Subqueries can appear in
the SELECT, WHERE, and FROM clauses.
Example: Determine the average age of employees in each department using a subquery
within a SELECT statement.
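A sketch using a subquery in the FROM clause, since Hive's support for scalar subqueries in the SELECT list is limited (the department_name column is an assumption):
SELECT d.department_name, sub.avg_age
FROM departments d
JOIN (
  SELECT department_id, AVG(age) AS avg_age
  FROM employees
  GROUP BY department_id
) sub ON d.department_id = sub.department_id;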
This query uses a subquery to determine the average age of employees for each department
in the "departments" table.
3. Correlated Subqueries: An inner query that depends on results from the outer query is
referred to as a correlated subquery.
Example: Find employees whose age is higher than their department's average age.
SELECT id, name
FROM employees e
WHERE age > (
  SELECT AVG(age)
  FROM employees
  WHERE department_id = e.department_id
);
This query uses a correlated subquery to locate employees whose age is higher than the
average age of employees in the same department.