
Introduction to Hive:

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive
that complement the core Hadoop modules.

 Sqoop: It is used to import and export data between HDFS and RDBMS.
 Pig: It is a procedural language platform used to develop scripts for MapReduce operations.
 Hive: It is a platform used to develop SQL-type scripts to do MapReduce operations.

There are various ways to execute MapReduce operations:

 The traditional approach: using a Java MapReduce program for structured, semi-structured,
and unstructured data.
 The scripting approach: using Pig to write MapReduce scripts that process structured and
semi-structured data.
 The Hive Query Language (HiveQL or HQL): using Hive to process structured data with
MapReduce.

Hive is a data warehouse infrastructure tool to process structured data in Hadoop.


It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage.
Using Hive, we can avoid the traditional approach of writing complex MapReduce programs by
hand. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
User Defined Functions (UDFs).

Hive is not:
 A relational database
 Designed for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive:
 It stores schema in a database and processed data in HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
 It is capable of analyzing large datasets stored in HDFS.
 It allows different storage types such as plain text, RCFile, and HBase.
 It uses indexing to accelerate queries.
 It can operate on compressed data stored in the Hadoop ecosystem.
 It supports user-defined functions (UDFs), with which users can plug in their own functionality.
Limitations of Hive:

 Hive is not capable of handling real-time data.

 It is not designed for online transaction processing.

 Hive queries have high latency.

Hive Architecture:
Working of Hive:
Hive Physical Architecture:
1. Cloud Infrastructure

In the cloud, there are servers (machines) that handle:

 Storage (where your data is stored)
 Computing (where your data is processed)

A typical machine has:

 32 GB memory (RAM)
 4-core processor (CPU)
 200–320 GB hard disk (storage)
2. Virtualization in Hadoop

 Hadoop can run on virtual machines (VMs).


 Virtualization helps utilize hardware resources more efficiently, and open-source
virtualization software continues to improve.

3. Rack Awareness

 Hadoop knows where (which rack or switch) each machine (node) is located. This is called
location awareness.
 Why is this important?

1. To run tasks near the data (for faster processing)

2. To keep data safe by storing copies in different racks

3. So, even if one rack fails (power issue), data is not lost.

4. Hadoop Cluster (Where Hive Works)

Small Cluster: Has 1 master node and multiple worker nodes.

Master Node runs:

 JobTracker: Assigns and manages MapReduce jobs
 TaskTracker: Runs map and reduce tasks and reports their progress
 NameNode: Keeps the index of where data is stored
 DataNode: Stores the actual data

Worker Node (slave): Acts as both a TaskTracker and a DataNode

Large Cluster: Has separate dedicated servers for better performance:

 Dedicated NameNode: Manages the file system index
 Secondary NameNode: Takes periodic snapshots (backups) of the NameNode's metadata
 Dedicated JobTracker: Handles job scheduling only
Hive Datatypes:

All the data types in Hive are classified into four types, given as follows:
 Column Types
 Literals
 Null Values
 Complex Types

Column Types:
Column types are used as the column data types of Hive tables. They are as follows:

1. Integral Types
Integer type data can be specified using the integral data types. When the data range exceeds
the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.

The following table depicts the various integral data types:

Type       Size                      Postfix   Example
TINYINT    1-byte signed integer     Y         10Y
SMALLINT   2-byte signed integer     S         10S
INT        4-byte signed integer     (none)    10
BIGINT     8-byte signed integer     L         10L

2. String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive
provides two such types: VARCHAR and CHAR. Hive follows C-style escape characters.

The following table depicts the VARCHAR and CHAR data types:

Data Type   Length
VARCHAR     1 to 65535
CHAR        Fixed length, up to 255

3. Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision. It supports the
java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and the format
"yyyy-mm-dd hh:mm:ss.ffffffffff".

4. Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
5. Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used for
representing immutable arbitrary-precision decimal numbers. The syntax and an example are as
follows:

DECIMAL(precision, scale)
decimal(10,0)
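For instance, a column holding currency amounts could be declared with a scale of two digits. A
minimal sketch (the "prices" table and its columns are hypothetical):

-- "prices" is a hypothetical example table
CREATE TABLE IF NOT EXISTS prices (item STRING, amount DECIMAL(10,2));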

6. Union Types
Union is a collection of heterogeneous data types. You can create an instance using create
union. The syntax and an example are as follows:

UNIONTYPE<data_type, data_type, ...>
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

Literals
The following literals are used in Hive:

1. Floating Point Types


Floating point types are numbers with decimal points. Generally, this type of data is treated
as the DOUBLE data type.

2. Decimal Type
Decimal type data is a floating point value with a higher range than the DOUBLE data type. The
range of the decimal type is approximately -10^-308 to 10^308.

Null Value
Missing values are represented by the special value NULL.

Complex Types

Hive supports the following complex types:
 Arrays: an ordered collection of elements of the same type. Syntax: ARRAY<data_type>
 Maps: a collection of key-value pairs. Syntax: MAP<primitive_type, data_type>
 Structs: a record of named fields. Syntax: STRUCT<col_name : data_type, ...>
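A minimal sketch of a table definition using these complex types (the "employee_details" table
and its columns are hypothetical):

-- hypothetical table illustrating ARRAY, MAP, and STRUCT columns
CREATE TABLE IF NOT EXISTS employee_details (
  id INT,
  phone_numbers ARRAY<STRING>,
  skills MAP<STRING, INT>,
  address STRUCT<street: STRING, city: STRING>
);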
Hive Query Language:


To interact with tables, databases, and queries, Hive provides an SQL-like environment through
HiveQL. HiveQL offers different types of clauses for filtering, grouping, joining, and otherwise
processing data. Hive also has JDBC connectivity.
Hive provides the following features:

 Creating databases, tables, and other forms of data modelling.
 ETL features such as data extraction, transformation, and table loading.
 Joins to combine various data tables.
 User-specific customised scripts for coding simplicity.
 A quicker querying tool built on top of Hadoop.

Creating Databases and Tables

In Hive, a table is a collection of data organized and described according to a schema.

Step 1: Create a Database

CREATE DATABASE IF NOT EXISTS mydatabase;

This statement creates a database named "mydatabase". The IF NOT EXISTS clause ensures the
database is only created if it doesn't already exist.

Step 2: Switch to a Database

USE mydatabase;

This statement switches to the "mydatabase" database so that further operations are carried out
in that database.
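You can confirm which databases and tables exist with the standard SHOW commands:

SHOW DATABASES;  -- lists all databases
SHOW TABLES;     -- lists tables in the current database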

Step 3: Create a Table

CREATE TABLE IF NOT EXISTS employees ( id INT, name STRING, age INT);

The "employees" table is created with this statement. It has three columns: "id" (integer),
"name" (string), and "age" (integer). The table is only generated if it doesn't already exist
thanks to the IF NOT EXISTS condition.

Step 4: Create an External Table


CREATE EXTERNAL TABLE IF NOT EXISTS ext_employees ( id INT, name STRING, age INT)
LOCATION '/path/to/data';

This HiveQL command creates a new external table called "ext_employees". External tables point
to data kept in a location managed independently of Hive, so the original data is preserved
even if the table is dropped. The LOCATION clause specifies the HDFS path where the data
resides.
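In practice, an external table definition usually also declares how the underlying files are
delimited. A minimal sketch, assuming comma-separated text files under a hypothetical HDFS
directory '/data/employees_csv':

-- the table name and HDFS path below are hypothetical
CREATE EXTERNAL TABLE IF NOT EXISTS ext_employees_csv ( id INT, name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/employees_csv';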

Loading Data into Tables

 Load data from HDFS

LOAD DATA INPATH '/path/to/data' INTO TABLE employees;

 Insert data into the table

INSERT INTO TABLE employees VALUES (1, 'John Doe', 30);

The LOAD DATA statement loads data files from the given HDFS path into the designated table
(the files are moved into the table's storage directory). The INSERT INTO TABLE statement
appends a single row of data to the "employees" table.
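Two common variations are worth knowing: the LOCAL keyword loads from the local filesystem
instead of HDFS, and OVERWRITE replaces the table's existing contents instead of appending. A
sketch with a hypothetical local file path:

-- '/tmp/employees.csv' is a hypothetical local file
LOAD DATA LOCAL INPATH '/tmp/employees.csv' OVERWRITE INTO TABLE employees;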

Querying Data with HiveQL

One of the core functions of using Apache Hive is data querying with HiveQL. You may obtain,
filter, transform, and analyse data stored in Hive tables using HiveQL, which is a language
comparable to SQL.

Following are a few typical HiveQL querying operations:

1. Select All Records:

SELECT * FROM employees;

This HiveQL query retrieves all records from the "employees" table.

2. Filtering: Example: Select employees older than 25

SELECT * FROM employees WHERE age > 25;

This selects only those records from the "employees" table where "age" is greater than 25.

3. Aggregation: Example: Count the number of employees

SELECT COUNT(*) FROM employees;

Example: Calculate the average age


SELECT AVG(age) FROM employees;

These HiveQL queries use aggregation operations on the "employees" table to count the number of
employees and to calculate the average age.

4. Sorting: Example: Sort by age in descending order

SELECT * FROM employees ORDER BY age DESC;

This query returns all records from the "employees" table sorted by age in descending order.

5. Joining Tables: Example: Join employees and departments based on department_id

SELECT e.id, e.name, d.department
FROM employees e
JOIN departments d ON e.department_id = d.id;

The "department_id" column is used to link the "employees" and "departments" databases in
order to access employee names and their related departments.

6. Grouping and Aggregation: Example: Count employees in each department

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;

This query groups employees by department and counts the number of employees in each group.
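To filter on an aggregated value, HiveQL also supports the standard HAVING clause. A small
sketch that keeps only departments with more than five employees (the threshold is arbitrary):

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;  -- filters groups after aggregation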

7. Limiting Results: Example: Get the top 10 oldest employees

SELECT * FROM employees ORDER BY age DESC LIMIT 10;

This query returns the ten oldest employees, sorted by age in descending order.

Data Filtering and Sorting

Data filtering and sorting are crucial data analysis activities, and HiveQL provides the means
to carry them out on data stored in Hive tables. To filter and sort data using HiveQL, follow
these steps:

1. Data Filtering: You can use the WHERE clause to filter rows based on specific conditions.

Example: Select rows where marks are more than 60.

SELECT * FROM marks WHERE marks > 60;

This query returns all rows from the "marks" table whose "marks" value is greater than 60.
2. Sorting Data: You can use the ORDER BY clause to order the result set according to one or
more columns.

Example: Sort the rows by marks in increasing order.

SELECT * FROM marks ORDER BY marks ASC;

3. Combining Filtering and Sorting: To obtain particular subsets of data in a specified order,
you can combine filtering and sorting.

Example: Select rows with marks more than 60 and sort them.

SELECT * FROM marks WHERE marks > 60 ORDER BY marks ASC;

Data Transformations and Aggregations

Here are some examples of data transformations and aggregations you can perform with HiveQL:

1. Data Transformations: HiveQL provides a number of built-in functions for transforming the
data in your query.

Example: Change the case of names

SELECT UPPER(name) as upper_case_name FROM employees;

This HiveQL query selects the "name" column from the "employees" table and uses the UPPER
function to convert the names to uppercase.
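Other built-in functions can be combined in the same way. A small sketch using the standard
LOWER, LENGTH, CONCAT, and CAST functions on the same "employees" table:

-- builds a label like "JOHN DOE - 30" and reports name lengths
SELECT CONCAT(UPPER(name), ' - ', CAST(age AS STRING)) AS label,
       LENGTH(name) AS name_length,
       LOWER(name) AS lower_case_name
FROM employees;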

2. Aggregations: Aggregations let you condense data using functions like COUNT, SUM, and AVG.

Example: Calculate the average age of the workforce.

SELECT AVG(age) as average_age FROM employees;

This query uses the AVG function to determine the average age across all employees in the
"employees" table.

3. Grouping and Aggregating: To group data into categories, the GROUP BY clause is used with
aggregate functions.

Example: Count the employees in each department.

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;

This query groups the employees by the "department" column and uses the COUNT function to
count the employees in each department.
4. Filtering Before Aggregating: Rows can be filtered and transformed before aggregations are
applied.

Example: Calculate the average age of employees who are over 35.

SELECT AVG(age) AS average_age
FROM employees
WHERE age > 35;

This HiveQL query first filters the employees to those over the age of 35 and then determines
the average age of that subset.
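The distinction to remember is that WHERE filters rows before aggregation, while HAVING filters
groups afterwards. A sketch combining both on the same hypothetical table (the thresholds are
arbitrary):

SELECT department, AVG(age) AS average_age
FROM employees
WHERE age > 35            -- applied to rows before grouping
GROUP BY department
HAVING COUNT(*) > 2;      -- applied to groups after aggregation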

Joins and Subqueries

HiveQL's advanced features, including as joins and subqueries, let you aggregate data from
various tables and run sophisticated searches.

Using HiveQL, let's examine how to use joins and subqueries:

1. Joins: Joins let you merge rows from different tables based on a shared column. Common join
types include INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN.

Example: Retrieve employees and their corresponding departments using an inner join.

SELECT e.id, e.name, d.department
FROM employees e
JOIN departments d ON e.department_id = d.id;

Based on the "department_id" column, this query combines information from the
"employees" and "departments" tables to retrieve employee names and their related
departments.
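An outer join works the same way but keeps unmatched rows. A sketch using a LEFT OUTER JOIN so
that employees without a matching department still appear, with NULL for the department:

SELECT e.id, e.name, d.department
FROM employees e
LEFT OUTER JOIN departments d ON e.department_id = d.id;  -- unmatched employees are kept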

2. Subqueries: A subquery is a query that is nested inside another query. They can be used in
the SELECT, WHERE, and FROM clauses.

Example: Determine the average age of employees in each department using a subquery in the
SELECT list.

SELECT d.department,
       (SELECT AVG(age) FROM employees e WHERE e.department_id = d.id) AS avg_age
FROM departments d;

This query uses a correlated subquery to determine the average employee age for each department
in the "departments" table.
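Depending on the Hive version, a scalar subquery in the SELECT list like the one above may not
be accepted, as Hive's subquery support is narrower than standard SQL. A join with GROUP BY
expresses the same result more portably:

SELECT d.department, AVG(e.age) AS avg_age
FROM departments d
JOIN employees e ON e.department_id = d.id
GROUP BY d.department;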

3. Correlated Subqueries: A correlated subquery is an inner query that depends on values from
the outer query.

Example: Find employees whose age is higher than their department's average.
SELECT id, name
FROM employees e
WHERE age > (SELECT AVG(age) FROM employees WHERE department_id = e.department_id);

This query uses a correlated subquery to locate employees who are older than the average age of
employees in the same department.
