Introduction to Hive
The Hadoop ecosystem contains different sub-projects (tools), such as Sqoop, Pig, and Hive, that support the core Hadoop modules.
Sqoop: used to import and export data between HDFS and relational databases (RDBMS).
There are three main approaches to processing data on Hadoop:
The traditional approach: writing Java MapReduce programs to process structured, semi-structured, and unstructured data.
The scripting approach: using Pig to process structured and semi-structured data with MapReduce.
The Hive Query Language (HiveQL or HQL): using Hive to process structured data with MapReduce.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive:
It stores the schema in a database (the metastore) and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
It is capable of analyzing large datasets stored in HDFS.
It allows different storage types such as plain text, RCFile, and HBase.
It uses indexing to accelerate queries.
It can operate on compressed data stored in the Hadoop ecosystem.
It supports user-defined functions (UDFs), through which users can plug in their own functionality.
Limitations of Hive:
Hive Architecture:
Working of Hive:
Hive Physical Architecture:
1. Cloud Infrastructure
32 GB memory (RAM)
4-core processor (CPU)
200–320 GB hard disk (storage)
2. Virtualization in Hadoop
3. Rack Awareness
Hadoop knows where (which rack or switch) each machine (node) is located. This is called
location awareness.
Why is this important? HDFS places block replicas on different racks, so even if an entire rack fails (for example, due to a power issue), the data is not lost.
All the data types in Hive are classified into four types, given as follows:
Column Types
Literals
Null Values
Complex Types
Column Types:
Column types are used as the column data types of Hive tables. They are as follows:
1. Integral Types
Integer data is specified using the integral data types; the default is INT. When the data range exceeds
the range of INT, use BIGINT, and when the data range is smaller than INT, use SMALLINT.
TINYINT is smaller still than SMALLINT.
2. String Types
String data can be specified using single quotes (' ') or double quotes (" "). Hive provides two
string data types, VARCHAR and CHAR, and follows C-style escape characters.
3. Timestamp
It supports traditional UNIX timestamps with optional nanosecond precision, in the
java.sql.Timestamp format yyyy-mm-dd hh:mm:ss.fffffffff.
4. Dates
DATE values are described in year/month/day format, in the form YYYY-MM-DD.
5. Decimals
The DECIMAL type in Hive is the same as Java's BigDecimal format. It is used for representing
immutable arbitrary-precision decimal numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
6. Union Types
Union is a collection of heterogeneous data types. You can create an instance using the
create_union function. The syntax and an example are as follows:
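UNIONTYPE<data_type, data_type, ...>
As an illustration, here is a sketch of a table that combines the column types described above (the table and column names are invented for the example):
CREATE TABLE IF NOT EXISTS types_demo (
  tiny_val TINYINT,
  id INT,
  big_id BIGINT,
  name VARCHAR(50),
  created_at TIMESTAMP,
  birth_date DATE,
  salary DECIMAL(10,2),
  extra UNIONTYPE<INT, DOUBLE, STRING>
);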
Literals
The following literals are used in Hive:
1. Floating Point Types
Floating point types are numbers with decimal points. This data is generally represented using the DOUBLE data type.
2. Decimal Type
Decimal type data is a floating point value with a higher range than the DOUBLE data type. The
range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The complex types in Hive are arrays, maps, and structs.
Working with Databases and Tables:
In Hive, a table is a collection of data that is organized according to a schema.
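CREATE DATABASE IF NOT EXISTS mydatabase;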
This statement creates a database named "mydatabase"; the IF NOT EXISTS clause ensures it is
only created if it doesn't already exist.
USE mydatabase;
This line switches to the "mydatabase" database so that further operations are carried out in it.
CREATE TABLE IF NOT EXISTS employees ( id INT, name STRING, age INT);
The "employees" table is created with this statement. It has three columns: "id" (integer),
"name" (string), and "age" (integer). The table is only generated if it doesn't already exist
thanks to the IF NOT EXISTS condition.
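Next, a sketch of an external table definition (the column list, row format, and HDFS path are assumptions for illustration):
CREATE EXTERNAL TABLE IF NOT EXISTS ext_employees (
  id INT,
  name STRING,
  age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/external/employees';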
This HiveQL command creates a new external table called "ext_employees." External tables
point to data that is kept in a location managed outside of Hive, preserving the original
location of the data. The LOCATION clause specifies the HDFS path where the data resides.
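For example (the HDFS path and the row values are assumptions):
LOAD DATA INPATH '/user/hive/data/employees.csv' INTO TABLE employees;
INSERT INTO TABLE employees VALUES (101, 'Asha', 29);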
The LOAD DATA statement loads data into the designated table from an HDFS path, and the
INSERT INTO TABLE statement adds a specific row of data to the "employees" table.
One of the core functions of Apache Hive is querying data with HiveQL. Using HiveQL, a
language comparable to SQL, you can retrieve, filter, transform, and analyse data stored in
Hive tables.
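SELECT * FROM employees;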
This HiveQL query retrieves all records from the "employees" table.
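SELECT * FROM employees WHERE age > 25;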
This selects only those records from the "employees" table whose "age" is greater than 25.
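SELECT COUNT(*) FROM employees;
SELECT AVG(age) FROM employees;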
These HiveQL queries use aggregation operations on the "employees" table to count the
number of employees and determine their average age.
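For example (the department_name column is an assumption):
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;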
This query joins the "employees" and "departments" tables on the "department_id" column
to retrieve employee names and their related departments.
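For example, assuming departments are identified by a department_id column:
SELECT department_id, COUNT(*) AS employee_count
FROM employees
GROUP BY department_id;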
This query groups employees by department and counts the number of employees in each
department.
Data filtering and sorting are crucial data analysis activities, and HiveQL offers the means to
carry them out on data contained in Hive tables. To filter and sort data using HiveQL, follow
these steps:
1. Data Filtering: You can use the WHERE clause to filter rows based on specific conditions.
The "marks" table's field must be greater than 60 in order for this query to return all items
with that value.
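A sketch, assuming the column is named score:
SELECT * FROM marks WHERE score > 60;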
2. Sorting Data: You can use the ORDER BY clause to order the result set according to one or
more columns.
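For example, using the same assumed score column:
SELECT * FROM marks ORDER BY score DESC;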
3. Combining Filtering and Sorting: To obtain particular subsets of data in a specified order,
you can combine filtering and sorting.
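For example:
SELECT * FROM marks
WHERE score > 60
ORDER BY score DESC;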
Here are some examples of data transformations and aggregations you can perform with HiveQL:
1. Data Transformations: HiveQL provides a number of built-in functions for changing the data
in your query.
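SELECT UPPER(name) FROM employees;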
This query selects the "name" column from the "employees" table and uses the UPPER
function to convert the names to uppercase.
2. Aggregations: Using functions like COUNT, SUM, AVG, and others, aggregations let you
summarize data.
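SELECT AVG(age) FROM employees;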
Using the AVG function, this query determines the average age of the employees in the
"employees" table.
3. Grouping and Aggregating: To group data into categories, the GROUP BY clause is used with
aggregate functions.
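SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;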
This query groups the employees by the "department" column and uses the COUNT function
to count the number of employees in each department.
4. Filtering Before Aggregating: Data can be filtered and transformed before aggregations
are applied.
Example: Calculate the average age of the employees who are over 35.
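SELECT AVG(age) FROM employees WHERE age > 35;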
This HiveQL query first filters the employees to those over the age of 35, then determines the
average age of that subset.
HiveQL's advanced features, such as joins and subqueries, let you combine data from
multiple tables and run sophisticated queries.
1. Joins: Joins let you merge rows from multiple tables based on a shared column. Common
join types include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
Example: Retrieve employee names and their corresponding departments using an inner join.
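A sketch (the department_name column is an assumption):
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;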
Based on the "department_id" column, this query combines information from the
"employees" and "departments" tables to retrieve employee names and their related
departments.
2. Subqueries: A subquery is a query nested inside another query. Subqueries can appear in
the SELECT, WHERE, and FROM clauses.
Example: Determine the average age of employees in each department using a subquery
within a SELECT statement.
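A sketch using a subquery in the FROM clause, since Hive's support for scalar subqueries in the SELECT list is limited (the department_name column is an assumption):
SELECT d.department_name, sub.avg_age
FROM departments d
JOIN (
  SELECT department_id, AVG(age) AS avg_age
  FROM employees
  GROUP BY department_id
) sub ON d.department_id = sub.department_id;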
This query uses a subquery to determine the average age of employees for each department
in the "departments" table.
3. Correlated Subqueries: An inner query that depends on results from the outer query is
referred to as a correlated subquery.
Example: Find employees whose age is higher than their department's average age.
SELECT id, name
FROM employees e
WHERE age > (
  SELECT AVG(age)
  FROM employees
  WHERE department_id = e.department_id
);
This query uses a correlated subquery to locate employees whose age is higher than the
average age of employees in the same department.