Hive File Format

The document provides an overview of Hive file formats, including Text File, Sequential File, and RCFile, along with their pros and cons. It also covers Hive Query Language (HQL), including DDL and DML statements, and details about internal and external tables, partitioning, bucketing, views, subqueries, joins, and aggregation functions. Additionally, it explains how to manage data loading and querying efficiently in Hive.

Uploaded by

mohammed.twaha08
HIVE FILE FORMAT


• The file formats in Hive specify how records are encoded in a file.
• The main file formats in Hive are:
1. Text File
2. Sequence File
3. RCFile (Record Columnar File)
TEXT FILE
• The default file format is text file.
• In this format, each record is a line in the file.
• Different control characters are used as delimiters.
• The default delimiters are ^A (octal 001, separates fields), ^B (octal 002, separates the elements of an array or struct), ^C (octal 003, separates the key from the value in a map entry), and \n (separates records).
• CSV and TSV files are supported as text files. JSON or XML documents can also be stored as text files.
• Pros
Simple and easy to use
Human-readable
• Cons
Inefficient storage (data is kept as uncompressed plain text by default)
Slow performance due to lack of indexing
Example:
-- Comma-separated text table: one record per line.
CREATE TABLE students (
id INT,
name STRING,
marks FLOAT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Sequence File
• Sequence files are flat files that store binary key-value pairs.
• They support compression, which reduces storage and I/O requirements.
• Pros:
Supports compression (Snappy, Gzip, etc.)
Faster read/write compared to text files.
• Cons:
Not human-readable
Example:
CREATE TABLE students_seq (
id INT,
name STRING,
marks FLOAT
) STORED AS SEQUENCEFILE;
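Because a sequence file is binary, a plain text file cannot simply be loaded into students_seq; the rows are usually written with an INSERT ... SELECT from a text-format table. The sketch below assumes that the students table from the Text File example already holds data and that the Snappy codec is available on the cluster. The property names are the Hadoop 2 style ones and may differ on older installations.

```sql
-- Assumption: run in the same session, before the INSERT below.
-- Compress the files that Hive writes, block by block, with Snappy.
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.type = BLOCK;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

-- Rewrite the text-format rows as a compressed sequence file.
INSERT OVERWRITE TABLE students_seq
SELECT id, name, marks FROM students;
```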
RCFile (Record Columnar File)
• RCFile stores data in a column-oriented manner: rows are split into row groups, and within each row group the values are stored column by column.
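An RCFile table is declared in the same way as the other formats. The following is a minimal sketch, not part of the original slides: the table name students_rc is hypothetical, and the students table from the Text File example is assumed to exist.

```sql
-- Same columns as the earlier examples, stored column-wise within row groups.
CREATE TABLE students_rc (
    id INT,
    name STRING,
    marks FLOAT
) STORED AS RCFILE;

-- Populate the table from the text-format table.
INSERT OVERWRITE TABLE students_rc
SELECT id, name, marks FROM students;

-- A query that names a single column is where a columnar layout helps most.
SELECT AVG(marks) FROM students_rc;
```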
HIVE QUERY LANGUAGE (HQL)
• Hive query language provides basic SQL-like operations.
• Typical tasks:
1. Create and manage tables and partitions.
2. Support various relational, arithmetic, and logical operators.
3. Evaluate functions.
4. Export the contents of a table to a local directory, or write the results of queries to an HDFS directory.
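The export task in item 4 can be sketched with INSERT OVERWRITE DIRECTORY. Both directory paths are hypothetical, and a STUDENT table with rollno and name columns is assumed.

```sql
-- Write a table's rows to a directory on the local file system.
-- OVERWRITE replaces whatever the target directory already contains.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/student_export'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT * FROM STUDENT;

-- Write a query result to a directory in HDFS (no LOCAL keyword).
INSERT OVERWRITE DIRECTORY '/tmp/student_names'
SELECT rollno, name FROM STUDENT;
```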
DDL (Data Definition Language) Statements
• Used to build and modify tables and other objects in the database.
• Commands:
1. Create/Drop/Alter Database
2. Create/Drop/Truncate Table
3. Alter Table/Partition/Column
4. Create/Drop/Alter View
5. Create/Drop/Alter Index (indexes were removed in Hive 3.0)
6. Show
7. Describe
DML (Data Manipulation Language) Statements
• Used to retrieve, store, modify, delete, and update data in the database.
• Commands:
1. Loading files into tables.
2. Inserting data into Hive tables from queries.
• Starting the Hive shell: run the hive command at a terminal prompt.
Database
• A database is like a container for data. It has a collection of tables that
house the data.
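A short sketch of the usual database statements follows. The database name STUDENTS and its comment are hypothetical.

```sql
-- Create the database only if it is not already present.
CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'Student details';

-- List all databases, then inspect the new one.
SHOW DATABASES;
DESCRIBE DATABASE STUDENTS;

-- Make it the current database for the statements that follow.
USE STUDENTS;
```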
Tables
There are two kinds of tables:
1. Internal or Managed Table
2. External Table

Internal Tables (Managed Tables)


1. Hive stores managed tables under the warehouse directory (by default /user/hive/warehouse, configurable through hive.metastore.warehouse.dir).
2. Hive manages the entire lifecycle of these tables.
3. When an internal table is dropped, both data and metadata are removed.
To check if a table is managed or external:
DESCRIBE FORMATTED STUDENT;
It shows metadata, including the table type (MANAGED_TABLE or EXTERNAL_TABLE).
External (Self-Managed) Tables
1. Dropping an external table does NOT delete its data.
2. Use the EXTERNAL keyword to create one.
3. Specify a location for data storage.
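The three points above can be sketched as follows. The column list is inferred from the later queries on EXT_STUDENT, and the HDFS location /STUDENT_INFO is a hypothetical path.

```sql
-- EXTERNAL tells Hive that it does not own the files under LOCATION.
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT (
    rollno INT,
    name STRING,
    gpa FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/STUDENT_INFO';

-- Dropping the table removes only its metadata;
-- the files under /STUDENT_INFO stay in place.
-- DROP TABLE EXT_STUDENT;
```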
Loading Data into a Table

Difference Between INTO TABLE and OVERWRITE INTO TABLE

INTO TABLE: Appends the new data to the data already in the table.
OVERWRITE INTO TABLE: Replaces the existing data with the new data.
Example Scenario:
The EXT_STUDENT table has 100 records.
student.tsv contains 10 records.
LOAD DATA ... INTO TABLE EXT_STUDENT; → The table now has 110 records.
LOAD DATA ... OVERWRITE INTO TABLE EXT_STUDENT; → The table now has 10 records.
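Written out in full, the two variants might look like the sketch below. It assumes the file sits on the local file system at the path used elsewhere in these slides; note that the complete keyword sequence for the second form is OVERWRITE INTO TABLE.

```sql
-- Append: the file's rows are added to whatever the table already holds.
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv'
INTO TABLE EXT_STUDENT;

-- Replace: the table's current contents are discarded first.
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv'
OVERWRITE INTO TABLE EXT_STUDENT;

-- Check the row count after either statement.
SELECT COUNT(*) FROM EXT_STUDENT;
```

Without the LOCAL keyword, the path is interpreted as an HDFS path and the file is moved into the table's location rather than copied.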
Collection Data Types
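Hive provides three collection types: ARRAY, MAP, and STRUCT. The sketch below is an illustration only; the table STUDENT_INFO, its columns, the delimiters, and the map key 'maths' are all hypothetical.

```sql
CREATE TABLE IF NOT EXISTS STUDENT_INFO (
    rollno INT,
    name STRING,
    subjects ARRAY<STRING>,                  -- ordered list of values
    marks MAP<STRING, INT>,                  -- subject name -> mark
    address STRUCT<city:STRING, pin:INT>     -- named fields
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

-- Arrays are indexed from 0, maps are read by key, struct fields by name.
SELECT name,
       subjects[0],
       marks['maths'],
       address.city
FROM STUDENT_INFO;
```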
Partitions
Without partitions, a query scans the entire dataset even when it has a WHERE filter. Partitioning reduces I/O and improves query performance, because a query that filters on the partition column reads only the matching partitions.
Example Scenario:
• XYZ Enterprise has customer data across multiple US states. The IT team needs state-wide sales reports.
• Two options:
1. Without Partitioning
Run queries with WHERE StateName = "A", which scans all records every time. This is inefficient for large datasets.
2. With Partitioning
Create separate folders for each state. Querying a state scans only its folder.

Types of Partitioning
1. Static Partitioning
User manually assigns data to partitions.
Example: Placing State A’s data in a folder named State_A.
2. Dynamic Partitioning
Hive automatically creates partitions based on unique values in a column.
Static Partitioning
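A minimal sketch of creating a statically partitioned table and filling one partition by hand follows. The table name matches the ALTER TABLE statement in this section; the column list and the gpa value 4.0 are assumptions, and EXT_STUDENT is assumed to have rollno, name, and gpa columns.

```sql
-- The partition column gpa is declared in PARTITIONED BY, not in the column list.
CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (
    rollno INT,
    name STRING
)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

-- Static partitioning: the user names the target partition explicitly,
-- so the SELECT supplies only the non-partition columns.
INSERT OVERWRITE TABLE STATIC_PART_STUDENT
PARTITION (gpa = 4.0)
SELECT rollno, name FROM EXT_STUDENT WHERE gpa = 4.0;
```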

Adding Another Partition:


ALTER TABLE STATIC_PART_STUDENT ADD PARTITION (gpa = 3.5);
Dynamic Partitioning
Objective: Create a dynamic partition based on the gpa column.
Table Creation:
CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT (
rollno INT,
name STRING
)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Enabling Dynamic Partitioning:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Loading Data into Dynamic Partition:
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT
PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;
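After the insert, the partitions that Hive created can be listed and queried. This is a small sketch; the gpa value 4.0 is only an illustrative filter.

```sql
-- One partition is listed per distinct gpa value found in EXT_STUDENT.
SHOW PARTITIONS DYNAMIC_PART_STUDENT;

-- Filtering on the partition column lets Hive read only that partition.
SELECT rollno, name
FROM DYNAMIC_PART_STUDENT
WHERE gpa = 4.0;
```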
Bucketing
• Partitioning creates directories for each unique column value.
• Bucketing groups data into a fixed number of buckets (files) instead of creating thousands of partitions.
Objective: Implement Bucketing.
Table Creation:
CREATE TABLE IF NOT EXISTS STUDENT (
rollno INT,
name STRING,
grade FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Loading Data:
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' INTO TABLE STUDENT;
Enabling Bucketing:
-- Needed on Hive versions before 2.0; from Hive 2.0 onward bucketing is always enforced.
SET hive.enforce.bucketing = true;
Objective: Create a bucketed table with 3 buckets.
Table Creation:
CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (
rollno INT,
name STRING,
grade FLOAT
)
CLUSTERED BY (grade) INTO 3 BUCKETS;
Loading Data:
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, grade;
Querying Buckets:
SELECT DISTINCT GRADE FROM STUDENT_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 3 ON GRADE);
Views
Views are logical objects in Hive (available from version 0.6). Views do not store data; they act as saved queries.

Objective: Create a view named STUDENT_VIEW.


CREATE VIEW STUDENT_VIEW AS
SELECT rollno, name FROM EXT_STUDENT;
Querying a View:
SELECT * FROM STUDENT_VIEW LIMIT 4;
Dropping a View:
DROP VIEW STUDENT_VIEW;
Subqueries
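Up to Hive 0.12, subqueries are supported only in the FROM clause (Hive 0.13 added IN and EXISTS subqueries in the WHERE clause). A FROM-clause subquery must be given an alias, and its columns must have unique names.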
Objective: Count occurrences of words in a file.
CREATE TABLE docs (line STRING);

LOAD DATA LOCAL INPATH '/root/hivedemos/lines.txt'
OVERWRITE INTO TABLE docs;

-- The subquery w emits one row per word; the outer query counts each word.
CREATE TABLE word_count AS
SELECT word, COUNT(1) AS count FROM
(SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

SELECT * FROM word_count;

split() breaks each line into an array of words, and explode() outputs each element of that array as a separate row.
Joins
Joins work similarly to SQL.
Objective: Join STUDENT and DEPARTMENT tables using rollno.
Table Creation:
CREATE TABLE IF NOT EXISTS STUDENT (
rollno INT,
name STRING,
gpa FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

CREATE TABLE IF NOT EXISTS DEPARTMENT (
rollno INT,
deptno INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Loading Data:
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv'
OVERWRITE INTO TABLE STUDENT;

LOAD DATA LOCAL INPATH '/root/hivedemos/department.tsv'
OVERWRITE INTO TABLE DEPARTMENT;
Performing a Join:
SELECT a.rollno, a.name, a.gpa, b.deptno
FROM STUDENT a
JOIN DEPARTMENT b
ON a.rollno = b.rollno;

Aggregation
Hive supports aggregate functions such as AVG, COUNT, SUM, MIN, and MAX.
Objective: Perform aggregation functions.
SELECT AVG(gpa) FROM STUDENT;
SELECT COUNT(*) FROM STUDENT;
GROUP BY and HAVING

GROUP BY groups rows that share the same column values. HAVING keeps only the groups that satisfy a condition.
Objective: Group by rollno, name, and gpa, and keep only the groups with gpa > 4.0.
SELECT rollno, name, gpa
FROM STUDENT
GROUP BY rollno, name, gpa
HAVING gpa > 4.0;
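HAVING is most often applied to an aggregate computed per group. The sketch below assumes the STUDENT and DEPARTMENT tables from the Joins section; the threshold 3.0 and the aliases students and avg_gpa are illustrative.

```sql
-- Average gpa per department, keeping only departments above the threshold.
SELECT b.deptno,
       COUNT(*) AS students,
       AVG(a.gpa) AS avg_gpa
FROM STUDENT a
JOIN DEPARTMENT b
ON a.rollno = b.rollno
GROUP BY b.deptno
HAVING AVG(a.gpa) > 3.0;
```

WHERE filters individual rows before grouping, whereas HAVING filters the groups after the aggregates have been computed.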
