
Cheat Sheet: Hive Basics

Hive functions perform operations within HQL queries. User defined functions (UDFs) take columns as arguments and return a single value, while user defined table generating functions (UDTFs) split a column into multiple rows. Partitioning divides tables into parts based on values such as date or department. HCatalog is a metadata and table management system that allows data to be stored in any format on Hadoop. The Hive SELECT command retrieves the specified columns from tables, with optional clauses for filtering, sorting, grouping, aggregation, and counting.

Uploaded by

Travis Scott
Copyright
© All Rights Reserved

HIVE CHEAT SHEET

Hive Basics

Apache Hive

It is a data warehouse infrastructure built on the Hadoop framework, well suited to data summarization, analysis and querying. It uses an SQL-like language called HQL (Hive Query Language).

HQL: A query language that also lets custom MapReduce logic be plugged into Hive queries to perform more sophisticated analysis of the data.

Table: A table in Hive holds logically stored data.

Hive Interfaces:
• Web UI
• Hive command line
• HDInsight (Windows Server)

Components of Hive

Meta store: The meta store is where the schemas of the Hive tables are stored. It holds the information about the tables and partitions that are in the warehouse.

SerDe: Serializer/Deserializer, which gives instructions to Hive on how to process records.

Thrift

A Thrift service is used to provide remote access from other processes.

Meta Store

This is a service which stores metadata such as table schemas.

Indexes

Indexes are created to speed up access to columns in the database.
Syntax: CREATE INDEX <index_name> ON TABLE <table_name>

Hive Function Meta Commands

Show functions: Lists Hive functions and operators
Describe function [function name]: Displays a short description of the particular function
Describe function extended [function name]: Displays an extended description of the particular function

Figure: Hive architecture. User Interface (Web UI, Hive command line, HDInsight); Meta Store; HiveQL Process Engine; Execution Engine; MapReduce; HDFS or HBase data storage.

Hive Functions

• UDF (User Defined Functions): A function that takes one or more columns from a row as arguments and returns a single value
• UDTF (User Defined Table-generating Functions): A function that produces multiple columns or rows of output from zero or more inputs, for example by taking a column from a single record and splitting it into multiple rows
• Macros: A function that is defined in terms of other Hive functions
• User defined aggregate functions: A user defined function that takes multiple rows or columns and returns the aggregation of the data

Hive SELECT Command

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

• Select: A projection operator in HiveQL, which scans the table specified by the FROM clause
• Where: A condition which specifies what to filter
• Group by: Takes a list of columns, which specifies how to aggregate the records
• Cluster by, Distribute by, Sort by: Specify how rows are distributed and clustered, and the order for sorting
• Limit: Specifies how many records are retrieved

Hive Data Types

Integral data types:
• Tinyint
• Smallint
• Int
• Bigint

String types:
• VARCHAR: length 1 to 65535
• CHAR: length up to 255

Timestamp: Supports the traditional Unix timestamp with optional nanosecond precision
• Dates
• Decimals

Union type: A collection of heterogeneous data types.
• Syntax: UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

Complex types:
• Arrays: ARRAY<data_type>
• Maps: MAP<primitive_type, data_type>
• Structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>

Partitioner

The Partitioner controls the partitioning of the keys of the intermediate map outputs, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for a job.

• Partitioning: It is used for distributing load horizontally. It is a way of dividing tables into related parts based on values such as date, city, department etc.

HCatalog

It is a metadata and table management system for the Hadoop platform which enables storage of data in any format.

Bucketing

It is a technique to decompose datasets into more manageable parts.

Hive commands in HQL

Data Definition Language (DDL): It is used to build or modify tables and other objects stored in a database. Some of the DDL commands are as follows:
• To create a database in Hive: create database <database name>
• To list the databases created in a Hive warehouse: show databases
• To use the database created: use <database name>
• To describe the associated database in metadata: describe database <database name>
• To alter the database created: alter database <database name> set dbproperties (<property name>=<property value>)

Data Manipulation Language (DML): These statements are used to retrieve, store, modify, delete, insert and update data in a database.
• Inserting data in a database: The LOAD statement is used to move data into a particular Hive table.
  LOAD DATA [LOCAL] INPATH '<file path>' INTO TABLE <table name>;
• Drop table: The drop table statement deletes the data and metadata of the table: drop table <table name>
• Aggregation: It is used to count the different categories in a table: select count(DISTINCT category) from tablename;
• Grouping: The GROUP BY clause groups the result set so that an aggregate is computed per group: select <category>, sum(amount) from <txt records> group by <category>
• To exit from the Hive shell: Use the command quit

Operations Performed on Hive

Function                                      HQL Query
To retrieve information                       SELECT from_columns FROM table WHERE conditions;
To select all values                          SELECT * FROM table;
To select a particular category of values     SELECT * FROM table WHERE rec_name = "value";
To select for multiple criteria               SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2";
For selecting specific columns                SELECT column_name FROM table;
To retrieve unique output records             SELECT DISTINCT column_name FROM table;
For sorting                                   SELECT col1, col2 FROM table ORDER BY col2;
For sorting backwards                         SELECT col1, col2 FROM table ORDER BY col2 DESC;
For counting rows from the table              SELECT COUNT(*) FROM table;
For grouping along with counting              SELECT owner, COUNT(*) FROM table GROUP BY owner;
For selecting maximum values                  SELECT MAX(col_name) FROM table;
Selecting from multiple tables and joining    SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);

Command Line Statements

Function                                            Hive Commands
To run the query                                    hive -e 'select a.col from tab1 a'
To run a query in a silent mode                     hive -S -e 'select a.col from tab1 a'
To select hive configuration variables              hive -e 'select a.col from tab1 a' --hiveconf hive.root.logger=DEBUG,console
To use the initialization script                    hive -i initialize.sql
To run the non-interactive script                   hive -f script.sql
To run script inside the shell                      source file_name
To run the list command                             dfs -ls /user
To run ls (bash command) from the shell             !ls
To set configuration variables                      set mapred.reduce.tasks=32
Tab auto completion                                 set hive.<TAB>
To display all variables starting with hive         set
To revert all variables                             reset
To add jar files to distributed cache               add jar jar_path
To display all the jars in the distributed cache    list jars
To delete jars from the distributed cache           delete jar jar_name

Metadata Functions and Query

Function                          Hive Commands
Selecting a database              USE database;
Listing databases                 SHOW DATABASES;
Listing tables in a database      SHOW TABLES;
Describing format of a table      DESCRIBE [FORMATTED|EXTENDED] table;
Creating a database               CREATE DATABASE db_name;
Dropping a database               DROP DATABASE db_name [CASCADE];
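
The Partitioner and Bucketing entries above rest on the same idea: hash a key, then take the remainder by the number of parts. The Python sketch below mimics that rule, assuming string keys and the behaviour of Hadoop's default hash partitioner (clear the sign bit of the Java hash code, then take it modulo the number of reduce tasks). The function names `java_string_hash`, `partition_for` and `assign` are hypothetical helpers written for this illustration and are not part of Hive or Hadoop.

```python
def java_string_hash(key: str) -> int:
    """Mimic Java's String.hashCode() as a signed 32-bit integer.

    Assumption: keys contain only characters in the Basic Multilingual
    Plane, where Python code points match Java's UTF-16 code units.
    """
    h = 0
    for ch in key:
        # Keep only 32 bits to imitate Java int overflow
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed one
    return h - 0x100000000 if h >= 0x80000000 else h


def partition_for(key: str, num_parts: int) -> int:
    """Clear the sign bit of the hash, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_parts


def assign(keys, num_parts):
    """Group keys by the partition (or bucket) number they map to."""
    parts = {p: [] for p in range(num_parts)}
    for key in keys:
        parts[partition_for(key, num_parts)].append(key)
    return parts


if __name__ == "__main__":
    # Hypothetical keys, split across 3 reduce tasks (or 3 buckets)
    for part, members in assign(["hello", "a", "ab", "hive"], 3).items():
        print(part, members)
```

Because the rule depends only on the key and the number of parts, every record with the same key lands in the same partition, which is what lets a single reducer (or a single bucket file) see all rows for that key.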
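
Several of the SELECT forms listed in the cheat sheet (DISTINCT, ORDER BY ... DESC, COUNT(*), GROUP BY, JOIN ... ON) are plain SQL, so they can be tried without a Hadoop cluster. The sketch below runs them against Python's built-in `sqlite3` module as a stand-in. It is not Hive: Hive-specific clauses such as CLUSTER BY, DISTRIBUTE BY and SORT BY are not available here. The `pet` and `event` table names come from the join example in the cheat sheet; their columns and rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a Hive table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pet (name TEXT, owner TEXT)")
conn.execute("CREATE TABLE event (name TEXT, comment TEXT)")

# Hypothetical sample rows
conn.executemany(
    "INSERT INTO pet VALUES (?, ?)",
    [("Fluffy", "Harold"), ("Claws", "Gwen"),
     ("Buffy", "Harold"), ("Fang", "Benny")],
)
conn.executemany(
    "INSERT INTO event VALUES (?, ?)",
    [("Fluffy", "4 kittens"), ("Fang", "vet visit")],
)


def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Counting rows from the table
total = run("SELECT COUNT(*) FROM pet")[0][0]

# Retrieving unique output records
owners = [row[0] for row in run("SELECT DISTINCT owner FROM pet ORDER BY owner")]

# Sorting backwards
names_desc = [row[0] for row in run("SELECT name, owner FROM pet ORDER BY name DESC")]

# Grouping along with counting
per_owner = run("SELECT owner, COUNT(*) FROM pet GROUP BY owner ORDER BY owner")

# Selecting from multiple tables and joining
joined = run(
    "SELECT pet.name, comment FROM pet JOIN event "
    "ON (pet.name = event.name) ORDER BY pet.name"
)

if __name__ == "__main__":
    print(total, owners, names_desc, per_owner, joined, sep="\n")
```

The explicit ORDER BY on the grouped and joined queries is there only to make the printed output stable; neither SQLite nor Hive guarantees row order without it.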
