Cheat Sheet: Hive Basics
Hive Basics

Apache Hive
It is a data warehouse infrastructure built on the Hadoop framework and suited to data summarization, analysis and querying. It uses an SQL-like language called HQL (Hive Query Language).

HQL: It is a query language in which custom MapReduce logic can be written in Hive to perform more sophisticated analysis of the data warehouse.

The Partitioner controls how the keys of the intermediate map outputs are partitioned, typically by a hash function. The number of partitions is the same as the number of reduce tasks for a job.
• Partitioning: It is used for distributing load horizontally. It is a way of dividing tables into related parts based on the values of columns such as date, city or department.

HCatalog
It is a metadata and table management system for the Hadoop platform which enables storage of data in any format.

Function
• UDF (User Defined Function): A function that takes one or more columns from a row as arguments and returns a single value
• UDTF (User Defined Table-Generating Function): A function that produces multiple columns or rows of output from zero or more inputs, for example by taking a column from a single record and splitting it into multiple rows
• Macros: A function that is defined in terms of other Hive functions
• User defined aggregate functions: A user defined function that takes multiple rows or columns and returns the aggregation of the data

Hive SELECT Command
SELECT [ALL | DISTINCT] select_expr, select_expr, ...

Operations Performed on Hive

Function                            HQL Query
To retrieve information             SELECT from_columns FROM table WHERE conditions;
To select all values                SELECT * FROM table;
To select a particular category     SELECT * FROM table WHERE rec_name = "value";
To select for multiple criteria     SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2";
For selecting specific columns      SELECT column_name FROM table;
To retrieve unique output records   SELECT DISTINCT column_name FROM table;
For sorting                         SELECT col1, col2 FROM table ORDER BY col2;
For sorting backwards               SELECT col1, col2 FROM table ORDER BY col2 DESC;
For counting rows from the table    SELECT COUNT(*) FROM table;
For grouping along with counting    SELECT owner, COUNT(*) FROM table GROUP BY owner;
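The query forms listed for Hive can be tried end to end on a small table. The following is a minimal sketch, not taken from the cheat sheet: the table `pets`, its columns and the sample rows are hypothetical, and the `INSERT ... VALUES` form assumes Hive 0.14 or later.

```sql
-- Hypothetical table, used only to illustrate the query forms.
CREATE TABLE IF NOT EXISTS pets (
  name    STRING,
  owner   STRING,
  species STRING
);

-- Assumption: INSERT ... VALUES is available (Hive 0.14 or later).
INSERT INTO TABLE pets VALUES
  ('Rex',  'alice', 'dog'),
  ('Tom',  'alice', 'cat'),
  ('Nemo', 'bob',   'fish');

-- Select a particular category, then filter on multiple criteria.
SELECT * FROM pets WHERE species = 'dog';
SELECT * FROM pets WHERE owner = 'alice' AND species = 'cat';

-- Unique values of one column.
SELECT DISTINCT owner FROM pets;

-- Sort backwards (descending) on the second selected column.
SELECT name, owner FROM pets ORDER BY owner DESC;

-- Grouping along with counting: one output row per owner.
-- Every non-aggregated column in the SELECT list must appear in GROUP BY.
SELECT owner, COUNT(*) FROM pets GROUP BY owner;
```

Saved to a file, these statements could be run non-interactively with `hive -f`, or from inside the shell with `source`.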
SerDe: Serializer/Deserializer, which gives instructions to Hive on how to process records.

Thrift
A Thrift service is used to provide remote access from other processes.

Meta Store
This is a service which stores metadata information such as table schemas.

SELECT clauses:
• Group by: It uses the list of columns, which specifies how to aggregate the records
• Cluster by, Distribute by, Sort by: Specifies the algorithm to sort, distribute and create clusters, and the order for sorting
• Limit: This specifies how many records are to be retrieved

Hive Data Types
Integral data types:
• Tinyint
• Smallint
• Int
• Bigint
Timestamp: It supports the traditional Unix timestamp with optional nanosecond precision
• Dates
• Decimals

Function                                            Command
To run the non-interactive script                   hive -f script.sql
To run script inside the shell                      source file_name
To run the list command                             dfs -ls /user
To run ls (bash command) from the shell             !ls
To set configuration variables                      set mapred.reduce.tasks=32
Tab auto completion                                 set hive.<TAB>
To display all variables starting with hive         set
To revert all variables                             reset
To add jar files to distributed cache               add jar jar_path
To display all the jars in the distributed cache    list jars
To delete jars from the distributed cache           delete jar jar_name

Data Manipulation Language (DML): These statements are used to retrieve, store, modify, delete, insert and update data in a database.
• Inserting data in a database: The LOAD function is used to move the data into a particular Hive table.
  LOAD DATA [LOCAL] INPATH '<file path>' INTO TABLE <tablename>;
• Drop table: The DROP TABLE statement deletes the data and metadata of the table: DROP TABLE <table name>;
• Aggregation: It is used to count the different categories in the table: SELECT COUNT(DISTINCT category) FROM tablename;
• Grouping: The GROUP BY clause groups the result set by a column so that an aggregate is computed for each group: SELECT <category>, SUM(amount) FROM <txt records> GROUP BY <category>;
• To exit from the Hive shell: Use the command quit;
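The DML statements above can be combined into one short session. This is a sketch under stated assumptions rather than a reference script: the table name `txn_records`, its columns, the comma delimiter and the path `/tmp/txn_records.csv` are all hypothetical.

```sql
-- Hypothetical table; columns and delimiter are assumptions for illustration.
CREATE TABLE IF NOT EXISTS txn_records (
  txn_id   INT,
  category STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- With LOCAL the file is read from the client's file system and copied in.
-- Without LOCAL the path refers to HDFS and the file is moved into the table.
LOAD DATA LOCAL INPATH '/tmp/txn_records.csv' INTO TABLE txn_records;

-- Aggregation: count the distinct categories.
SELECT COUNT(DISTINCT category) FROM txn_records;

-- Grouping: total amount per category.
SELECT category, SUM(amount) FROM txn_records GROUP BY category;

-- For a managed table this removes both the data and the metadata.
DROP TABLE IF EXISTS txn_records;
```

Note that `LOAD DATA` does not parse or validate the file; the rows are only interpreted against the table's schema when they are queried.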
String types:
• VARCHAR: Length 1 to 65535
• CHAR: Length up to 255
Complex types:
• Arrays: Syntax: ARRAY<data_type>
• Maps: Syntax: MAP<primitive_type, data_type>
• Structs: STRUCT<col_name : data_type [COMMENT
Union type: It is a collection of heterogeneous data types.
• Syntax: UNIONTYPE<int, double,

Indexes
Indexes are created to speed up access to columns in the database.
Syntax: CREATE INDEX <INDEX_NAME> ON TABLE <TABLE_NAME>

[Figure labels: User Interface: WEB UI, HIVE COMMAND LINE, HD Insight; Hive Function Meta]

Metadata Functions and Query

Function                         Hive Commands
Selecting a database             USE database;
Listing databases                SHOW DATABASES;
Listing tables in a database     SHOW TABLES;
Describing format of a table     DESCRIBE (FORMATTED|EXTENDED) table;