Session 3.1
Session Topic
3.1 Hadoop Ecosystem: Hive – Architecture – Data types – File formats
3.3 Pig: Features – Anatomy – Pig on Hadoop – Pig Latin overview – Data
types – Running Pig – Execution modes of Pig
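[Figure: Data Analyst query flow; labels: (rowcount, 1), (rowcount, 1), (rowcount, N)]
[Figure: Hive partition directory layout. Table root /hivebase/Sales; sub-directories /country=US and /country=CANADA, then /year= (2012, 2014, 2015), then /month= (11, 12); data files at the leaf level]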
Hive Data Model Contd.
• Buckets
- Data in each partition is divided into buckets
- Bucketing is based on a hash function of the bucketing column
- H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in the partition directory
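The bucket rule above, H(column) mod NumBuckets, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the hash used here is simply the integer key itself, which is an assumption made for readability and not a statement of how Hive hashes every column type.

```python
def bucket_for(key: int, num_buckets: int) -> int:
    """Return the bucket number for an integer key: H(key) mod num_buckets."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    h = key  # stand-in hash: identity on integers (assumption for illustration)
    return h % num_buckets


def assign_buckets(keys, num_buckets):
    """Group keys by bucket number, mimicking one file per bucket."""
    buckets = {b: [] for b in range(num_buckets)}
    for k in keys:
        buckets[bucket_for(k, num_buckets)].append(k)
    return buckets


if __name__ == "__main__":
    # Six hypothetical customer ids spread over three buckets.
    for bucket, members in assign_buckets([1, 2, 3, 4, 5, 6], 3).items():
        print(bucket, members)
```

Because the bucket number depends only on the key and the bucket count, the same key always lands in the same bucket file.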
Hive Data Units
• Databases: The namespace for tables.
• Tables: Sets of records that share a similar schema.
• Partitions: Logical separations of data based on the values of
specific attributes (the partition keys). Once Hive has
partitioned a table on a specified key, it places the records
into the matching folders as and when the
records are inserted.
• Buckets (Clusters): Similar to partitions, but a hash
function segregates the data and determines the cluster or
bucket into which each record is placed.
• In Hive, tables are stored as a folder, partitions are stored as a
sub-directory and buckets are stored as a file.
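The folder / sub-directory / file layout can be sketched as a small path builder. This is an illustrative sketch only: `record_location` is a hypothetical helper, and the `bucket_00003`-style file name is an assumed naming scheme used for readability, not the file name Hive actually writes. The `/hivebase/Sales` root and the `country`, `year`, and `month` keys come from the partition diagram.

```python
from pathlib import PurePosixPath


def record_location(warehouse, table, partitions, bucket=None):
    """Build the directory (and optional bucket file) path for a record.

    warehouse  -- root directory, e.g. "/hivebase"
    table      -- table name; becomes a folder under the warehouse
    partitions -- ordered (column, value) pairs; each becomes a sub-directory
    bucket     -- optional bucket number; becomes a file in the last directory
    """
    path = PurePosixPath(warehouse) / table
    for column, value in partitions:
        path = path / f"{column}={value}"
    if bucket is not None:
        # Hypothetical file name, chosen only for this illustration.
        path = path / f"bucket_{bucket:05d}"
    return str(path)


if __name__ == "__main__":
    parts = [("country", "US"), ("year", 2012), ("month", 12)]
    print(record_location("/hivebase", "Sales", parts))
    print(record_location("/hivebase", "Sales", parts, bucket=3))
```

A query that filters on country, year, and month only needs to read the one matching sub-directory, which is the practical benefit of partitioning.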
Architecture
External Interfaces: CLI, Web UI, and the JDBC and
ODBC programming interfaces