Hadoop - Hive
Disclaimer: This material is protected under copyright by AnalytixLabs ©, 2011-2016. Unauthorized use and/or duplication of this material or any part of this material,
including data, in any form without explicit and written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal action
(A framework for data warehousing on top of Hadoop)
Recall: Hadoop Eco-System – Analytics mapping
[Diagram: Hadoop ecosystem with Apache Oozie (workflow) orchestrating import/export and processing layers across unstructured, semi-structured and structured data]
ZooKeeper is a software project providing an open-source distributed configuration service, synchronization service, and naming registry for large distributed systems
Hive - Overview
Hive-SQL Analytics For Any Data Size
What is Hive?
Apache Hive is data warehouse software that facilitates querying and managing large datasets
residing in distributed storage. Hive is one of the easiest to use of the high-level MapReduce (MR)
frameworks.
Features of Hive
• It is open source (very important!), so it is free
• Data-warehousing tool on top of Hadoop
• Suitable for structured & semi-structured data
• It stores the schema in a database and the processed data in HDFS
• It is designed for OLAP
• It provides an SQL-type query language called HiveQL or HQL
• It is familiar, fast, scalable, and extensible
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analysing and summarizing large amounts of data
• Access to files on various data stores such as HDFS and HBase
What is Hive
• Creates the table schema before loading data into tables
• Hive is batch-oriented and has high latency for query execution
• Database / table / partition / bucket – DDL operations
• SQL types + complex types (ARRAY, MAP, etc.)
• No need to learn Java and the Hadoop APIs
• Abstracts the complexity of Hadoop
• Indexing to provide acceleration; index types include compaction and bitmap indexes as of 0.10, and more index types are planned
• Different storage types such as plain text, RCFile, HBase, ORC, and others
• Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution
• Operates on compressed data stored in the Hadoop ecosystem, using algorithms including DEFLATE, BWT, Snappy, etc.
• Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools; Hive supports extending the UDF set to handle use cases not supported by built-in functions
• SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs
What Hive is not?
Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• It does not use complex indexes, so it does not respond in seconds!
• But it scales very well, and it works with data on the petabyte scale
https://git-wip-us.apache.org/repos/asf?p=hive.git
What is Hive?
What is cool about Hive?
Translates HiveQL statements into a set of MapReduce jobs, which are then executed on a Hadoop cluster
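For illustration (a sketch assuming a hypothetical emp table), the EXPLAIN command exposes the MapReduce plan Hive generates for a query:
hive> EXPLAIN SELECT dept, count(1) FROM emp GROUP BY dept;
-- the output lists the stage plan: a map phase with a TableScan and
-- Group By operator, and a reduce phase with the final aggregation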
Why use Hive
Where to use Hive
• Log processing
• Daily Report
• User Activity Measurement
• Data/Text mining
• Machine learning (Training Data)
• Business intelligence
• Advertising Delivery
• Spam Detection
Hive Installation
$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)
$ export PATH=$HIVE_HOME/bin:$PATH
Using the Hive shell
Accessing Hive from command line
Hive Properties
Interacting with Operating system and HDFS
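As a quick sketch: from the Hive shell, a leading ! runs operating-system commands and the dfs keyword runs HDFS commands, without leaving the CLI:
hive> !pwd;
hive> dfs -ls /user/hive/warehouse;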
Accessing Hive with HUE
Interacting with Hive Server-2 – Hive as service
Hive, Map-Reduce and Local-Mode
Hive also supports a mode to run map-reduce jobs in local-mode automatically. The
relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max,
and hive.exec.mode.local.auto.tasks.max:
hive> SET mapreduce.framework.name=local;
hive> SET hive.exec.mode.local.auto=true;
Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-
reduce job in a query and may run it locally if the following thresholds are satisfied:
The total input size of the job is lower
than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by
default). The total number of reduce tasks required is 1 or 0.
So for queries over small data sets, or for queries with multiple map-reduce jobs where the
input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior
job), jobs may be run locally.
Hive-logs
Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The
default logging level is WARN for Hive releases prior to 0.13.0. Starting with Hive 0.13.0,
the default logging level is INFO.
The logs are stored in the directory /tmp/<user.name>:
/tmp/<user.name>/hive.log
Note: In local mode, prior to Hive 0.13.0 the log file name was ".log" instead of "hive.log".
This bug was fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676).
To configure a different log location, set hive.log.dir in $HIVE_HOME/conf/hive-log4j.properties. Make sure the directory has the sticky bit set (chmod 1777 <dir>).
hive.log.dir=<other_location>
Hive Components - Architecture
Hive Components
Hadoop: Hive needs Hadoop as a base framework to operate.
Driver: Hive has its own driver to communicate with the Hadoop world. It is the component that manages the lifecycle of a HiveQL statement as it moves through Hive. The driver also maintains a session handle and any session statistics.
CLI: The Hive CLI is the console for firing Hive queries. The CLI is used for operating on our data.
MetaStore: The metastore is Hive's metadata repository, which stores all the structural information of the various tables/partitions in Hive (the database catalog).
Thrift Server (HiveServer): The component that provides a Thrift interface and a JDBC/ODBC server, and provides a way of integrating Hive with other applications.
Hive components - Architecture
Hive-The SQL interface to Hadoop
How Hive process the data?
Hive-Reliable SQL Processing at a scale
Hive Architecture
• Internal Components
• Compiler and Planner
• The component that compiles HiveQL into a directed acyclic graph of map/reduce tasks.
• Optimizer
• Consists of a chain of transformations such that the operator DAG resulting from one transformation is passed as input to the next transformation.
• Performs tasks like column pruning, partition pruning, and repartitioning of data.
• Execution Engine
• The component that executes the tasks produced by the compiler in proper dependency order. The execution engine interacts with the underlying Hadoop instance.
HCatalog
• Enables Java MapReduce computation on data mapped to Hive tables
• HCatalog is a metadata abstraction layer for files stored in HDFS and makes it easy for different components to process data stored in HDFS.
• The HCatalog abstraction is based on a tabular data model and augments structure, location, storage format and other metadata information for the datasets stored in HDFS.
• With HCatalog we can use data processing tools such as Pig, Java MapReduce and others to read and write data to Hive tables without worrying about the structure, storage format or storage location of the data.
How Hive loads and stores data?
Hive Query Language (HiveQL)
• HiveQL does not strictly follow the full SQL-92 standard.
• HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT (see the sketch after this list), but only offers basic support for indexes.
• HiveQL lacks support for transactions and materialized views, and offers only limited subquery support.
• Support for INSERT, UPDATE, and DELETE with full ACID functionality was made available with release 0.14.
• Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
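A brief sketch of the two extensions mentioned above (emp, staging and the target tables are hypothetical; multi-table insert targets must already exist with matching schemas):
hive> -- CREATE TABLE AS SELECT (CTAS)
hive> CREATE TABLE emp_high_sal AS SELECT id, name, sal FROM emp WHERE sal > 50000;
hive> -- multi-table insert: one scan of the source feeds several targets
hive> FROM staging s
    > INSERT OVERWRITE TABLE emp_names SELECT s.id, s.name
    > INSERT OVERWRITE TABLE emp_salaries SELECT s.id, s.sal;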
HiveQL: topics covered in the Hive Language Manual
• Commands and CLIs: Commands, Hive CLI (old), Beeline CLI (new), Variable Substitution, HCatalog CLI
• File Formats: Avro Files, ORC Files, Parquet, Compressed Data Storage, LZO Compression
• Data Types
• Data Definition Statements: DDL Statements, Bucketed Tables, Statistics (Analyze and Describe), Indexes, Archiving
• Data Manipulation Statements: DML (Load, Insert, Update, Delete), Import/Export
• Data Retrieval (Queries): Select, Group By, Sort/Distribute/Cluster/Order By, Transform and Map-Reduce Scripts, Operators and User-Defined Functions (UDFs), XPath-specific Functions, Joins, Join Optimization, Union, Lateral View, Sub Queries, Sampling, Virtual Columns, Windowing and Analytics Functions, Enhanced Aggregation, Cube, Grouping and Rollup
• Procedural Language: Hive HPL/SQL
• Explain Execution Plan
• Locks
• Authorization: Storage Based Authorization, SQL Standard Based Authorization, Hive Default Authorization (Legacy Mode)
• Configuration Properties
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Hive vs. SQL
HIVE | SQL
Hive is a SQL-like scripting language built on MapReduce | According to ANSI, SQL is the standard language for RDBMS, used to communicate with databases
Used for analytics | Used for transactional processing (OLTP) & analytics
Data per query in PBs | Data per query in GBs
Faster execution while performing analytics on huge data sets, compared to SQL | Slower execution while performing analytics on huge data sets, compared to Hive
No normalization required | Supports normalization
Hive Data types
Hive Simple Data types
Hive Complex Data types
Physical Layout
• Warehouse directory in HDFS
• E.g., /user/hive/warehouse
• Tables stored in subdirectories of warehouse
• Partitions form subdirectories of tables
• Actual data stored in flat files
• Control char-delimited text, or SequenceFiles
• With custom SerDe, can use arbitrary format
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
Quick Examples
Creating a Database:
hive> CREATE DATABASE EMPdb;
hive> CREATE DATABASE IF NOT EXISTS EMPdb;
Listing Databases:
hive> SHOW DATABASES;
Using a Database:
hive> USE EMPdb;
Creating a Table:
hive> CREATE TABLE emp (id INT, name STRING, sal FLOAT)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;   -- other options: SEQUENCEFILE, RCFILE
Listing Tables:
hive> SHOW TABLES;   or   hive> SHOW TABLES IN EMPdb;
Describing the Schema of a Table (DESCRIBE FORMATTED shows more details):
hive> DESCRIBE emp_table;
hive> DESCRIBE FORMATTED emp_table;
Loading a File from the Local File System:
hive> LOAD DATA LOCAL INPATH '<filename>' INTO TABLE <tablename>;
Loading a File from HDFS:
hive> LOAD DATA INPATH '<filename>' INTO TABLE <tablename>;
Showing Table Contents:
hive> SELECT * FROM emp;
Showing the Execution Plan:
hive> EXPLAIN SELECT * FROM emp;
Renaming a Table:
hive> ALTER TABLE emp RENAME TO emp_table;
Adding New Columns to an Existing Table:
hive> ALTER TABLE emp_table ADD COLUMNS (yoj DATE);
Truncating a Table:
hive> TRUNCATE TABLE emp_table;
Dropping a Database:
hive> DROP DATABASE EMPdb;
Examples – Combining query results with UNION ALL
Examples – Sub Queries in Hive
Examples – Joins in Hive
Examples – Joins Syntax in Hive
Examples – Using an outer join to find unmatched
entries
Examples – Left Semi Join
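For illustration (a sketch with hypothetical emp and dept tables): an outer join finds unmatched entries, while LEFT SEMI JOIN is Hive's efficient form of an IN/EXISTS check and may only select columns from the left table:
hive> -- employees with no matching department (unmatched entries)
hive> SELECT e.* FROM emp e LEFT OUTER JOIN dept d ON (e.dept_id = d.id)
    > WHERE d.id IS NULL;
hive> -- employees whose department exists (left semi join)
hive> SELECT e.* FROM emp e LEFT SEMI JOIN dept d ON (e.dept_id = d.id);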
Examples – Creating tables with complex column types
Examples – Row format Example for complex types
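An illustrative sketch (the employees table is hypothetical): complex column types combined with a row format that sets delimiters for fields, collection items and map keys:
hive> CREATE TABLE employees (
    >   name STRING,
    >   salary FLOAT,
    >   subordinates ARRAY<STRING>,
    >   deductions MAP<STRING, FLOAT>,
    >   address STRUCT<street:STRING, city:STRING, zip:INT>)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ','
    > MAP KEYS TERMINATED BY ':'
    > LINES TERMINATED BY '\n';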
Data validation in Hive
Hive is not a Traditional Database
• Traditional solution to all RDBMS problems:
–Put an index on it!
Partitions
• To increase performance, Hive has the capability to partition data
• The values of the partitioned column divide a table into segments
• Entire partitions can be ignored at query time
• Similar to relational database indexes, but not as granular
• Partitions have to be properly created by users; when inserting data you must specify a partition
• At query time, whenever appropriate, Hive automatically filters out partitions
• There is no difference in schema between partition columns and data columns
• Partitions are physically stored under separate directories
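A minimal sketch (hypothetical logs table partitioned by date):
hive> CREATE TABLE logs (ip STRING, url STRING, ts STRING)
    > PARTITIONED BY (dt STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- each distinct dt value becomes its own subdirectory, e.g.
-- /user/hive/warehouse/logs/dt=2016-01-01/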
Querying Partitioned Table
• There is no difference in syntax
• When the partitioned column is specified in the WHERE clause, entire directories/partitions can be ignored
• When partitioning you will use one or more virtual columns
• Virtual columns cause directories to be created in HDFS; files for that partition are stored within that subdirectory
Loading Data with Virtual Columns
• By default at least one virtual column must be hardcoded
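For example (a sketch, reusing the hypothetical logs table above): the static partition value is hardcoded in the PARTITION clause at load time:
hive> LOAD DATA LOCAL INPATH '/tmp/logs_2016-01-01.tsv'
    > INTO TABLE logs PARTITION (dt='2016-01-01');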
Controlling data locality with Hive
Bucketing:
– Hash partition values into a configurable number of buckets.
– Usually coupled with sorting.
Skews:
– Split values out into separate files.
– Used when certain values are frequently seen.
Replication Factor:
– Increase replication factor to accelerate reads.
– Controlled at the HDFS layer.
Sorting:
– Sort the values within given columns.
– Greatly accelerates query when used with ORCFile filter pushdown.
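A short sketch of bucketing coupled with sorting (hypothetical users and users_staging tables; 32 buckets hashed on id):
hive> CREATE TABLE users (id INT, name STRING)
    > CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;
hive> SET hive.enforce.bucketing=true;
hive> -- on pre-2.0 releases, the setting above makes inserts produce one file per bucket
hive> INSERT OVERWRITE TABLE users SELECT id, name FROM users_staging;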
Guidelines for Architecting Hive Data
SQL Coverage – SQL-92 with extensions
SQL Datatypes | SQL Semantics
INT | SELECT, LOAD, INSERT from query
TINYINT/SMALLINT/BIGINT | Expressions in WHERE and HAVING
BOOLEAN | GROUP BY, ORDER BY, SORT BY
FLOAT | CLUSTER BY, DISTRIBUTE BY
DOUBLE | Sub-queries in FROM clause
STRING | ROLLUP and CUBE
BINARY | UNION
TIMESTAMP | LEFT, RIGHT and FULL INNER/OUTER JOIN
ARRAY, MAP, STRUCT, UNION | CROSS JOIN, LEFT SEMI JOIN
DECIMAL | Windowing functions (OVER, RANK, etc.)
CHAR (from Hive 0.13.0) | Sub-queries for IN/NOT IN, HAVING
VARCHAR (from Hive 0.12.0) | EXISTS / NOT EXISTS
DATE (from Hive 0.12.0) |
Loading data into Hive
Loading Data in Hive
Small size:
• Hive LOAD
  • Loads files from HDFS or the local filesystem.
  • The file format must agree with the table format.
• Insert from query
  • CREATE TABLE AS SELECT or INSERT INTO.
• WebHDFS + WebHCat
  • Load data via REST APIs.
Bulk:
• Sqoop (SQl to hadOOP), Apache license
  • Data transfer from an external RDBMS to Hive.
  • Sqoop can load data directly to/from HCatalog.
• Talend – community version
• SyncSort – commercial version
SerDes map JSON, XML and other formats natively into Hive.
Security: Hive Authorization
• Hive provides Users, Groups, Roles and Privileges
• Granular permissions on tables, DDL and DML operations.
• Not designed for high security:
1. On a non-kerberized cluster, it is up to the client to supply their user name.
2. Suitable for preventing accidental data loss.
HiveServer2
• HiveServer2 is a gateway / JDBC / ODBC endpoint that Hive clients can talk to.
• Supports secure and non-secure clusters.
• DoAs support allows Hive query to run as the requester.
• (Coming Soon) LDAP authentication.
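For example (host, port and user are placeholders), the Beeline client connects to HiveServer2 over JDBC:
$ beeline -u jdbc:hive2://localhost:10000 -n <user>
0: jdbc:hive2://localhost:10000> SELECT * FROM emp LIMIT 10;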
Parameterization of Hive Queries
Parameterized queries
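A brief sketch of variable substitution (script name and parameter are hypothetical):
$ hive --hivevar run_date='2016-01-01' -f daily_report.hql
-- inside daily_report.hql:
SELECT * FROM logs WHERE dt = '${hivevar:run_date}';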
Processing Data with external Scripts
Process Data using External Scripts
Data Input and Output with TRANSFORM
Hive TRANSFORM example
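A minimal TRANSFORM sketch (parse.py is a hypothetical script that reads tab-separated rows from stdin and writes tab-separated rows to stdout):
hive> ADD FILE /home/user/parse.py;
hive> SELECT TRANSFORM (ip, url)
    > USING 'python parse.py'
    > AS (domain STRING, hits INT)
    > FROM logs;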
User Defined Functions(UDF)
Hive Built-in Functions
Overview of User-Defined Functions (UDFs)
Developing Hive UDFs
Example: Usage of UDF in Hive
User-Defined Functions (UDF)
• 1 input to 1 output
• Typically used in SELECT
• SELECT concat(first, ' ', last) AS full_name…
• See the Hive language wiki for the full list of built-in UDFs
• http://wiki.apache.org/hadoop/Hive/LanguageManual
• Noteworthy features
• Sometimes you want to cast
• SELECT CAST(5.0/2.0 AS INT)…
• Conditional functions
• SELECT IF(boolean, if_true, if_not_true)…
User Defined Aggregate Functions (UDAF)
• N inputs to 1 output
• Typically used with GROUP BY
• SELECT count(1) FROM … GROUP BY age
• SELECT count(DISTINCT first_name) FROM … GROUP BY last_name…
• sum(), avg(), min(), max()
• For skew
• set hive.groupby.skewindata = true;
• set hive.map.aggr.hash.percentmemory = <some lower value>;
User Defined Table-Generating Functions (UDTF)
• 1 input to N outputs
• explode(Array<?> arg)
• Converts an array into multiple rows, with one element per row
• Transform-like syntax
• SELECT udtf(col0, col1, …) AS colAlias FROM srcTable
• Lateral view syntax
• …FROM baseTable LATERAL VIEW udtf(col0, col1…) tableAlias AS colAlias
• Also see: http://bit.ly/hive-udtf
Summary: UDF vs. UDAF vs. UDTF
• User Defined Functions
• One-to-One mapping
• concat(“firstname”, “lastname”)
• User Defined Aggregate Functions
• Many-to-one mapping
• Sum(num_ads)
• User Defined Table-generating Functions
• One-to-many mapping
• explode([1,2,3])
Interfaces to write UDF
• There are two different interfaces you can use for writing UDFs for Apache Hive. One is really simple; the other… not so much. The simple one is org.apache.hadoop.hive.ql.exec.UDF.
• However, if you plan on writing a UDF that can manipulate embedded data structures, such as Map, List, and Set, then you're stuck using org.apache.hadoop.hive.ql.udf.generic.GenericUDF, which is a little more involved.
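Once the UDF is compiled into a jar, it is registered from the Hive shell; a sketch with hypothetical jar and class names:
hive> ADD JAR /home/user/my-udfs.jar;
hive> CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
hive> SELECT my_lower(name) FROM emp;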
SerDe – Serialization/Deserialization
Serialization and De-serialization in Hive(SerDe)
• SerDe is short for serialization/deserialization. It controls the format
of a row.
• Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization, and also interprets the result of serialization as individual fields for processing.
• A SerDe allows Hive to read in data from a table, and write it back out
to HDFS in any custom format. Anyone can write their own SerDe for
their own data formats.
Serialization and Deserialization in Hive
• The default SerDe is org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
• In some situations, the interface used for deserialization is LazySerDe.
• Unstructured data gets converted into structured data thanks to the flexibility of the LazySerDe interface.
• While using the LazySerDe interface, data is read based on separation by different delimiter characters.
• The contrib SerDe implementations (such as RegexSerDe) are located in 'hive-contrib.jar'.
Hive SerDe’s
SerDe Examples
• CREATE TABLE mylog (
user_id BIGINT,
page_url STRING,
unix_time INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
• A custom SerDe is useful when a user has data with a special serialized format not supported by Hive yet, and the user does not want to convert the data before loading it into Hive.
Adding a Custom SerDe to Hive
Using SerDe’s in Hive
How to add a new SerDe for text data
• Follow the example in
contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
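A sketch of a table backed by RegexSerDe (the table and the regex are simplified illustrations; each capture group maps to one column):
hive> ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
hive> CREATE TABLE access_log (host STRING, request STRING, status STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    > WITH SERDEPROPERTIES (
    >   "input.regex" = "([^ ]*) \"([^\"]*)\" ([0-9]*)"
    > )
    > STORED AS TEXTFILE;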
Text Processing in Hive
Text Processing
Basic String Functions
Parsing URLs with Hive
Numeric format functions
Splitting and Combining strings
Converting Array to Records with Explode
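For instance (a sketch with a hypothetical docs table holding one line of text per row), split() turns each line into an array and explode() emits one row per element:
hive> SELECT explode(split(line, ' ')) AS word FROM docs;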
Regular expressions
Hive’s Regular expressions Functions
Regex SerDe
Creating Table with Regex serDe
Creating Table with Regex serDe
Fixed width formats in Hive
Fixed width formats example
Parsing Sentences into words
Sentiment Analysis
n-grams
Calculating n-grams in Hive
Finding specific n-grams in text
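A sketch of the built-in n-gram estimators (hypothetical reviews table with a comment column): sentences() tokenizes free text, ngrams() returns the top-k n-grams, and context_ngrams() finds the words that most often fill the NULL slots of a given context:
hive> SELECT ngrams(sentences(lower(comment)), 2, 10) FROM reviews;
hive> SELECT context_ngrams(sentences(lower(comment)),
    >        array('i', 'love', null), 10) FROM reviews;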
Calculating data for histograms
File Formats
Hive Persistence Formats
Built-in formats:
– ORCFile – Optimized Row Columnar
– RCFile – Record Columnar File
– Avro – a data serialization system; Avro schemas are defined with JSON, which facilitates implementation in languages that already have JSON libraries
– Delimited Text / Text Files
– Regular Expression – a sequence of symbols and characters expressing a string or pattern to be searched for within a longer piece of text
– S3 Logfile – server access log format
– Typed Bytes
3rd-party add-ons:
– JSON – JavaScript Object Notation
– XML
High performance:
– Inline indexes record value ranges within blocks of ORCFile data.
– Filter pushdown allows efficient scanning during precise queries.
SAMPLE CODES
Hive Configuration
• Default configuration file “hive-site.xml”. We can overwrite using the
following command
$ hive –config /user/lib/hive/hive-conf
• Hive also permits to set the connection properties per session basis
as follows
$ hive –hiveconf fs.defaultFS=hdfs://localhost –hiveconf
mapreduce.framework.name=yarn \
--hiveconf yarn.resourncemanager.address=localhost:8032
Using Apache Tez as the execution engine for Hive
• The default execution engine is MapReduce; we can change to the newer execution framework, Tez.
• Tez is an execution framework built on top of YARN which provides a lower-level API (directed acyclic graphs) than MapReduce. Tez is more flexible and powerful than MapReduce.
• Tez allows applications to improve performance by utilizing more expressive execution patterns than the MapReduce pattern. Hive supports the Tez execution engine as a substitute for the background MapReduce computations.
• Hive converts the Hive queries into Tez execution graphs.
• You can instruct Hive to use Tez (or switch back to MapReduce) as follows:
hive> set hive.execution.engine=tez;
hive> set hive.execution.engine=mr;
To check the current setting, use set as follows:
hive> set hive.execution.engine;
hive> set;     -- lists all the properties set by Hive
(Change settings within a session using the "set" command)
Create database & tables
Load Data - Queries
Hive provides several operators for ordering query results, with subtle differences and performance trade-offs:
ORDER BY: guarantees global ordering of the data using a single reducer
SORT BY: guarantees local ordering of the data output by each reduce task
CLUSTER BY: distributes the data to reduce tasks, avoiding any range overlap, and each reduce task outputs the data in sorted order
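A brief sketch of the difference (hypothetical txnrecords table):
hive> SELECT * FROM txnrecords ORDER BY amount;    -- one reducer, total order
hive> SELECT * FROM txnrecords SORT BY amount;     -- sorted within each reducer only
hive> SELECT * FROM txnrecords CLUSTER BY custid;  -- same as DISTRIBUTE BY custid SORT BY custid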
Schema Violations
• What would happen if we try to insert data that does not comply with the pre-defined schema?
• NULL is set for any value that violates the pre-defined schema
Managing outputs
Inserting Output into another table
hive> INSERT OVERWRITE TABLE results SELECT * FROM txnrecords;
Hive Script – Running Batch commands
Hive scripts are used to execute Hive commands collectively from the terminal. This helps reduce the time and effort invested in writing and executing each command manually.
Steps:
1. Create a file listing all the commands to run, and save it with the extension .hql or .hive
2. Run the command below to execute the file from the terminal:
hive -f <path of the .hql/.hive file>   (e.g.: hive -f /home/user/Desktop/demo.hive)
The script runs all the queries and saves the output in the respective places (wherever you have directed it)
Appendix
Brief About Data Warehouse
• OLAP vs OLTP
• A DW is needed for OLAP
• We want reports and summaries, not the live transactional data used to keep operations running
• We need reports to make operations better, not to conduct the operations themselves!
• We use ETL to populate data in the DW
• Kimball believes in creating several smaller data marts to achieve department-level analysis and reporting.
William Inmon: Father of the Data Warehouse
Ralph Kimball: Father of Business Intelligence
Brief About Data Warehouse (Inmon approach vs Kimball approach)
Brief About Data Warehouse
• Other keywords
• ODS- Operational Data Store
• Star Schema & Snowflake schema
• Fact Tables
• Data Mart
• Dimensions
• Concurrent ETLs
SQL vs. HiveQL(HQL)
Hadoop cluster is not a database server
Hive Vs. Traditional RDBMS
“Schema on read” Accelerates Data Innovation
[Cartoon: with schema-on-write, every question ("Let me see… is it any good?") triggers a new "schema change" project and fresh data collecting before analysis can start; with schema-on-read, the analyst goes straight from the same question to "My model is awesome!"]
Group By
• Like SQL, HiveQL provides support for the "group by" command, in which multiple rows in a table can be collected into groups with the same
values in specific columns. Once the groups are formed, the columns being used in the group by can be accessed as usual in the select statement.
The columns that were not used in the group by should be accessed through aggregation functions, like average or sum. If a column that is not
part of the group by is accessed as a single element in the select statement, this will cause Hive to throw an error. Thankfully, the error displayed
by Hive will alert the user to which column is being used incorrectly. The issue can be resolved by ensuring that all columns in the select
statement are either part of the group by and accessed as a single element, or not part of the group by and accessed through aggregation.
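For instance (hypothetical emp table), every column in the select must either appear in the group by or be aggregated:
hive> SELECT dept, avg(sal), count(1) FROM emp GROUP BY dept;  -- OK
hive> SELECT dept, name FROM emp GROUP BY dept;                -- error: name is neither grouped nor aggregated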
Collect Set
• One additional thing to consider when using group by is the existence of a special aggregation function: collect_set. This allows a column not
used in the group by to be aggregated into a set. The values in the set are accessible using normal array-like syntax and can be used the same way
as any column in the original table. The elements in the array will not be sorted, so any ordering will need to happen through the use of a user
defined function. Additionally, Hive will not evaluate an expression to calculate an index.
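A collect_set sketch (same hypothetical emp table): the non-grouped column is gathered into an array per group and read back with array syntax:
hive> SELECT dept, collect_set(name) AS names FROM emp GROUP BY dept;
hive> SELECT dept, collect_set(name)[0] FROM emp GROUP BY dept;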
Limit
• One effective way to understand the results of a query is to run it and use the "limit" function to view a small subset of the results. It is important
to keep in mind that unless the "sort by" command is used as part of the query, the result will not be ordered in any way.
Hive Vs. SQL
• The main difference between RDBMS databases and Hive is specialization. While MySQL is a general-purpose database suited both to transactional processing (OLTP) and to analytics (OLAP), Hive is built for analytics only. Technically, the main difference is the lack of update/delete functionality: data can only be added and selected. At the same time, Hive is capable of processing data volumes that cannot be processed by MySQL or other conventional RDBMS (on a modest budget).
• MPP (massively parallel processing) databases are closest to Hive in functionality: while they have full SQL support, they are scalable up to hundreds of computers. Another serious difference is the query language.
• Hive does not support full SQL, even in SELECT, because of its implementation. The main limitation is the lack of joins on any condition other than equality. Hive's query syntax is also a bit different, so you cannot connect report-generation software directly to Hive.
How to improve efficiency?
Performance Question
• Which of the following is faster?
• select count(distinct(Col)) from Tbl
• select count(*) from (select distinct(Col) from Tbl) t
Count distinct
Answer
Surprisingly the second is usually faster
• In the first case:
• Maps send each value to the reduce
• Single reduce counts them all
• In the second case:
• Maps split up the values to many reduces
• Each reduce generates its list
• Final job counts the size of each list
• Singleton reduces are almost always BAD
Going Fast in Hadoop
Hadoop:
• Really good at coordinated sequential scans.
• No random I/O. Traditional index pretty much useless.
Skipping data:
• Divide data among different files which can be pruned out.
• Partitions, buckets and skews.
• Skip records during scans using small embedded indexes.
• Automatic when you use ORCFile format.
• Sort data ahead of time.
• Simplifies joins and skipping becomes more effective.
Data Layout Considerations for Fast Hive
Hive Fast Query Check List
For Even More Performance
Future Trends?
Hive – Hivemall
Hive – Shark
Shark is a large-scale data warehouse system for Spark designed to be compatible
with Apache Hive.
It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries.
Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.
Further Reading
• Apache Drill
• Software framework that supports data-intensive, distributed applications for interactive analysis of large-scale datasets
• Pig
• Platform for creating and running MapReduce programs on Hadoop
• Oracle Big Data
• DB2 10 and InfoSphere Warehouse
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE
References
• https://www.facebook.com/note.php?note_id=89508453919
• https://github.com/facebook/scribe
• http://sqoop.apache.org/docs/
• http://flume.apache.org/FlumeDeveloperGuide.html
• Sqoop Database Import for Hadoop, Cloudera, Oct. 2009
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• http://www.semantikoz.com/blog/the-free-apache-hive-book/
• Beginning Microsoft® SQL Server® 2012 Programming, Wiley, Paul Atkinson and Robert Vieira, ISBN: 978-1-118-10228-2
• Hive – A Petabyte Scale Data Warehouse Using Hadoop, Facebook team, 2009
Contact Us
Visit us on: http://www.analytixlabs.in/
Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/