Hadoop - Hive
Disclaimer: This material is protected under copyright by AnalytixLabs ©, 2011-2016. Unauthorized use and/or duplication of this material or any part of this material,
including data, in any form without explicit and written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal action
(A framework for data warehousing on top of Hadoop)
Recall: Hadoop Eco-System – Analytics mapping
[Diagram: Hadoop ecosystem with Apache Oozie (workflow) orchestrating import/export and processing layers across unstructured, semi-structured and structured data]
ZooKeeper is a software project providing an open-source distributed configuration service, synchronization service, and naming registry for large distributed systems
Hive - Overview
Hive-SQL Analytics For Any Data Size
What is Hive?
Apache Hive is data warehouse software that facilitates querying and managing large datasets
residing in distributed storage. Hive is one of the easiest to use of the high-level MapReduce (MR)
frameworks.
Features of Hive
• It is open source (very important!), so it is free
• Data-warehousing tool on top of Hadoop
• Suitable for structured & semi-structured data
• It stores the schema in a database and the processed data in HDFS
• It is designed for OLAP
• It provides an SQL-type query language called HiveQL or HQL
• It is familiar, fast, scalable, and extensible
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analysing and summarizing large amounts of data
• Access to files on various data stores such as HDFS and HBase
What is Hive
• Creates the table schema before loading data into tables
• Hive is batch-oriented and has high latency for query execution
• Database / table / partition / bucket – DDL operations
• SQL types + complex types (ARRAY, MAP, etc.)
• No need to learn Java and the Hadoop APIs
• Abstracts the complexity of Hadoop
• Indexing to provide acceleration; index types include compaction and bitmap indexes as of 0.10, and more index types are planned
• Different storage types such as plain text, RCFile, HBase, ORC, and others
• Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution
• Operates on compressed data stored in the Hadoop ecosystem, using algorithms including DEFLATE, BWT, Snappy, etc.
• Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools; Hive supports extending the UDF set to handle use cases not supported by built-in functions
• SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs
What Hive is not?
Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• It does not use complex indexes, so it does not respond in seconds!
• But it scales very well, and it works with data on the petabyte scale
https://git-wip-us.apache.org/repos/asf?p=hive.git
What is Hive?
What is cool about Hive?
Translates HiveQL statements into a set of MapReduce jobs, which are then executed on a Hadoop cluster
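For illustration (a sketch assuming a hypothetical emp table), the EXPLAIN command exposes the MapReduce plan Hive generates for a query:
hive> EXPLAIN SELECT dept, count(1) FROM emp GROUP BY dept;
-- the output lists the stage plan: a map phase with a TableScan and
-- Group By operator, and a reduce phase with the final aggregation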
Why use Hive
Where to use Hive
• Log processing
• Daily Report
• User Activity Measurement
• Data/Text mining
• Machine learning (Training Data)
• Business intelligence
• Advertising Delivery
• Spam Detection
Hive Installation
$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)
$ export PATH=$HIVE_HOME/bin:$PATH
Using the Hive shell
Accessing Hive from command line
Hive Properties
Interacting with Operating system and HDFS
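As a quick sketch: from the Hive shell, a leading ! runs operating-system commands and the dfs keyword runs HDFS commands, without leaving the CLI:
hive> !pwd;
hive> dfs -ls /user/hive/warehouse;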
Accessing Hive with HUE
Interacting with Hive Server-2 – Hive as service
Hive, Map-Reduce and Local-Mode
Hive also supports a mode to run map-reduce jobs in local-mode automatically. The
relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max,
and hive.exec.mode.local.auto.tasks.max:
hive> SET mapreduce.framework.name=local;
hive> SET hive.exec.mode.local.auto=true;
Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-
reduce job in a query and may run it locally if the following thresholds are satisfied:
The total input size of the job is lower
than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by
default). The total number of reduce tasks required is 1 or 0.
So for queries over small data sets, or for queries with multiple map-reduce jobs where the
input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior
job), jobs may be run locally.
Hive-logs
Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The
default logging level is WARN for Hive releases prior to 0.13.0. Starting with Hive 0.13.0,
the default logging level is INFO.
The logs are stored in the directory /tmp/<user.name>:
/tmp/<user.name>/hive.log
Note: In local mode, prior to Hive 0.13.0 the log file name was ".log" instead of "hive.log".
This bug was fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676).
To configure a different log location, set hive.log.dir in $HIVE_HOME/conf/hive-log4j.properties. Make sure the directory has the sticky bit set (chmod 1777 <dir>).
hive.log.dir=<other_location>
Hive Components - Architecture
Hive Components
Hadoop: Hive needs Hadoop as a base framework to operate.
Driver: Hive has its own driver to communicate with the Hadoop world. It is the component that manages the lifecycle of a HiveQL statement as it moves through Hive. The driver also maintains a session handle and any session statistics.
CLI: The Hive CLI is the console for firing Hive queries. The CLI is used for operating on our data.
MetaStore: The metastore is Hive's metadata repository, which stores all the structural information of the various tables/partitions in Hive (the database catalog).
Thrift Server (HiveServer): The component that provides a Thrift interface and a JDBC/ODBC server, and provides a way of integrating Hive with other applications.
Hive components - Architecture
Hive-The SQL interface to Hadoop
How Hive process the data?
Hive-Reliable SQL Processing at a scale
Hive Architecture
• Internal Components
• Compiler and Planner
• The component that compiles HiveQL into a directed acyclic graph of map/reduce tasks.
• Optimizer
• Consists of a chain of transformations such that the operator DAG resulting from one transformation is passed as input to the next transformation.
• Performs tasks like column pruning, partition pruning, and repartitioning of data.
• Execution Engine
• The component that executes the tasks produced by the compiler in proper dependency order. The execution engine interacts with the underlying Hadoop instance.
HCatalog
• Enables Java MapReduce computation on data mapped to Hive tables
• HCatalog is a metadata abstraction layer for files stored in HDFS and makes it easy for different components to process data stored in HDFS.
• The HCatalog abstraction is based on a tabular data model and augments structure, location, storage format and other metadata information for the datasets stored in HDFS.
• With HCatalog we can use data processing tools such as Pig, Java MapReduce and others to read and write data to Hive tables without worrying about the structure, storage format or storage location of the data.
How Hive loads and stores data?
Hive Query Language (HiveQL)
• HiveQL does not strictly follow the full SQL-92 standard.
• HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT (see the sketch after this list), but only offers basic support for indexes.
• HiveQL lacks support for transactions and materialized views, and offers only limited subquery support.
• Support for INSERT, UPDATE, and DELETE with full ACID functionality was made available with release 0.14.
• Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
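A brief sketch of the two extensions mentioned above (emp, staging and the target tables are hypothetical; multi-table insert targets must already exist with matching schemas):
hive> -- CREATE TABLE AS SELECT (CTAS)
hive> CREATE TABLE emp_high_sal AS SELECT id, name, sal FROM emp WHERE sal > 50000;
hive> -- multi-table insert: one scan of the source feeds several targets
hive> FROM staging s
    > INSERT OVERWRITE TABLE emp_names SELECT s.id, s.name
    > INSERT OVERWRITE TABLE emp_salaries SELECT s.id, s.sal;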
HiveQL: topics covered in the Hive Language Manual
• Commands and CLIs: Commands, Hive CLI (old), Beeline CLI (new), Variable Substitution, HCatalog CLI
• File Formats: Avro Files, ORC Files, Parquet, Compressed Data Storage, LZO Compression
• Data Types
• Data Definition Statements: DDL Statements, Bucketed Tables, Statistics (Analyze and Describe), Indexes, Archiving
• Data Manipulation Statements: DML (Load, Insert, Update, Delete), Import/Export
• Data Retrieval (Queries): Select, Group By, Sort/Distribute/Cluster/Order By, Transform and Map-Reduce Scripts, Operators and User-Defined Functions (UDFs), XPath-specific Functions, Joins, Join Optimization, Union, Lateral View, Sub Queries, Sampling, Virtual Columns, Windowing and Analytics Functions, Enhanced Aggregation, Cube, Grouping and Rollup
• Procedural Language: Hive HPL/SQL
• Explain Execution Plan
• Locks
• Authorization: Storage Based Authorization, SQL Standard Based Authorization, Hive Default Authorization (Legacy Mode)
• Configuration Properties
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Hive vs. SQL
HIVE | SQL
Hive is a SQL-like scripting language built on MapReduce | According to ANSI, SQL is the standard language for RDBMS, used to communicate with databases
Used for analytics | Used for transactional processing (OLTP) & analytics
Data per query in PBs | Data per query in GBs
Faster execution while performing analytics on huge data sets, compared to SQL | Slower execution while performing analytics on huge data sets, compared to Hive
No normalization required | Supports normalization
Hive Data types
Hive Simple Data types
Hive Complex Data types
Physical Layout
• Warehouse directory in HDFS
• E.g., /user/hive/warehouse
• Tables stored in subdirectories of warehouse
• Partitions form subdirectories of tables
• Actual data stored in flat files
• Control char-delimited text, or SequenceFiles
• With custom SerDe, can use arbitrary format
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
Quick Examples
Creating a Database:
hive> CREATE DATABASE EMPdb;
hive> CREATE DATABASE IF NOT EXISTS EMPdb;
Listing Databases:
hive> SHOW DATABASES;
Using a Database:
hive> USE EMPdb;
Creating a Table:
hive> CREATE TABLE emp (id INT, name STRING, sal FLOAT)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;   -- other options: SEQUENCEFILE, RCFILE
Listing Tables:
hive> SHOW TABLES;   or   hive> SHOW TABLES IN EMPdb;
Describing the Schema of a Table (DESCRIBE FORMATTED shows more details):
hive> DESCRIBE emp_table;
hive> DESCRIBE FORMATTED emp_table;
Loading a File from the Local File System:
hive> LOAD DATA LOCAL INPATH '<filename>' INTO TABLE <tablename>;
Loading a File from HDFS:
hive> LOAD DATA INPATH '<filename>' INTO TABLE <tablename>;
Showing Table Contents:
hive> SELECT * FROM emp;
Showing the Execution Plan:
hive> EXPLAIN SELECT * FROM emp;
Renaming a Table:
hive> ALTER TABLE emp RENAME TO emp_table;
Adding New Columns to an Existing Table:
hive> ALTER TABLE emp_table ADD COLUMNS (yoj DATE);
Truncating a Table:
hive> TRUNCATE TABLE emp_table;
Dropping a Database:
hive> DROP DATABASE EMPdb;
Examples – Combining query results with UNION ALL
Examples – Sub Queries in Hive
Examples – Joins in Hive
Examples – Joins Syntax in Hive
Examples – Using an outer join to find unmatched
entries
Examples – Left Semi Join
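For illustration (a sketch with hypothetical emp and dept tables): an outer join finds unmatched entries, while LEFT SEMI JOIN is Hive's efficient form of an IN/EXISTS check and may only select columns from the left table:
hive> -- employees with no matching department (unmatched entries)
hive> SELECT e.* FROM emp e LEFT OUTER JOIN dept d ON (e.dept_id = d.id)
    > WHERE d.id IS NULL;
hive> -- employees whose department exists (left semi join)
hive> SELECT e.* FROM emp e LEFT SEMI JOIN dept d ON (e.dept_id = d.id);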
Examples – Creating tables with complex column types
Examples – Row format Example for complex types
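An illustrative sketch (the employees table is hypothetical): complex column types combined with a row format that sets delimiters for fields, collection items and map keys:
hive> CREATE TABLE employees (
    >   name STRING,
    >   salary FLOAT,
    >   subordinates ARRAY<STRING>,
    >   deductions MAP<STRING, FLOAT>,
    >   address STRUCT<street:STRING, city:STRING, zip:INT>)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ','
    > MAP KEYS TERMINATED BY ':'
    > LINES TERMINATED BY '\n';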
Data validation in Hive
Hive is not a Traditional Database
• Traditional solution to all RDBMS problems:
–Put an index on it!
Partitions
• To increase performance, Hive has the capability to partition data
• The values of the partitioned column divide a table into segments
• Entire partitions can be ignored at query time
• Similar to relational database indexes, but not as granular
• Partitions have to be properly created by users; when inserting data you must specify a partition
• At query time, whenever appropriate, Hive automatically filters out partitions
• There is no difference in schema between partition columns and data columns
• Partitions are physically stored under separate directories
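A minimal sketch (hypothetical logs table partitioned by date):
hive> CREATE TABLE logs (ip STRING, url STRING, ts STRING)
    > PARTITIONED BY (dt STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- each distinct dt value becomes its own subdirectory, e.g.
-- /user/hive/warehouse/logs/dt=2016-01-01/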
Querying Partitioned Table
• There is no difference in syntax
• When the partitioned column is specified in the WHERE clause, entire directories/partitions can be ignored
• When partitioning you will use one or more virtual columns
• Virtual columns cause directories to be created in HDFS; files for that partition are stored within that subdirectory
Loading Data with Virtual Columns
• By default at least one virtual column must be hardcoded
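For example (a sketch, reusing the hypothetical logs table above): the static partition value is hardcoded in the PARTITION clause at load time:
hive> LOAD DATA LOCAL INPATH '/tmp/logs_2016-01-01.tsv'
    > INTO TABLE logs PARTITION (dt='2016-01-01');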
Controlling data locality with Hive
Bucketing:
– Hash partition values into a configurable number of buckets.
– Usually coupled with sorting.
Skews:
– Split values out into separate files.
– Used when certain values are frequently seen.
Replication Factor:
– Increase replication factor to accelerate reads.
– Controlled at the HDFS layer.
Sorting:
– Sort the values within given columns.
– Greatly accelerates query when used with ORCFile filter pushdown.
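A short sketch of bucketing coupled with sorting (hypothetical users and users_staging tables; 32 buckets hashed on id):
hive> CREATE TABLE users (id INT, name STRING)
    > CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;
hive> SET hive.enforce.bucketing=true;
hive> -- on pre-2.0 releases, the setting above makes inserts produce one file per bucket
hive> INSERT OVERWRITE TABLE users SELECT id, name FROM users_staging;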
Guidelines for Architecting Hive Data
SQL Coverage – SQL-92 with extensions
SQL Datatypes | SQL Semantics
INT | SELECT, LOAD, INSERT from query
TINYINT/SMALLINT/BIGINT | Expressions in WHERE and HAVING
BOOLEAN | GROUP BY, ORDER BY, SORT BY
FLOAT | CLUSTER BY, DISTRIBUTE BY
DOUBLE | Sub-queries in FROM clause
STRING | ROLLUP and CUBE
BINARY | UNION
TIMESTAMP | LEFT, RIGHT and FULL INNER/OUTER JOIN
ARRAY, MAP, STRUCT, UNION | CROSS JOIN, LEFT SEMI JOIN
DECIMAL | Windowing functions (OVER, RANK, etc.)
CHAR (from Hive 0.13.0) | Sub-queries for IN/NOT IN, HAVING
VARCHAR (from Hive 0.12.0) | EXISTS / NOT EXISTS
DATE (from Hive 0.12.0) |
Loading data into Hive
Loading Data in Hive
Small size:
• Hive LOAD
  • Loads files from HDFS or the local filesystem.
  • The file format must agree with the table format.
• Insert from query
  • CREATE TABLE AS SELECT or INSERT INTO.
• WebHDFS + WebHCat
  • Load data via REST APIs.
Bulk:
• Sqoop (SQl to hadOOP), Apache license
  • Data transfer from an external RDBMS to Hive.
  • Sqoop can load data directly to/from HCatalog.
• Talend – community version
• SyncSort – commercial version
SerDes map JSON, XML and other formats natively into Hive.
Security: Hive Authorization
• Hive provides Users, Groups, Roles and Privileges
• Granular permissions on tables, DDL and DML operations.
• Not designed for high security:
1. On a non-kerberized cluster, it is up to the client to supply their user name.
2. Suitable for preventing accidental data loss.
HiveServer2
• HiveServer2 is a gateway / JDBC / ODBC endpoint that Hive clients can talk to.
• Supports secure and non-secure clusters.
• DoAs support allows Hive query to run as the requester.
• (Coming Soon) LDAP authentication.
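For example (host, port and user are placeholders), the Beeline client connects to HiveServer2 over JDBC:
$ beeline -u jdbc:hive2://localhost:10000 -n <user>
0: jdbc:hive2://localhost:10000> SELECT * FROM emp LIMIT 10;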
Parameterization of Hive Queries
Parameterized queries
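A brief sketch of variable substitution (script name and parameter are hypothetical):
$ hive --hivevar run_date='2016-01-01' -f daily_report.hql
-- inside daily_report.hql:
SELECT * FROM logs WHERE dt = '${hivevar:run_date}';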
Processing Data with external Scripts
Process Data using External Scripts
Data Input and Output with TRANSFORM
Hive TRANSFORM example
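A minimal TRANSFORM sketch (parse.py is a hypothetical script that reads tab-separated rows from stdin and writes tab-separated rows to stdout):
hive> ADD FILE /home/user/parse.py;
hive> SELECT TRANSFORM (ip, url)
    > USING 'python parse.py'
    > AS (domain STRING, hits INT)
    > FROM logs;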
User Defined Functions(UDF)
Hive Built-in Functions
Overview of User-Defined Functions (UDFs)
Developing Hive UDFs
Example: Usage of UDF in Hive
User-Defined Functions (UDF)
• 1 input to 1 output
• Typically used in SELECT
• SELECT concat(first, ' ', last) AS full_name…
• See the Hive language wiki for the full list of built-in UDFs
• http://wiki.apache.org/hadoop/Hive/LanguageManual
• Noteworthy features
• Sometimes you want to cast
• SELECT CAST(5.0/2.0 AS INT)…
• Conditional functions
• SELECT IF(boolean, if_true, if_not_true)…
User Defined Aggregate Functions (UDAF)
• N inputs to 1 output
• Typically used with GROUP BY
• SELECT count(1) FROM … GROUP BY age
• SELECT count(DISTINCT first_name) FROM … GROUP BY last_name…
• sum(), avg(), min(), max()
• For skew
• set hive.groupby.skewindata = true;
• set hive.map.aggr.hash.percentmemory = <some lower value>;
User Defined Table-Generating Functions (UDTF)
• 1 input to N outputs
• explode(Array<?> arg)
• Converts an array into multiple rows, with one element per row
• Transform-like syntax
• SELECT udtf(col0, col1, …) AS colAlias FROM srcTable
• Lateral view syntax
• …FROM baseTable LATERAL VIEW udtf(col0, col1…) tableAlias AS colAlias
• Also see: http://bit.ly/hive-udtf
Summary: UDF vs. UDAF vs. UDTF
• User Defined Functions
• One-to-One mapping
• concat(“firstname”, “lastname”)
• User Defined Aggregate Functions
• Many-to-one mapping
• Sum(num_ads)
• User Defined Table-generating Functions
• One-to-many mapping
• explode([1,2,3])
Interfaces to write UDF
• There are two different interfaces you can use for writing UDFs for Apache Hive. One is really simple; the other… not so much. The simple one is org.apache.hadoop.hive.ql.exec.UDF.
• However, if you plan on writing a UDF that can manipulate embedded data structures, such as Map, List, and Set, then you're stuck using org.apache.hadoop.hive.ql.udf.generic.GenericUDF, which is a little more involved.
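Once the UDF is compiled into a jar, it is registered from the Hive shell; a sketch with hypothetical jar and class names:
hive> ADD JAR /home/user/my-udfs.jar;
hive> CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
hive> SELECT my_lower(name) FROM emp;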
SerDe – Serialization/Deserialization
Serialization and De-serialization in Hive(SerDe)
• SerDe is short for serialization/deserialization. It controls the format
of a row.
• Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization, and also interprets the result of serialization as individual fields for processing.
• A SerDe allows Hive to read in data from a table, and write it back out
to HDFS in any custom format. Anyone can write their own SerDe for
their own data formats.
Serialization and Deserialization in Hive
• The default SerDe is org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
• In some situations, the interface used for deserialization is LazySerDe.
• Unstructured data gets converted into structured data thanks to the flexibility of the LazySerDe interface.
• While using the LazySerDe interface, data is read based on separation by different delimiter characters.
• The contrib SerDe implementations (such as RegexSerDe) are located in 'hive-contrib.jar'.
Hive SerDe’s
SerDe Examples
• CREATE TABLE mylog (
user_id BIGINT,
page_url STRING,
unix_time INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
• A custom SerDe is useful when a user has data with a special serialized format not supported by Hive yet, and the user does not want to convert the data before loading it into Hive.
Adding a Custom SerDe to Hive
Using SerDe’s in Hive
How to add a new SerDe for text data
• Follow the example in
contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
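A sketch of a table backed by RegexSerDe (the table and the regex are simplified illustrations; each capture group maps to one column):
hive> ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
hive> CREATE TABLE access_log (host STRING, request STRING, status STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    > WITH SERDEPROPERTIES (
    >   "input.regex" = "([^ ]*) \"([^\"]*)\" ([0-9]*)"
    > )
    > STORED AS TEXTFILE;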
Text Processing in Hive
Text Processing
Basic String Functions
Parsing URLs with Hive
Numeric format functions
Splitting and Combining strings
Converting Array to Records with Explode
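For instance (a sketch with a hypothetical docs table holding one line of text per row), split() turns each line into an array and explode() emits one row per element:
hive> SELECT explode(split(line, ' ')) AS word FROM docs;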
Regular expressions
Hive’s Regular expressions Functions
Regex SerDe
Creating Table with Regex serDe
Creating Table with Regex serDe
Fixed width formats in Hive
Fixed width formats example
Parsing Sentences into words
Sentiment Analysis
n-grams
Calculating n-grams in Hive
Finding specific n-grams in text
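A sketch of the built-in n-gram estimators (hypothetical reviews table with a comment column): sentences() tokenizes free text, ngrams() returns the top-k n-grams, and context_ngrams() finds the words that most often fill the NULL slots of a given context:
hive> SELECT ngrams(sentences(lower(comment)), 2, 10) FROM reviews;
hive> SELECT context_ngrams(sentences(lower(comment)),
    >        array('i', 'love', null), 10) FROM reviews;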
Calculating data for histograms
File Formats
Hive Persistence Formats
Built-in formats:
– ORCFile – Optimized Row Columnar
– RCFile – Record Columnar File
– Avro – a data serialization system; Avro schemas are defined with JSON, which facilitates implementation in languages that already have JSON libraries
– Delimited Text / Text Files
– Regular Expression – a sequence of symbols and characters expressing a string or pattern to be searched for within a longer piece of text
– S3 Logfile – server access log format
– Typed Bytes
3rd-party add-ons:
– JSON – JavaScript Object Notation
– XML
High performance:
– Inline indexes record value ranges within blocks of ORCFile data.
– Filter pushdown allows efficient scanning during precise queries.
SAMPLE CODES
Hive Configuration
• Default configuration file “hive-site.xml”. We can overwrite using the
following command
$ hive –config /user/lib/hive/hive-conf
• Hive also permits to set the connection properties per session basis
as follows
$ hive –hiveconf fs.defaultFS=hdfs://localhost –hiveconf
mapreduce.framework.name=yarn \
--hiveconf yarn.resourncemanager.address=localhost:8032
Using Apache Tez as the execution engine for Hive
• The default execution engine is MapReduce; we can change to the newer execution framework, Tez.
• Tez is an execution framework built on top of YARN which provides a lower-level API (directed acyclic graphs) than MapReduce. Tez is more flexible and powerful than MapReduce.
• Tez allows applications to improve performance by utilizing more expressive execution patterns than the MapReduce pattern. Hive supports the Tez execution engine as a substitute for the background MapReduce computations.
• Hive converts the Hive queries into Tez execution graphs.
• You can instruct Hive to use Tez (or switch back to MapReduce) as follows:
hive> set hive.execution.engine=tez;
hive> set hive.execution.engine=mr;
To check the current setting, use set as follows:
hive> set hive.execution.engine;
hive> set;     -- lists all the properties set by Hive
(Change settings within a session using the "set" command)
Create database & tables
Load Data - Queries
Hive provides several operators for ordering query results, with subtle differences and performance trade-offs:
ORDER BY: guarantees global ordering of the data using a single reducer
SORT BY: guarantees local ordering of the data output by each reduce task
CLUSTER BY: distributes the data to reduce tasks, avoiding any range overlap, and each reduce task outputs the data in sorted order
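A brief sketch of the difference (hypothetical txnrecords table):
hive> SELECT * FROM txnrecords ORDER BY amount;    -- one reducer, total order
hive> SELECT * FROM txnrecords SORT BY amount;     -- sorted within each reducer only
hive> SELECT * FROM txnrecords CLUSTER BY custid;  -- same as DISTRIBUTE BY custid SORT BY custid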
Schema Violations
• What would happen if we try to insert data that does not comply with the pre-defined schema?
• NULL is set for any value that violates the pre-defined schema
Managing outputs
Inserting Output into another table
hive> INSERT OVERWRITE TABLE results SELECT * FROM txnrecords;
Hive Script – Running Batch commands
Hive scripts are used to execute Hive commands collectively from the terminal. This helps reduce the time and effort invested in writing and executing each command manually.
Steps:
1. Create a file listing all the commands to run, and save it with the extension .hql or .hive
2. Run the command below to execute the file from the terminal:
hive -f <path of the .hql/.hive file>   (e.g.: hive -f /home/user/Desktop/demo.hive)
The script runs all the queries and saves the output in the respective places (wherever you have directed it)
Appendix
Brief About Data Warehouse
• OLAP vs OLTP
• A DW is needed for OLAP
• We want reports and summaries, not the live transactional data used to keep operations running
• We need reports to make operations better, not to conduct the operations themselves!
• We use ETL to populate data in the DW
• Kimball believes in creating several smaller data marts to achieve department-level analysis and reporting.
William Inmon: Father of the Data Warehouse
Ralph Kimball: Father of Business Intelligence
Brief About Data Warehouse (Inmon approach vs Kimball approach)
Brief About Data Warehouse
• Other keywords
• ODS- Operational Data Store
• Star Schema & Snowflake schema
• Fact Tables
• Data Mart
• Dimensions
• Concurrent ETLs
SQL vs. HiveQL(HQL)
Hadoop cluster is not a database server
Hive Vs. Traditional RDBMS
“Schema on read” Accelerates Data Innovation
[Cartoon: with schema-on-write, every question ("Let me see… is it any good?") triggers a new "schema change" project and fresh data collecting before analysis can start; with schema-on-read, the analyst goes straight from the same question to "My model is awesome!"]
Group By
• Like SQL, HiveQL provides support for the "group by" command, in which multiple rows in a table can be collected into groups with the same
values in specific columns. Once the groups are formed, the columns being used in the group by can be accessed as usual in the select statement.
The columns that were not used in the group by should be accessed through aggregation functions, like average or sum. If a column that is not
part of the group by is accessed as a single element in the select statement, this will cause Hive to throw an error. Thankfully, the error displayed
by Hive will alert the user to which column is being used incorrectly. The issue can be resolved by ensuring that all columns in the select
statement are either part of the group by and accessed as a single element, or not part of the group by and accessed through aggregation.
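For instance (hypothetical emp table), every column in the select must either appear in the group by or be aggregated:
hive> SELECT dept, avg(sal), count(1) FROM emp GROUP BY dept;  -- OK
hive> SELECT dept, name FROM emp GROUP BY dept;                -- error: name is neither grouped nor aggregated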
Collect Set
• One additional thing to consider when using group by is the existence of a special aggregation function: collect_set. This allows a column not
used in the group by to be aggregated into a set. The values in the set are accessible using normal array-like syntax and can be used the same way
as any column in the original table. The elements in the array will not be sorted, so any ordering will need to happen through the use of a user
defined function. Additionally, Hive will not evaluate an expression to calculate an index.
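A collect_set sketch (same hypothetical emp table): the non-grouped column is gathered into an array per group and read back with array syntax:
hive> SELECT dept, collect_set(name) AS names FROM emp GROUP BY dept;
hive> SELECT dept, collect_set(name)[0] FROM emp GROUP BY dept;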
Limit
• One effective way to understand the results of a query is to run it and use the "limit" function to view a small subset of the results. It is important
to keep in mind that unless the "sort by" command is used as part of the query, the result will not be ordered in any way.
Hive Vs. SQL
• The main difference between RDBMS databases and Hive is specialization. While MySQL is a general-purpose database suited both to transactional processing (OLTP) and to analytics (OLAP), Hive is built for analytics only. Technically, the main difference is the lack of update/delete functionality: data can only be added and selected. At the same time, Hive is capable of processing data volumes that cannot be processed by MySQL or other conventional RDBMS (on a modest budget).
• MPP (massively parallel processing) databases are closest to Hive in functionality: while they have full SQL support, they are scalable up to hundreds of computers. Another serious difference is the query language.
• Hive does not support full SQL, even in SELECT, because of its implementation. The main limitation is the lack of joins on any condition other than equality. Hive's query syntax is also a bit different, so you cannot connect report-generation software directly to Hive.
How to improve efficiency?
Performance Question
• Which of the following is faster?
• select count(distinct(Col)) from Tbl
• select count(*) from (select distinct(Col) from Tbl) t
Count distinct
Answer
Surprisingly the second is usually faster
• In the first case:
• Maps send each value to the reduce
• Single reduce counts them all
• In the second case:
• Maps split up the values to many reduces
• Each reduce generates its list
• Final job counts the size of each list
• Singleton reduces are almost always BAD
Going Fast in Hadoop
Hadoop:
• Really good at coordinated sequential scans.
• No random I/O. Traditional index pretty much useless.
Skipping data:
• Divide data among different files which can be pruned out.
• Partitions, buckets and skews.
• Skip records during scans using small embedded indexes.
• Automatic when you use ORCFile format.
• Sort data ahead of time.
• Simplifies joins and skipping becomes more effective.
Data Layout Considerations for Fast Hive
Hive Fast Query Check List
For Even More Performance
Future Trends?
Hive – Hivemall
Hive – Shark
Shark is a large-scale data warehouse system for Spark designed to be compatible
with Apache Hive.
It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries.
Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.
Further Reading
• Apache Drill
• Software framework that supports data-intensive, distributed applications for interactive analysis of large-scale datasets
• Pig
• Platform for creating and running MapReduce programs on Hadoop
• Oracle Big Data
• DB2 10 and InfoSphere Warehouse
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE
References
• https://www.facebook.com/note.php?note_id=89508453919
• https://github.com/facebook/scribe
• http://sqoop.apache.org/docs/
• http://flume.apache.org/FlumeDeveloperGuide.html
• Sqoop Database Import for Hadoop, Cloudera, Oct. 2009
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• http://www.semantikoz.com/blog/the-free-apache-hive-book/
• Beginning Microsoft® SQL Server® 2012 Programming, Wiley, Paul Atkinson and Robert Vieira, ISBN: 978-1-118-10228-2
• Hive – A Petabyte Scale Data Warehouse Using Hadoop, Facebook team, 2009
Contact Us
Visit us on: http://www.analytixlabs.in/
Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/