
Impala

Cloudera Impala is a distributed SQL query engine that allows users to query data stored in HDFS and HBase. It consists of daemon processes that run on cluster nodes and allows SQL queries through interfaces like Impala-shell. Impala stores table definitions in the Hive metastore and can access Hive tables, but uses its own query engine rather than MapReduce so queries are faster than Hive. It can also work with HDFS to store data files or with HBase as an alternative to HDFS storage.

Uploaded by

chandra reddy

Overview of Cloudera Impala

Objectives

After completing this lesson, you should be able to:

• Describe the features of Cloudera Impala
• Explain how Impala works with Hive, HDFS, and HBase

Hadoop: Some Data Access/Processing Options

Component   Purpose
Hive        Puts a partial SQL interface in front of Hadoop. Includes a metadata “repository” called the Metastore.
Pig         A high-level scripting language (Pig Latin), implemented in Java, for MapReduce programming
HBase       Applies a column-family (partially columnar) storage scheme on top of HDFS
Impala      A database-like SQL layer on top of Hadoop

Cloudera Impala

• The Impala server is a distributed, massively parallel
processing (MPP) database engine.
• It consists of different daemon processes that run on
specific hosts within your CDH cluster.
• The core Impala component is a daemon process that runs
on each node of the cluster.
• SQL is the primary development language.

Cloudera Impala: Key Features

• Open source and Apache-licensed
• MPP architecture
• Interactive analysis on data stored in HDFS and HBase
• Incorporates native Hadoop security
• Provides ANSI SQL support
• Shares workload management with Apache
• Supports common Hadoop file formats

Cloudera Impala: Programming Interfaces

You can connect and submit requests to the Impala daemons
through:
• The impala-shell interactive command interpreter
• The Hue web-based user interface
• JDBC and ODBC
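
As a rough illustration of the first interface, the sketch below builds a one-shot impala-shell command line. It assumes the `-i` (impalad host and port) and `-q` (query string) options of impala-shell; the host name, the port default of 21000, and the `web_logs` table are hypothetical placeholders, not values from this lesson. The command is only printed, not executed.

```python
import shlex


def impala_shell_command(query, host="impalad-host.example.com", port=21000):
    """Build the argument list for a one-shot impala-shell call.

    host and port are illustrative assumptions; substitute the address
    of an Impala daemon in your own cluster.
    """
    return ["impala-shell", "-i", f"{host}:{port}", "-q", query]


# Hypothetical table name, used only for illustration.
argv = impala_shell_command("SELECT COUNT(*) FROM web_logs")

# Print a copy-pasteable shell command with safe quoting.
print(shlex.join(argv))
```

Keeping the arguments as a list rather than a single string lets the same value be passed to `subprocess.run` without shell-quoting problems.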

How Impala Fits Into the Hadoop Ecosystem

Makes use of components within the Hadoop ecosystem:
• Provides a SQL layer on Hadoop
• May interchange data with other Hadoop components
• Can assist in ETL processes

How Impala Works

Impala does not use MapReduce; its own long-running daemon
processes execute queries. It reads data directly from storage,
primarily the Hadoop Distributed File System (HDFS), which it
uses only to store the data. For this reason it is often
described simply as “SQL on HDFS”.

Hive, by contrast, runs on top of Hadoop as a whole, which
includes both HDFS and MapReduce. Executing a Hive query
launches a series of MapReduce jobs that run until the results
are produced.

Because Impala does not have to translate a SQL query into
another processing framework such as map/shuffle/reduce, it
avoids the latencies that those operations impose. This makes
Impala much faster than Hive on performance benchmarks.
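
The MPP idea behind this can be sketched as a toy scatter-gather aggregation: every node aggregates its own local rows in parallel, and a coordinator merges the partial results. This is only a simplified model under stated assumptions, not Impala's actual execution engine; the node names, the sample rows, and the function names are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (status, bytes) rows held locally by three cluster nodes.
NODE_DATA = {
    "node1": [("200", 512), ("404", 128), ("200", 256)],
    "node2": [("200", 1024), ("500", 64)],
    "node3": [("404", 32), ("200", 128)],
}


def local_aggregate(rows):
    """Per-node step: partial SUM(bytes) GROUP BY status over local rows."""
    partial = Counter()
    for status, nbytes in rows:
        partial[status] += nbytes
    return partial


def run_query(node_data):
    """Coordinator step: fan out to all nodes in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(local_aggregate, node_data.values()))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the per-status sums together
    return dict(total)


result = run_query(NODE_DATA)
print(result)
```

The point of the sketch is that the work is handed straight to processes that are already running next to the data, with no intermediate job launches between the query and the result.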
How Impala Works with Hive

• Uses existing Hive infrastructure
• Stores its table definitions in the Hive Metastore
• Accesses Hive tables
• Focuses on query performance

How Impala Works with HDFS and HBase

• HDFS
– Impala’s primary storage mechanism
– Data stored as data files
• HBase
– Alternative to HDFS to store Impala data
– Impala table definition can be mapped to HBase tables
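
To make the HBase mapping concrete, the sketch below generates the kind of DDL commonly used to map a SQL table definition onto an existing HBase table through the Hive HBase storage handler. It assumes the statement is issued through Hive and that Impala then picks up the definition from the shared Metastore (typically after an `INVALIDATE METADATA`). The table, column, and column-family names are hypothetical, and the helper function is an illustration only.

```python
def hbase_mapped_table_ddl(table, hbase_table, key_col, columns):
    """Return DDL that maps a SQL table onto an existing HBase table.

    columns is a list of (sql_name, sql_type, "family:qualifier") tuples.
    The row key is exposed as a STRING column named key_col.
    """
    col_defs = [f"  {key_col} STRING"]
    col_defs += [f"  {name} {sql_type}" for name, sql_type, _ in columns]
    # ":key" maps the first SQL column to the HBase row key.
    mapping = ",".join([":key"] + [hbase_col for _, _, hbase_col in columns])
    lines = [
        f"CREATE EXTERNAL TABLE {table} (",
        ",\n".join(col_defs),
        ")",
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'",
        f'WITH SERDEPROPERTIES ("hbase.columns.mapping" = "{mapping}")',
        f'TBLPROPERTIES ("hbase.table.name" = "{hbase_table}");',
    ]
    return "\n".join(lines)


# Hypothetical names, for illustration only.
ddl = hbase_mapped_table_ddl(
    table="users_hbase",
    hbase_table="users",
    key_col="user_id",
    columns=[("name", "STRING", "info:name"), ("age", "INT", "info:age")],
)
print(ddl)
```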

Summary of Cloudera Impala Benefits

• MPP performance (uses its own MPP query engine)
• Cost savings
• Analysis of raw and historical data
• Security
