Cloudera Introduction
Important Notice
© 2010-2019 Cloudera, Inc. All rights reserved.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software
Foundation. All other trademarks, registered trademarks, product names and company
names or logos mentioned in this document are the property of their respective owners.
Reference to any products, services, processes or other information, by trade name,
trademark, manufacturer, supplier or otherwise does not constitute or imply
endorsement, sponsorship or recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced, stored
in or introduced into a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopying, recording, or otherwise), or for any purpose,
without the express written permission of Cloudera.
The information in this document is subject to change without notice. Cloudera shall
not be liable for any damages resulting from technical errors or omissions which may
be present in this document, or from use of this document.
Cloudera, Inc.
395 Page Mill Road
Palo Alto, CA 94306
[email protected]
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
CDH Overview
  Apache Impala Overview
    Impala Benefits
    How Impala Works with CDH
    Primary Impala Features
  Cloudera Search Overview
    How Cloudera Search Works
    Understanding Cloudera Search
    Cloudera Search and Other Cloudera Components
    Cloudera Search Architecture
    Cloudera Search Tasks and Processes
  Apache Sentry Overview
  Apache Spark Overview
  File Formats and Compression
    Using Apache Parquet Data Files with CDH
    Using Apache Avro Data Files with CDH
    Data Compression
  External Documentation
Getting Support
  Cloudera Support
    Information Required for Logging a Support Case
  Community Support
  Get Announcements about New Releases
  Report Issues
Documentation Overview
The following guides are included in the Cloudera documentation set:
Overview of Cloudera and the Cloudera Documentation Set: Cloudera provides a scalable, flexible, integrated platform that makes it easy to manage rapidly increasing volumes and varieties of data in your enterprise. Cloudera products and solutions enable you to deploy and manage Apache Hadoop and related projects, manipulate and analyze your data, and keep that data secure and protected.
Cloudera Release Notes: This guide contains release and download information for installers and administrators. It includes release notes as well as information about versions and downloads. The guide also provides a release matrix that shows which major and minor release version of a product is supported with which release version of Cloudera Manager, CDH and, if applicable, Cloudera Impala.
Cloudera QuickStart: This set of guides describes ways to rapidly begin experimenting with Cloudera software. The first section describes how to download and use QuickStart virtual machines, which provide everything you need to try CDH, Cloudera Manager, Impala, and Cloudera Search. Subsequent sections show you how to create a new installation of Cloudera Manager 5, CDH 5, and managed services on a cluster of four hosts and an unmanaged CDH pseudo cluster. Quick start installations are for demonstration and POC applications only and are not recommended for production use.
Cloudera Installation: This guide provides instructions for installing Cloudera software.
Cloudera Upgrade: This topic provides an overview of upgrade procedures for Cloudera Manager and CDH. The procedures described here use Cloudera Manager to perform some or all of the upgrade steps. You can also upgrade unmanaged CDH clusters (clusters that are not managed by Cloudera Manager); see Upgrading Unmanaged CDH Using the Command Line.
Cloudera Administration: This guide describes how to configure and administer a Cloudera deployment. Administrators manage resources, availability, and backup and recovery configurations. In addition, this guide shows how to implement high availability, and discusses integration.
Cloudera Data Management: This guide describes how to perform data management using Cloudera Navigator. Data management activities include auditing access to data residing in HDFS and Hive metastores, reviewing and updating metadata, and discovering the lineage of data objects.
Cloudera Operation: This guide shows how to monitor the health of a Cloudera deployment and diagnose issues. You can obtain metrics and usage information and view processing activities. This guide also describes how to examine logs and reports to troubleshoot issues with cluster configuration and operation as well as monitor compliance.
Cloudera Security: This guide is intended for system administrators who want to secure a cluster using data encryption, user authentication, and authorization techniques. It provides conceptual overviews and how-to information about setting up various Hadoop components for optimal security, including how to set up a gateway to restrict access.
Apache Impala - Interactive SQL: This guide describes Impala, its features and benefits, and how it works with CDH. This topic introduces Impala concepts, describes how to plan your Impala deployment, and provides tutorials for first-time users as well as more advanced tutorials that describe scenarios and specialized features. You will also find a language reference, performance tuning guidelines, instructions for using the Impala shell, troubleshooting information, and frequently asked questions.
Cloudera Search Guide: This guide explains how to configure and use Cloudera Search. This includes topics such as extracting, transforming, and loading data, establishing high availability, and troubleshooting.
Spark Guide: This guide describes Apache Spark, a general framework for distributed computing that offers high performance for both batch and interactive processing. The guide includes tutorial Spark applications and describes how to develop and run Spark applications and how to use Spark with other Hadoop components.
Cloudera Glossary: This guide contains a glossary of terms for Cloudera components.
CDH Overview
CDH is the most complete, tested, and popular distribution of Apache Hadoop and related projects. CDH delivers the
core elements of Hadoop – scalable storage and distributed computing – along with a Web-based user interface and
vital enterprise capabilities. CDH is Apache-licensed open source and is the only Hadoop solution to offer unified batch
processing, interactive SQL and interactive search, and role-based access controls.
CDH provides:
• Flexibility—Store any type of data and manipulate it with a variety of different computation frameworks including
batch processing, interactive SQL, free text search, machine learning and statistical computation.
• Integration—Get up and running quickly on a complete Hadoop platform that works with a broad range of hardware
and software solutions.
• Security—Process and control sensitive data.
• Scalability—Enable a broad range of applications and scale and extend them to suit your requirements.
• High availability—Perform mission-critical business tasks with confidence.
• Compatibility—Leverage your existing IT infrastructure and investment.
For information about CDH components, which is out of scope for Cloudera documentation, see the links in External Documentation.
Apache Impala Overview
Impala is an addition to tools available for querying big data. Impala does not replace the batch processing frameworks
built on MapReduce such as Hive. Hive and other frameworks built on MapReduce are best suited for long running
batch jobs, such as those involving batch processing of Extract, Transform, and Load (ETL) type jobs.
Note: Impala graduated from the Apache Incubator on November 15, 2017. In places where the
documentation formerly referred to “Cloudera Impala”, now the official name is “Apache Impala”.
Impala Benefits
Impala provides:
• Familiar SQL interface that data scientists and analysts already know.
• Ability to query high volumes of data (“big data”) in Apache Hadoop.
• Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective commodity
hardware.
• Ability to share data files between different components with no copy or export/import step; for example, to
write with Pig, transform with Hive and query with Impala. Impala can read from and write to Hive tables, enabling
simple data interchange using Impala for analytics on Hive-produced data.
• Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just for
analytics.
How Impala Works with CDH
Queries are processed as follows:
1. User applications send SQL queries to Impala through ODBC or JDBC, which provide standardized querying
interfaces. The user application may connect to any impalad in the cluster. This impalad becomes the coordinator
for the query.
2. Impala parses the query and analyzes it to determine what tasks need to be performed by impalad instances
across the cluster. Execution is planned for optimal efficiency.
3. Services such as HDFS and HBase are accessed by local impalad instances to provide data.
4. Each impalad returns data to the coordinating impalad, which sends these results to the client.
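For example, a client application can submit a query through the Hive JDBC driver, which Impala supports. The following is a minimal sketch; the host name impala-host, the port 21050, the unsecured auth=noSasl setting, and the table name are placeholder assumptions rather than values from this guide:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcExample {
    public static void main(String[] args) throws Exception {
        // impala-host, port 21050, and sample_table are placeholder assumptions;
        // the connection can be made to any impalad, which then coordinates the query
        String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sample_table")) {
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}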
Cloudera Search Overview
Cloudera Search provides the following features:
Unified management and monitoring with Cloudera Manager: Cloudera Manager provides unified and centralized management and monitoring for CDH and Cloudera Search. Cloudera Manager simplifies deployment, configuration, and monitoring of your search services. Many existing search solutions lack management and monitoring capabilities and fail to provide deep insight into utilization, system health, trending, and other supportability aspects.
Index storage in HDFS: Cloudera Search is integrated with HDFS for index storage. Indexes created
by Solr/Lucene can be directly written in HDFS with the data, instead of to
local disk, thereby providing fault tolerance and redundancy.
Cloudera Search is optimized for fast read and write of indexes in HDFS while
indexes are served and queried through standard Solr mechanisms. Because
data and indexes are co-located, data processing does not require transport
or separately managed storage.
Batch index creation through MapReduce: To facilitate index creation for large data sets, Cloudera Search has built-in MapReduce jobs for indexing data stored in HDFS. As a result, the linear
scalability of MapReduce is applied to the indexing pipeline.
Real-time and scalable indexing at data ingest: Cloudera Search provides integration with Flume to support near real-time indexing. As new events pass through a Flume hierarchy and are written to
HDFS, those events can be written directly to Cloudera Search indexers.
In addition, Flume supports routing events, filtering, and annotation of data
passed to CDH. These features work with Cloudera Search for improved index
sharding, index separation, and document-level access control.
Easy interaction and data exploration through Hue: A Cloudera Search GUI is provided as a Hue plug-in, enabling users to interactively query data, view result files, and do faceted exploration. Hue can
also schedule standing queries and explore index files. This GUI uses the
Cloudera Search API, which is based on the standard Solr API.
Simplified data processing for Search workloads: Cloudera Search relies on Apache Tika for parsing and preparation of many of the standard file formats for indexing. Additionally, Cloudera Search supports
Avro, Hadoop Sequence, and Snappy file format mappings, as well as Log file
formats, JSON, XML, and HTML. Cloudera Search also provides data
preprocessing using Morphlines, which simplifies index configuration for these
formats. Users can use the configuration for other applications, such as
MapReduce jobs.
HBase search: Cloudera Search integrates with HBase, enabling full-text search of stored
data without affecting HBase performance. A listener monitors the replication
event stream and captures each write or update-replicated event, enabling
extraction and mapping. The event is then sent directly to Solr indexers and
written to indexes in HDFS, using the same process as for other indexing
workloads of Cloudera Search. The indexes can be served immediately, enabling
near real-time search of HBase data.
How Cloudera Search Works
Search queries can be submitted to Solr through either the standard Solr API, or through a simple search GUI application, included in Cloudera Search, which can be deployed in Hue.
Cloudera Search batch-oriented indexing capabilities can address needs for searching across batch uploaded files or
large data sets that are less frequently updated and less in need of near-real-time indexing. For such cases, Cloudera
Search includes a highly scalable indexing workflow based on MapReduce. A MapReduce workflow is launched onto
specified files or folders in HDFS, and the field extraction and Solr schema mapping is run during the mapping phase.
Reducers use Solr to write the data as a single index or as index shards, depending on your configuration and preferences.
Once the indexes are stored in HDFS, they can be queried using standard Solr mechanisms, as previously described for the near-real-time indexing use case.
The Lily HBase Indexer Service is a flexible, scalable, fault tolerant, transactional, near real-time oriented system for
processing a continuous stream of HBase cell updates into live search indexes. Typically, the time between data ingestion
using the Flume sink to that content potentially appearing in search results is measured in seconds, although this
duration is tunable. The Lily HBase Indexer uses Solr to index data stored in HBase. As HBase applies inserts, updates,
and deletes to HBase table cells, the indexer keeps Solr consistent with the HBase table contents, using standard HBase
replication features. The indexer supports flexible custom application-specific rules to extract, transform, and load
HBase data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in
HBase. This way applications can use the Search result set to directly access matching raw HBase cells. Indexing and
searching do not affect operational stability or write throughput of HBase because the indexing and searching processes
are separate and asynchronous to HBase.
Cloudera Search Architecture
Each Cloudera Search server can handle requests for information. As a result, a client can send requests to index
documents or perform searches to any Search server, and that server routes the request to the correct server.
Each search deployment requires:
• ZooKeeper on one host. You can install ZooKeeper, Search, and HDFS on the same host.
• HDFS on at least one but as many as all hosts. HDFS is commonly installed on all hosts.
• Solr on at least one but as many as all hosts. Solr is commonly installed on all hosts.
Adding more hosts with Solr and HDFS provides the following benefits:
• More search host installations doing work.
• Greater colocation of Search and HDFS, which increases data locality. More local data provides faster performance and reduces network traffic.
The following graphic illustrates some of the key elements in a typical deployment.
• For actions related to collections, such as adding or deleting collections, the name of the collection is required as
well.
• Indexing jobs, such as MapReduceIndexer jobs, use a MapReduce driver that starts a MapReduce job. These jobs
can also process morphlines, indexing the results to add to Solr.
Search can be deployed using parcels or packages. Some files are always installed to the same location and some files
are installed to different locations based on whether the installation is completed using parcels or packages.
Client Files
Client files are always installed to the same location and are required on any host where corresponding services are
installed. In a Cloudera Manager environment, Cloudera Manager manages settings. In an unmanaged deployment,
all files can be manually edited. All files are found in a subdirectory of /etc/. Client configuration file types and their
locations are:
• /etc/solr/conf for Solr client settings files
• /etc/hadoop/conf for HDFS, MapReduce, and YARN client settings files
• /etc/zookeeper/conf for ZooKeeper configuration files
Server Files
Server configuration file locations vary based on how services are installed.
• Cloudera Manager environments store all configuration files in /var/run/.
• Unmanaged environments store configuration files in /etc/svc/conf. For example:
– /etc/solr/conf
– /etc/zookeeper/conf
– /etc/hadoop/conf
Cloudera Search Tasks and Processes
Ingestion
You can move content to CDH by using:
• Flume, a flexible, agent-based data ingestion framework.
• A copy utility such as distcp for HDFS.
• Sqoop, a structured data ingestion connector.
• fuse-dfs.
In a typical environment, administrators establish systems for search. For example, HDFS is established to provide
storage; Flume or distcp are established for content ingestion. After administrators establish these services, users
can use ingestion tools such as file copy utilities or Flume sinks.
Indexing
Content must be indexed before it can be searched. Indexing comprises the following steps:
1. Extraction, transformation, and loading (ETL) - Use existing engines or frameworks such as Apache Tika or Cloudera
Morphlines.
a. Content and metadata extraction
b. Schema mapping
2. Create indexes using Lucene.
a. Index creation
b. Index serialization
Indexes are typically stored on a local file system. Lucene supports additional index writers and readers. One HDFS-based
interface implemented as part of Apache Blur is integrated with Cloudera Search and has been optimized for CDH-stored
indexes. All index data in Cloudera Search is stored in and served from HDFS.
You can index content in three ways:
Batch indexing using MapReduce
Batch indexing is most often used when bootstrapping a search cluster. The Map component of the MapReduce task
parses input into indexable documents, and the Reduce component contains an embedded Solr server that indexes
the documents produced by the Map. You can also configure a MapReduce-based indexing job to use all assigned
resources on the cluster, utilizing multiple reducing steps for intermediate indexing and merging operations, and then
writing the reduction to the configured set of shard sets for the service. This makes the batch indexing process as
scalable as MapReduce workloads.
NRT indexing using some other client that uses the NRT API
Other clients can complete NRT indexing. This is done when the client first writes files directly to HDFS and then triggers
indexing using the Solr REST API. Specifically, the API does the following:
1. Extract content from the document contained in HDFS, where the document is referenced by a URL.
2. Map the content to fields in the search schema.
3. Create or update a Lucene index.
This is useful if you index as part of a larger workflow. For example, you could trigger indexing from an Oozie workflow.
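For illustration, the following is a minimal sketch of the create-or-update step using the SolrJ client rather than the raw REST endpoint; the Solr URL, collection name, field names, and Builder-style client API are placeholder assumptions:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NrtIndexExample {
    public static void main(String[] args) throws Exception {
        // search-host, collection1, and the field names are placeholders;
        // the fields must exist in the collection's schema
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://search-host:8983/solr/collection1").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "hdfs:///user/etl/docs/doc1.txt");
            doc.addField("text", "contents extracted from the HDFS file");
            solr.add(doc);
            solr.commit(); // make the document visible to searches
        }
    }
}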
Querying
After data is available as an index, the query API provided by the search service allows direct queries to be completed
or to be facilitated through a command-line tool or graphical interface. Cloudera Search provides a simple UI application
that can be deployed with Hue, or you can create a custom application based on the standard Solr API. Any application
that works with Solr is compatible and runs as a search-serving application for Cloudera Search, because Solr is the
core.
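As a sketch of such a custom application built on the standard Solr API, the following queries a collection through SolrJ; the Solr URL, collection name, query field, and Builder-style client API are placeholder assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchQueryExample {
    public static void main(String[] args) throws Exception {
        // search-host, collection1, and the text field are placeholders
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://search-host:8983/solr/collection1").build()) {
            SolrQuery query = new SolrQuery("text:hadoop");
            query.setRows(10); // return at most 10 matching documents
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}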
Apache Spark Overview
Note:
This page contains information related to Spark 1.6, which is included with CDH. For information about
the separately available parcel for CDS 2 Powered by Apache Spark, see the documentation for CDS
2.
Unsupported Features
The following Spark features are not supported:
• Spark SQL:
– Thrift JDBC/ODBC server
– Spark SQL CLI
• Spark Dataset API
• SparkR
• GraphX
• Spark on Scala 2.11
• Mesos cluster manager
Related Information
• Managing Spark
• Monitoring Spark Applications
• Spark Authentication
• Spark Encryption
• Cloudera Spark forum
• Apache Spark documentation
File Formats and Compression
Important:
The configuration property serialization.null.format is set in Hive and Impala engines as
SerDes or table properties to specify how to serialize/deserialize NULL values into a storage format.
This configuration option is suitable for text file formats only. If used with binary storage formats such
as RCFile or Parquet, the option causes compatibility, complexity and efficiency issues.
All file formats include support for compression, which affects the size of data on the disk and, consequently, the
amount of I/O and CPU resources required to serialize and deserialize data.
Using Apache Parquet Data Files with CDH
CDH lets you use the component of your choice with the Parquet file format for each phase of data processing. For
example, you can read and write Parquet files using Pig and MapReduce jobs. You can convert, transform, and query
Parquet tables through Hive, Impala, and Spark. And you can interchange data files between all of these components.
Note:
• Once you create a Parquet table, you can query it or insert into it through other components
such as Impala and Spark.
• Set dfs.block.size to 256 MB in hdfs-site.xml.
If the table will be populated with data files generated outside of Impala and Hive, you can create the table as an external table pointing to the location where the files will be created; see the sketch below.
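For example, such a table definition can be issued through the Hive JDBC driver. The following is a minimal sketch; the connection URL, table name, columns, and HDFS location are placeholder assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateExternalParquetTable {
    public static void main(String[] args) throws Exception {
        // hive-host, the table name, columns, and HDFS path are all placeholders
        String url = "jdbc:hive2://hive-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // External table: Hive tracks the metadata, the files stay in place
            stmt.execute(
                "CREATE EXTERNAL TABLE parquet_table (x INT, y STRING) "
                + "STORED AS PARQUET "
                + "LOCATION '/user/etl/parquet_table'");
        }
    }
}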
To populate the table with an INSERT statement, and to read the table with a SELECT statement, see Using the Parquet
File Format with Impala Tables.
To set the compression type to use when writing data, configure the parquet.compression property:
set parquet.compression=GZIP;
INSERT OVERWRITE TABLE tinytable SELECT * FROM texttable;
Once you create a Parquet table this way in Impala, you can query it or insert into it through either Impala or Hive.
The Parquet format is optimized for working with large data files. In Impala 2.0 and higher, the default size of Parquet
files written by Impala is 256 MB; in lower releases, 1 GB. Avoid using the INSERT ... VALUES syntax, or partitioning
the table at too granular a level, if that would produce a large number of small files that cannot use Parquet optimizations
for large data chunks.
Inserting data into a partitioned Impala table can be a memory-intensive operation, because each data file requires a
memory buffer to hold the data before it is written. Such inserts can also exceed HDFS limits on simultaneous open
files, because each node could potentially write to a separate data file for each partition, all at the same time. Make
sure table and column statistics are in place for any table used as the source for an INSERT ... SELECT operation
into a Parquet table. If capacity problems still occur, consider splitting insert operations into one INSERT statement
per partition.
Impala can query Parquet files that use the PLAIN, PLAIN_DICTIONARY, BIT_PACKED, and RLE encodings. Currently,
Impala does not support RLE_DICTIONARY encoding. When creating files outside of Impala for use by Impala, make
sure to use one of the supported encodings. In particular, for MapReduce jobs, parquet.writer.version must not
be defined (especially as PARQUET_2_0) for writing the configurations of Parquet MR jobs. Use the default version (or
format). The default format, 1.0, includes some enhancements that are compatible with older versions. Data using the
2.0 format might not be consumable by Impala, due to use of the RLE_DICTIONARY encoding.
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE,
DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is
represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala
interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type; for example, in Impala, CAST(ts_col DIV 1000 AS TIMESTAMP) treats the hypothetical column ts_col as seconds.
For complete instructions and examples, see Using the Parquet File Format with Impala Tables.
MapReduce needs Thrift in its CLASSPATH and in libjars to access Parquet files; it also needs parquet-format in libjars. Perform the following setup before running MapReduce jobs that access Parquet data files:
if [ -e /opt/cloudera/parcels/CDH ] ; then
CDH_BASE=/opt/cloudera/parcels/CDH
else
CDH_BASE=/usr
fi
THRIFTJAR=`ls -l $CDH_BASE/lib/hive/lib/libthrift*jar | awk '{print $9}' | head -1`
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$THRIFTJAR
export LIBJARS=`echo "$CLASSPATH" | awk 'BEGIN { RS = ":" } { print }' | grep parquet-format | tail -1`
export LIBJARS=$LIBJARS,$THRIFTJAR
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import parquet.Log;
import parquet.example.data.Group;
import parquet.hadoop.example.ExampleInputFormat;
/*
* Read a Parquet record
*/
public static class MyMap extends
Mapper<LongWritable, Group, NullWritable, Text> {
@Override
public void map(LongWritable key, Group value, Context context) throws IOException,
InterruptedException {
NullWritable outKey = NullWritable.get();
String outputRecord = "";
// Get the schema and field values of the record
String inputRecord = value.toString();
// Process the value, create an output record
// ...
context.write(outKey, new Text(outputRecord));
}
}
job.setJarByClass(getClass());
job.setJobName(getClass().getName());
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MyMap.class);
job.setNumReduceTasks(0);
job.setInputFormatClass(ExampleInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.waitForCompletion(true);
return 0;
}
...
import parquet.Log;
import parquet.example.data.Group;
import parquet.hadoop.example.GroupWriteSupport;
import parquet.hadoop.example.ExampleInputFormat;
import parquet.hadoop.example.ExampleOutputFormat;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.ParquetMetadata;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;
import parquet.schema.Type;
...
public int run(String[] args) throws Exception {
...
job.submit();
If input files are in Parquet format, the schema can be extracted using the getSchema method:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.RemoteIterator;
...
RemoteIterator<LocatedFileStatus> it = FileSystem.get(getConf()).listFiles(new Path(inputFile), true);
while(it.hasNext()) {
FileStatus fs = it.next();
if(fs.isFile()) {
parquetFilePath = fs.getPath();
break;
}
}
if(parquetFilePath == null) {
LOG.error("No file found for " + inputFile);
return 1;
}
ParquetMetadata readFooter =
ParquetFileReader.readFooter(getConf(), parquetFilePath);
MessageType schema =
readFooter.getFileMetaData().getSchema();
GroupWriteSupport.setSchema(schema, getConf());
job.submit();
You can then write records in the mapper by composing a Group value using the Example classes and no key, as in the sketch below.
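The following is a minimal sketch of such a mapper, assuming line-oriented text input and a schema containing a single string field named line (both assumptions); the Example classes come from the same parquet packages imported earlier:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import parquet.example.data.Group;
import parquet.example.data.simple.SimpleGroupFactory;
import parquet.hadoop.example.GroupWriteSupport;

public static class MyWriteMap extends Mapper<LongWritable, Text, Void, Group> {
    private SimpleGroupFactory factory;

    @Override
    protected void setup(Context context) {
        // Build Groups against the schema registered with GroupWriteSupport
        factory = new SimpleGroupFactory(
            GroupWriteSupport.getSchema(context.getConfiguration()));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "line" is a placeholder field name that must exist in the schema
        Group group = factory.newGroup().append("line", value.toString());
        // Write with a null key; only the Group is stored in the Parquet file
        context.write(null, group);
    }
}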
To set the compression type before submitting the job, invoke the setCompression method:
ExampleOutputFormat.setCompression(job, compression_type);
To set the compression type, configure the parquet.compression property before the first store instruction in a Pig script, for example:
set parquet.compression gzip;
The supported compression types are uncompressed, gzip, and snappy (the default).
To set the compression type when writing Parquet files from Spark SQL, configure the spark.sql.parquet.compression.codec property:
sqlContext.setConf("spark.sql.parquet.compression.codec","codec")
The supported codec values are: uncompressed, gzip, lzo, and snappy. The default is gzip.
Currently, Spark looks up column data from Parquet files by using the names stored within the data files. This is different
than the default Parquet lookup behavior of Impala and Hive. If data files are produced with a different physical layout
due to added or reordered columns, Spark still decodes the column data correctly. If the logical layout of the table is
changed in the metastore database, for example through an ALTER TABLE CHANGE statement that renames a column,
Spark still looks for the data using the now-nonexistent column name and returns NULLs when it cannot locate the
column values. To avoid behavior differences between Spark and Impala or Hive when modifying Parquet tables, avoid
renaming columns, or use Impala, Hive, or a CREATE TABLE AS SELECT statement to produce a new table and new
set of Parquet files containing embedded column names that match the new layout.
For an example of writing Parquet files to Amazon S3, see Reading and Writing Data Sources From and To Amazon S3.
For general information and examples of Spark working with data in different file formats, see Accessing External
Storage from Spark.
You can examine the internal structure and data of Parquet files with the parquet-tools command-line utility, which supports the following arguments:
• cat: Print a file's contents to standard out. In CDH 5.5 and higher, you can use the -j option to output JSON.
• head: Print the first few records of a file to standard output.
• schema: Print the Parquet schema for the file.
• meta: Print the file footer metadata, including key-value properties (like Avro schema), compression ratios,
encodings, compression used, and row group information.
• dump: Print all data and metadata.
Use parquet-tools -h to see usage information for all the arguments. Here are some examples showing
parquet-tools usage:
$ # Be careful doing this for a big file! Use parquet-tools head to be safe.
$ parquet-tools cat sample.parq
year = 1992
month = 1
day = 2
dayofweek = 4
dep_time = 748
crs_dep_time = 750
arr_time = 851
crs_arr_time = 846
carrier = US
flight_num = 53
actual_elapsed_time = 63
crs_elapsed_time = 56
arrdelay = 5
depdelay = -2
origin = CMH
dest = IND
distance = 182
cancelled = 0
diverted = 0
year = 1992
month = 1
day = 3
...
$ parquet-tools schema sample.parq
message schema {
optional int32 year;
optional int32 month;
optional int32 day;
optional int32 dayofweek;
optional int32 dep_time;
optional int32 crs_dep_time;
optional int32 arr_time;
optional int32 crs_arr_time;
optional binary carrier;
optional int32 flight_num;
...
Using Apache Avro Data Files with CDH
To serialize Flume events as Avro data files, set the Flume sink's serializer to AVRO_EVENT:
agent-name.sinks.sink-name.serializer = AVRO_EVENT
In Hive, you can create an Avro-backed table by embedding the Avro schema in the avro.schema.literal table property. The end of one such schema, showing an added field with a default value, looks like this:
{
"name":"last_name",
"type":"string",
"doc":"last name of actor playing role"
},
{
"name":"extra_field",
"type":"string",
"doc:":"an extra field not in the original file",
"default":"fishfingers and custard"
}
]
}');
You can also create an Avro-backed Hive table by using an Avro schema file, referenced through the avro.schema.url table property. avro.schema.url is a URL (for example, a file:// URL) pointing to an Avro schema file used for reading and writing. It could also be an hdfs: URL; for example, hdfs://hadoop-namenode-uri/examplefile.
To enable Snappy compression on output files, run the following before writing to the table:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
The Haivvreo SerDe has been merged into Hive as AvroSerDe and is no longer supported in its original form. schema.url and schema.literal have been changed to avro.schema.url and avro.schema.literal as a result of the merge. If you were using the Haivvreo SerDe, you can use the Hive AvroSerDe with tables created with the Haivvreo SerDe. For example, if you have a table my_avro_table that uses the Haivvreo SerDe, run the following statement to make the table use the new AvroSerDe:
ALTER TABLE my_avro_table SET SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe';
To use Avro with MapReduce, add the following Maven dependency to your project:
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-mapred</artifactId>
<version>1.7.6-cdh5.9.3</version>
<classifier>hadoop2</classifier>
</dependency>
Then write your program, using the Avro MapReduce javadoc for guidance.
At run time, include the avro and avro-mapred JARs in the HADOOP_CLASSPATH and the avro, avro-mapred and
paranamer JARs in -libjars.
To enable Snappy compression on output, call AvroJob.setOutputCodec(job, "snappy") when configuring the
job. You must also include the snappy-java JAR in -libjars.
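The following is a minimal sketch of that configuration using the org.apache.avro.mapred API; the record schema shown is a placeholder assumption, not a schema from this guide:

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.hadoop.mapred.JobConf;

public class AvroSnappySetup {
    public static void main(String[] args) {
        JobConf job = new JobConf();
        // Placeholder record schema; use the schema of your actual output records
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"long\"}]}");
        AvroJob.setOutputSchema(job, schema);
        // Compress the Avro output files with Snappy
        AvroJob.setOutputCodec(job, "snappy");
    }
}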
To use Avro with Pig, register the PiggyBank JAR and the Avro dependencies:
REGISTER piggybank.jar
REGISTER lib/avro-1.7.6.jar
REGISTER lib/json-simple-1.1.jar
REGISTER lib/snappy-java-1.0.4.1.jar
With store, Pig generates an Avro schema from the Pig schema. You can override the Avro schema by specifying it
literally as a parameter to AvroStorage or by using the same schema as an existing Avro data file. See the Pig wiki
for details.
To store two relations in one script, specify an index to each store function, for example, AvroStorage('index', '1') for the first relation and AvroStorage('index', '2') for the second.
For more information, see the Pig wiki. The version numbers of the JAR files to register are different on that page, so
adjust them as shown above.
To import data as Avro data files with Sqoop 1, use the following command-line option:
--as-avrodatafile
Sqoop 1 automatically generates an Avro schema that corresponds to the database table being imported.
To enable Snappy compression on the imported Avro data files, also specify:
--compression-codec snappy
Data Compression
Data compression and compression formats can have a significant impact on performance. Three important places to
consider data compression are in MapReduce and Spark jobs, data stored in HBase, and Impala queries. For the most
part, the principles are similar for each.
You must balance the processing capacity required to compress and uncompress the data, the disk IO required to read
and write the data, and the network bandwidth required to send the data across the network. The correct balance of
these factors depends upon the characteristics of your cluster and your data, as well as your usage patterns.
Compression is not recommended if your data is already compressed (such as images in JPEG format). In fact, the
resulting file can sometimes be larger than the original.
Compression Types
Hadoop supports the following compression types and codecs:
• gzip - org.apache.hadoop.io.compress.GzipCodec
• bzip2 - org.apache.hadoop.io.compress.BZip2Codec
• LZO - com.hadoop.compression.lzo.LzopCodec
• Snappy - org.apache.hadoop.io.compress.SnappyCodec
• Deflate - org.apache.hadoop.io.compress.DeflateCodec
Different file types and CDH components support different compression types. For details, see Using Apache Avro Data Files with CDH and Using Apache Parquet Data Files with CDH.
For guidelines on choosing compression types and configuring compression, see Choosing and Configuring Data
Compression.
Snappy Compression
Snappy is a compression/decompression library. It optimizes for very high-speed compression and decompression,
and moderate compression instead of maximum compression or compatibility with other compression libraries.
Snappy is supported for all CDH components. How you specify compression depends on the component.
Using Snappy with Hive
To enable Snappy compression for Hive output when creating SequenceFile outputs, use the following settings:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
For information about configuring Snappy compression for Parquet files with Hive, see Using Parquet Tables in Hive.
For information about using Snappy compression for Parquet files with Impala, see Snappy and GZip Compression for
Parquet Data Files in the Impala Guide.
Using Snappy with MapReduce
Enabling MapReduce intermediate compression can make jobs run faster without requiring application changes. Only
the temporary intermediate files created by Hadoop for the shuffle phase are compressed; the final output may or
may not be compressed. Snappy is ideal in this case because it compresses and decompresses very quickly compared
to other compression algorithms, such as Gzip. For information about choosing a compression format, see Choosing
and Configuring Data Compression.
To enable Snappy for MapReduce intermediate compression for the whole cluster, set the following properties in
mapred-site.xml:
• MRv1
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
• YARN
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
MRv1 property: mapred.output.compression.codec
YARN property: mapreduce.output.fileoutputformat.compress.codec
Description: If the final job outputs are to be compressed, the codec to use. Set to org.apache.hadoop.io.compress.SnappyCodec for Snappy compression.
Note: The MRv1 property names are also supported (but deprecated) in YARN. You do not need to
update them in this release.
Using Snappy with Spark SQL
To enable Snappy compression for Parquet files written by Spark SQL:
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
Using Snappy with Sqoop
• Sqoop 1 - On the command line, use the following option to enable Snappy compression:
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
Cloudera recommends using the --as-sequencefile option with this compression option.
• Sqoop 2 - When you create a job (sqoop:000> create job), choose 7 (SNAPPY) as the compression format.
External Documentation
Cloudera provides documentation for CDH as a whole, whether your CDH cluster is managed by Cloudera Manager or
not. In addition, you may find it useful to refer to documentation for the individual components included in CDH. Where
possible, these links point to the main documentation for a project, in the Cloudera release archive. This ensures that
you are looking at the correct documentation for the version of a project included in CDH. Otherwise, the links may
point to the project's main site.
• Apache Avro
• Apache Crunch
• Apache DataFu
• Apache Flume
• Apache Hadoop
• Apache HBase
• Apache Hive
• Hue
• Kite
• Apache Mahout
• Apache Oozie
• Apache Parquet
• Apache Pig
• Apache Sentry
• Apache Solr
• Apache Spark
• Apache Sqoop
• Apache Sqoop2
• Apache Whirr
• Apache ZooKeeper
Cloudera Manager 5 Overview
Terminology
To effectively use Cloudera Manager, you should first understand its terminology. The relationship between the terms
is illustrated below and their definitions follow:
Some of the terms, such as cluster and service, will be used without further explanation. Others, such as role group,
gateway, host template, and parcel are expanded upon in later sections.
A common point of confusion is the overloading of the terms service and role for both types and instances; Cloudera
Manager and this section sometimes use the same term for type and instance. For example, the Cloudera Manager
Admin Console Home > Status tab and Clusters > ClusterName menu list service instances. This is similar to the
practice in programming languages where for example the term "string" may indicate either a type (java.lang.String)
or an instance of that type ("hi there"). When it's necessary to distinguish between types and instances, the word
"type" is appended to indicate a type and the word "instance" is appended to explicitly indicate an instance.
deployment
A configuration of Cloudera Manager and all the clusters it manages.
cluster
• A set of computers or racks of computers that contains an HDFS filesystem and runs MapReduce and other
processes on that data. A pseudo-distributed cluster is a CDH installation run on a single machine and useful for
demonstrations and individual study.
• In Cloudera Manager, a logical entity that contains a set of hosts, a single version of CDH installed on the hosts,
and the service and role instances running on the hosts. A host can belong to only one cluster. Cloudera Manager
can manage multiple CDH clusters; however, each cluster can only be associated with a single Cloudera Manager
Server or Cloudera Manager HA pair.
host
In Cloudera Manager, a physical or virtual machine that runs role instances. A host can belong to only one cluster.
rack
In Cloudera Manager, a physical entity that contains a set of physical hosts typically served by the same switch.
service
• A Linux command that runs a System V init script in /etc/init.d/ in as predictable an environment as possible,
removing most environment variables and setting the current working directory to /.
• A category of managed functionality in Cloudera Manager, which may be distributed or not, running in a cluster.
Sometimes referred to as a service type. For example: MapReduce, HDFS, YARN, Spark, and Accumulo. In traditional
environments, multiple services run on one host; in distributed systems, a service runs on many hosts.
service instance
In Cloudera Manager, an instance of a service running on a cluster. For example: "HDFS-1" and "yarn". A service instance
spans many role instances.
role
In Cloudera Manager, a category of functionality within a service. For example, the HDFS service has the following
roles: NameNode, SecondaryNameNode, DataNode, and Balancer. Sometimes referred to as a role type. See also user
role.
role instance
In Cloudera Manager, an instance of a role running on a host. It typically maps to a Unix process. For example:
"NameNode-h1" and "DataNode-h1".
role group
In Cloudera Manager, a set of configuration properties for a set of role instances.
host template
A set of role groups in Cloudera Manager. When a template is applied to a host, a role instance from each role group
is created and assigned to that host.
gateway
A type of role that typically provides client access to specific cluster services. For example, HDFS, Hive, Kafka, MapReduce,
Solr, and Spark each have gateway roles to provide access for their clients to their respective services. Gateway roles
do not always have "gateway" in their names, nor are they exclusively for client access. For example, Hue Kerberos
Ticket Renewer is a gateway role that proxies tickets from Kerberos.
The node supporting one or more gateway roles is sometimes referred to as the gateway node or edge node, with
the notion of "edge" common in network or cloud environments. In terms of the Cloudera cluster, the gateway nodes
in the cluster receive the appropriate client configuration files when Deploy Client Configuration is selected from the
Actions menu in Cloudera Manager Admin Console.
parcel
A binary distribution format that contains compiled code and meta-information such as a package description, version,
and dependencies.
The host tcdn501-1 is the "master" host for the cluster, so it has many more role instances, 21, compared with the 7
role instances running on the other hosts. In addition to the CDH "master" role instances, tcdn501-1 also has Cloudera
Management Service roles.
Architecture
As depicted below, the heart of Cloudera Manager is the Cloudera Manager Server. The Server hosts the Admin Console
Web Server and the application logic, and is responsible for installing software, configuring, starting, and stopping
services, and managing the cluster on which the services run.
Heartbeating
Heartbeats are a primary communication mechanism in Cloudera Manager. By default, Agents send heartbeats every 15 seconds to the Cloudera Manager Server. However, to reduce user latency, the frequency is increased when state is changing.
During the heartbeat exchange, the Agent notifies the Cloudera Manager Server of its activities. In turn the Cloudera
Manager Server responds with the actions the Agent should be performing. Both the Agent and the Cloudera Manager
Server end up doing some reconciliation. For example, if you start a service, the Agent attempts to start the relevant
processes; if a process fails to start, the Cloudera Manager Server marks the start command as having failed.
State Management
The Cloudera Manager Server maintains the state of the cluster. This state can be divided into two categories: "model"
and "runtime", both of which are stored in the Cloudera Manager Server database.
Cloudera Manager models CDH and managed services: their roles, configurations, and inter-dependencies. Model state
captures what is supposed to run where, and with what configurations. For example, model state captures the fact
that a cluster contains 17 hosts, each of which is supposed to run a DataNode. You interact with the model through
the Cloudera Manager Admin Console configuration screens and API and operations such as "Add Service".
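Model state can also be read programmatically through the Cloudera Manager REST API. The following is a minimal sketch that retrieves a service's configuration; the host, credentials, API version, and cluster and service names are all placeholder assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class CmApiExample {
    public static void main(String[] args) throws Exception {
        // cm-host, admin/admin, v10, "Cluster 1", and hdfs are placeholders
        URL url = new URL(
            "http://cm-host:7180/api/v10/clusters/Cluster%201/services/hdfs/config");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
            .encodeToString("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON list of configuration properties
            }
        }
    }
}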
Runtime state is what processes are running where, and what commands (for example, rebalance HDFS or run a
Backup/Disaster Recovery schedule or rolling restart or stop) are currently running. The runtime state includes the
exact configuration files needed to run a process. When you select Start in the Cloudera Manager Admin Console, the
server gathers up all the configuration for the relevant services and roles, validates it, generates the configuration files,
and stores them in the database.
When you update a configuration (for example, the Hue Server web port), you have updated the model state. However,
if Hue is running while you do this, it is still using the old port. When this kind of mismatch occurs, the role is marked
as having an "outdated configuration". To resynchronize, you restart the role (which triggers the configuration
re-generation and process restart).
While Cloudera Manager models all of the reasonable configurations, some cases inevitably require special handling. To allow you to work around, for example, a bug, or to explore unsupported options, Cloudera Manager supports an "advanced configuration snippet" mechanism that lets you add properties directly to the configuration files.
Configuration Management
Cloudera Manager defines configuration at several levels:
• The service level may define configurations that apply to the entire service instance, such as an HDFS service's
default replication factor (dfs.replication).
• The role group level may define configurations that apply to the member roles, such as the DataNodes' handler
count (dfs.datanode.handler.count). This can be set differently for different groups of DataNodes. For
example, DataNodes running on more capable hardware may have more handlers.
• The role instance level may override configurations that it inherits from its role group. This should be used sparingly,
because it easily leads to configuration divergence within the role group. One example usage is to temporarily
enable debug logging in a specific role instance to troubleshoot an issue.
• Hosts have configurations related to monitoring, software management, and resource management.
• Cloudera Manager itself has configurations related to its own administrative operations.
Role Groups
You can set configuration at the service instance (for example, HDFS) or role instance (for example, the DataNode on
host17). An individual role inherits the configurations set at the service level. Configurations made at the role level
override those inherited from the service level. While this approach offers flexibility, configuring a set of role instances
in the same way can be tedious.
Cloudera Manager supports role groups, a mechanism for assigning configurations to a group of role instances. The
members of those groups then inherit those configurations. For example, in a cluster with heterogeneous hardware,
a DataNode role group can be created for each host type and the DataNodes running on those hosts can be assigned
to their corresponding role group. That makes it possible to set the configuration for all the DataNodes running on the
same hardware by modifying the configuration of one role group. The HDFS service discussed earlier has role groups defined for each of the service's roles.
In addition to making it easy to manage the configuration of subsets of roles, role groups also make it possible to
maintain different configurations for experimentation or managing shared clusters for different users or workloads.
Host Templates
In typical environments, sets of hosts have the same hardware and the same set of services running on them. A host
template defines a set of role groups (at most one of each type) in a cluster and provides two main benefits:
• Adding new hosts to clusters easily - multiple hosts can have roles from different services created, configured,
and started in a single operation.
• Altering the configuration of roles from different services on a set of hosts easily - which is useful for quickly
switching the configuration of an entire cluster to accommodate different workloads or users.
The HDFS role instances (for example, NameNode and DataNode) obtain their configurations from a private
per-process directory, under /var/run/cloudera-scm-agent/process/unique-process-name. Giving each process
its own private execution and configuration environment allows Cloudera Manager to control each process
independently. For example, here are the contents of an example 879-hdfs-NAMENODE process directory:
$ tree -a /var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/
/var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/
cloudera_manager_agent_fencer.py
cloudera_manager_agent_fencer_secret_key.txt
cloudera-monitor.properties
core-site.xml
dfs_hosts_allow.txt
dfs_hosts_exclude.txt
event-filter-rules.json
hadoop-metrics2.properties
hdfs.keytab
hdfs-site.xml
log4j.properties
logs
stderr.log
stdout.log
topology.map
topology.py
Process Management
In a non-Cloudera Manager managed cluster, you most likely start a role instance process using an init script, for
example, service hadoop-hdfs-datanode start. Cloudera Manager does not use init scripts for the daemons
it manages; in a Cloudera Manager managed cluster, starting and stopping services using init scripts will not work.
In a Cloudera Manager managed cluster, you can only start or stop role instance processes using Cloudera Manager.
Cloudera Manager uses an open source process management tool called supervisord, which starts processes, takes
care of redirecting log files, notifying of process failure, setting the effective user ID of the calling process to the right
user, and so on. Cloudera Manager supports automatically restarting a crashed process. It will also flag a role instance
with a bad health flag if its process crashes repeatedly right after start up.
Stopping the Cloudera Manager Server and the Cloudera Manager Agents will not bring down your services; any running
role instances keep running.
The Agent is started by init.d at start-up. It, in turn, contacts the Cloudera Manager Server and determines which
processes should be running. The Agent is monitored as part of Cloudera Manager's host monitoring. If the Agent stops
heartbeating, the host is marked as having bad health.
One of the Agent's main responsibilities is to start and stop processes. When the Agent detects a new process from
the Server heartbeat, the Agent creates a directory for it in /var/run/cloudera-scm-agent and unpacks the
configuration. It then contacts supervisord, which starts the process.
These actions reinforce an important point: a Cloudera Manager process never travels alone. In other words, a process
is more than just the arguments to exec()—it also includes configuration files, directories that need to be created,
and other information.
Parcels
Parcels offer the following advantages over packages:
• Distribution of CDH as a single object - Instead of having a separate package for each part of CDH, parcels have
just a single object to install. This makes it easier to distribute software to a cluster that is not connected to the
Internet.
• Internal consistency - All CDH components are matched, eliminating the possibility of installing parts from different
versions of CDH.
• Installation outside of /usr - In some environments, Hadoop administrators do not have privileges to install
system packages. These administrators needed to use CDH tarballs, which do not provide the infrastructure that
packages do. With parcels, administrators can install to /opt, or anywhere else, without completing the additional
manual steps of regular tarballs.
Note: With parcels, the path to the CDH libraries is /opt/cloudera/parcels/CDH/lib instead
of the usual /usr/lib. Do not link /usr/lib/ elements to parcel-deployed paths, because the
links may cause scripts that distinguish between the two paths to not work.
• Installation of CDH without sudo - Parcel installation is handled by the Cloudera Manager Agent running as root
or another user, so you can install CDH without sudo.
• Decoupled distribution from activation - With side-by-side install capabilities, you can stage a new version of
CDH across the cluster before switching to it. This allows the most time-consuming part of an upgrade to be done
ahead of time without affecting cluster operations, thereby reducing downtime.
• Rolling upgrades - Packages require you to shut down the old process, upgrade the package, and then start the
new process. Any errors in the process can be difficult to recover from, and upgrading requires extensive integration
with the package management system to function seamlessly. With parcels, when a new version is staged
side-by-side, you can switch to a new minor version by simply changing which version of CDH is used when
restarting each process. You can then perform upgrades with rolling restarts, in which service roles are restarted
in the correct order to switch to the new version with minimal service interruption. Your cluster can continue to
run on the existing installed components while you stage a new version across your cluster, without impacting
your current operations. Major version upgrades (for example, CDH 4 to CDH 5) require full service restarts because
of substantial changes between the versions. Finally, you can upgrade individual parcels or multiple parcels at the
same time.
• Upgrade management - Cloudera Manager manages all the steps in a CDH version upgrade. With packages,
Cloudera Manager only helps with initial installation.
• Additional components - Parcels are not limited to CDH. Impala, Cloudera Search, LZO, Apache Kafka, and add-on
service parcels are also available.
• Compatibility with other distribution tools - Cloudera Manager works with other tools you use for download and
distribution. For example, you can use Puppet. Or, you can download the parcel to Cloudera Manager Server
manually if your cluster has no Internet connectivity and then have Cloudera Manager distribute the parcel to the
cluster.
Host Management
Cloudera Manager provides several features to manage the hosts in your Hadoop clusters. The first time you run the Cloudera Manager Admin Console, you can search for hosts to add to the cluster; once the hosts are selected, you can map the assignment of CDH roles to hosts. Cloudera Manager automatically deploys to the hosts all software required to participate as a managed host in a cluster: JDK, Cloudera Manager Agent, CDH, Impala, Solr, and so on.
Once the services are deployed and running, the Hosts area within the Admin Console shows the overall status of the
managed hosts in your cluster. The information provided includes the version of CDH running on the host, the cluster
to which the host belongs, and the number of roles running on the host. Cloudera Manager provides operations to
manage the lifecycle of the participating hosts and to add and delete hosts. The Cloudera Management Service Host
Monitor role performs health tests and collects host metrics to allow you to monitor the health and performance of
the hosts.
Resource Management
Resource management helps ensure predictable behavior by defining the impact of different services on cluster
resources. Use resource management to:
• Guarantee completion in a reasonable time frame for critical workloads.
• Support reasonable cluster scheduling between groups of users based on fair allocation of resources per group.
• Prevent users from depriving other users access to the cluster.
With Cloudera Manager 5, statically allocating resources using cgroups is configurable through a single static service
pool wizard. You allocate services as a percentage of total resources, and the wizard configures the cgroups.
Static service pools isolate the services in your cluster from one another, so that load on one service has a bounded
impact on other services. Services are allocated a static percentage of total resources—CPU, memory, and I/O
weight—which are not shared with other services. When you configure static service pools, Cloudera Manager computes
recommended memory, CPU, and I/O configurations for the worker roles of the services that correspond to the
percentage assigned to each service. Static service pools are implemented per role group within a cluster, using Linux
control groups (cgroups) and cooperative memory limits (for example, Java maximum heap sizes). Static service pools
can be used to control access to resources by HBase, HDFS, Impala, MapReduce, Solr, Spark, YARN, and add-on services.
Static service pools are not enabled by default.
For example, the following figure illustrates static pools for HBase, HDFS, Impala, and YARN services that are respectively
assigned 20%, 30%, 20%, and 30% of cluster resources.
You can dynamically apportion resources that are statically allocated to YARN and Impala by using dynamic resource
pools.
Depending on the version of CDH you are using, dynamic resource pools in Cloudera Manager support the following
scenarios:
• YARN (CDH 5) - YARN manages the virtual cores, memory, running applications, and scheduling policy for each
pool. In the preceding diagram, three dynamic resource pools—Dev, Product, and Mktg with weights 3, 2, and 1
respectively—are defined for YARN. If an application starts and is assigned to the Product pool, and other
applications are using the Dev and Mktg pools, the Product resource pool receives 30% x 2/6 (or 10%) of the total
cluster resources. If no applications are using the Dev and Mktg pools, the YARN Product pool is allocated 30% of
the cluster resources.
• Impala (CDH 5 and CDH 4) - Impala manages memory for pools running queries and limits the number of running
and queued queries in each pool.
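As a quick arithmetic check of the YARN example above, the share each busy pool receives is its weight's fraction of YARN's static allocation. A minimal sketch (illustrative only; the weights and the 30% figure come from the example):

# Weighted fair-share arithmetic from the YARN example above (illustrative only).
yarn_static_share = 0.30                       # YARN's static service pool: 30%
weights = {"Dev": 3, "Product": 2, "Mktg": 1}  # dynamic resource pool weights

total_weight = sum(weights.values())
for pool, w in weights.items():
    share = yarn_static_share * w / total_weight
    print(f"{pool}: {share:.1%} of total cluster resources")
# Product: 30% x 2/6 = 10%, matching the example. If Dev and Mktg are idle,
# Product can use YARN's full 30%.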
User Management
Access to Cloudera Manager features is controlled by user accounts. A user account identifies how a user is authenticated
and determines what privileges are granted to the user.
Cloudera Manager provides several mechanisms for authenticating users. You can configure Cloudera Manager to
authenticate users against the Cloudera Manager database or against an external authentication service. The external
authentication service can be an LDAP server (Active Directory or an OpenLDAP compatible directory), or you can
specify another external service. Cloudera Manager also supports using the Security Assertion Markup Language (SAML)
to enable single sign-on.
For information about the privileges associated with each of the Cloudera Manager user roles, see Cloudera Manager
User Roles.
Security Management
Cloudera Manager strives to consolidate security configurations across several projects.
Authentication
The purpose of authentication in Hadoop, as in other systems, is simply to prove that a user or service is who he or
she claims to be.
Typically, authentication in enterprises is managed through a single distributed system, such as a Lightweight Directory
Access Protocol (LDAP) directory. LDAP authentication consists of straightforward username/password services backed
by a variety of storage systems, ranging from file to database.
A common enterprise-grade authentication system is Kerberos. Kerberos provides strong security benefits including
capabilities that render intercepted authentication packets unusable by an attacker. It virtually eliminates the threat
of impersonation by never sending a user's credentials in cleartext over the network.
Several components of the Hadoop ecosystem are converging to use Kerberos authentication with the option to manage
and store credentials in LDAP or AD. For example, Microsoft's Active Directory (AD) is an LDAP directory that also
provides Kerberos authentication for added security.
Authorization
Authorization is concerned with who or what has access or control over a given resource or service. Because Hadoop merges the capabilities of multiple, varied, and previously separate IT systems into an enterprise data hub that stores and works on all data within an organization, it requires multiple authorization controls with varying granularities. In such cases, Hadoop management tools simplify setup and maintenance by:
• Tying all users to groups, which can be specified in existing LDAP or AD directories.
• Providing role-based access control for similar interaction methods, like batch and interactive SQL queries. For
example, Apache Sentry permissions apply to Hive (HiveServer2) and Impala.
CDH currently provides the following forms of access control:
• Traditional POSIX-style permissions for directories and files, where each directory and file is assigned a single
owner and group. Each assignment has a basic set of permissions available; file permissions are simply read, write,
and execute, and directories have an additional permission to determine access to child directories.
• Extended Access Control Lists (ACLs) for HDFS that provide fine-grained control of permissions for HDFS files by
allowing you to set different permissions for specific named users or named groups.
• Apache HBase uses ACLs to authorize various operations (READ, WRITE, CREATE, ADMIN) by column, column family, and column family qualifier. HBase ACLs can be granted to and revoked from both users and groups.
• Role-based access control with Apache Sentry.
Encryption
The goal of encryption is to ensure that only authorized users can view, use, or contribute to a data set. These security
controls add another layer of protection against potential threats by end-users, administrators, and other malicious
actors on the network. Data protection can be applied at a number of levels within Hadoop:
• OS Filesystem-level - Encryption can be applied at the Linux operating system filesystem level to cover all files in
a volume. An example of this approach is Cloudera Navigator Encrypt (formerly Gazzang zNcrypt) which is available
for Cloudera customers licensed for Cloudera Navigator. Navigator Encrypt operates at the Linux volume level, so
it can encrypt cluster data inside and outside HDFS, such as temp/spill files, configuration files and metadata
databases (to be used only for data related to a CDH cluster). Navigator Encrypt must be used with Cloudera
Navigator Key Trustee Server (formerly Gazzang zTrustee).
CDH components, such as Impala, MapReduce, YARN, or HBase, also have the ability to encrypt data that lives
temporarily on the local filesystem outside HDFS. To enable this feature, see Configuring Encryption for Data Spills.
• Network-level - Encryption can be applied to encrypt data just before it gets sent across a network and to decrypt
it just after receipt. In Hadoop, this means coverage for data sent from client user interfaces as well as
service-to-service communication like remote procedure calls (RPCs). This protection uses industry-standard
protocols such as TLS/SSL.
Note: Cloudera Manager and CDH components support either TLS 1.0, TLS 1.1, or TLS 1.2, but
not SSL 3.0. References to SSL continue only because of its widespread use in technical jargon.
• HDFS-level - Encryption applied by the HDFS client software. HDFS Transparent Encryption operates at the HDFS
folder level, allowing you to encrypt some folders and leave others unencrypted. HDFS transparent encryption
cannot encrypt any data outside HDFS. To ensure reliable key storage (so that data is not lost), use Cloudera
Navigator Key Trustee Server; the default Java keystore can be used for test purposes. For more information, see
Enabling HDFS Encryption Using Cloudera Navigator Key Trustee Server.
Unlike OS and network-level encryption, HDFS transparent encryption is end-to-end. That is, it protects data at
rest and in transit, which makes it more efficient than implementing a combination of OS-level and network-level
encryption.
Health Tests
Cloudera Manager monitors the health of the services, roles, and hosts that are running in your clusters using health
tests. The Cloudera Management Service also provides health tests for its roles. Role-based health tests are enabled
by default. For example, a simple health test is whether there's enough disk space in every NameNode data directory. A more complicated health test may evaluate when the last HDFS checkpoint occurred relative to a threshold, or whether a DataNode is connected to a NameNode. Some of these health tests also aggregate other health tests: in a distributed system like HDFS, it's normal to have a few DataNodes down (assuming you've got dozens of hosts), so Cloudera Manager lets you set thresholds on what percentage of hosts should color the entire service down.
Health tests can return one of three values: Good, Concerning, and Bad. A test returns Concerning health if the test
falls below a warning threshold. A test returns Bad if the test falls below a critical threshold. The overall health of a
service or role instance is a roll-up of its health tests. If any health test is Concerning (but none are Bad) the role's or
service's health is Concerning; if any health test is Bad, the service's or role's health is Bad.
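The roll-up rule amounts to taking the worst result among all tests. A minimal sketch (illustrative only, not Cloudera code):

# Illustration of the health roll-up rule described above (not Cloudera code).
SEVERITY = {"Good": 0, "Concerning": 1, "Bad": 2}

def roll_up(test_results):
    """Overall health is the worst result among a service's or role's tests."""
    return max(test_results, key=lambda result: SEVERITY[result])

print(roll_up(["Good", "Good", "Concerning"]))  # Concerning
print(roll_up(["Good", "Bad", "Concerning"]))   # Bad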
In the Cloudera Manager Admin Console, health test results are indicated with colors: Good (green), Concerning (yellow), and Bad (red).
One common question is whether monitoring can be separated from configuration. One of the goals for monitoring is to enable it without additional configuration or the installation of additional tools (for example, Nagios). By having a deep model of the configuration, Cloudera Manager is able to know which directories to monitor, which ports to use, and what credentials to use for those ports. This tight coupling means that, when you install Cloudera Manager, all the monitoring is enabled.
Cloudera Manager Admin Console
The Cloudera Manager Admin Console top navigation bar provides the following tabs and menus:
• Clusters > cluster_name
– Services - Display individual services, and the Cloudera Management Service. In these pages you can:
– View the status and other details of a service instance or the role instances associated with the service
– Make configuration changes to a service instance, a role, or a specific role instance
– Add and delete a service or role
– Stop, start, or restart a service or role.
– View the commands that have been run for a service or a role
– View an audit event history
– Deploy and download client configurations
– Decommission and recommission role instances
– Enter or exit maintenance mode
– Perform actions unique to a specific type of service. For example:
– Enable HDFS high availability or NameNode federation
– Run the HDFS Balancer
– Create HBase, Hive, and Sqoop directories
• Hosts - Display the hosts managed by Cloudera Manager. In this page you can:
– View the status and a variety of detail metrics about individual hosts
– Make configuration changes for host monitoring
– View all the processes running on a host
– Run the Host Inspector
– Add and delete hosts
– Create and manage host templates
– Manage parcels
– Decommission and recommission hosts
– Make rack assignments
– Run the host upgrade wizard
• Diagnostics - Review logs, events, and alerts to diagnose problems. The subpages are:
– Events - Search for and display events and alerts that have occurred.
– Logs - Search logs by service, role, host, and search phrase as well as log level (severity).
– Server Log - Display the Cloudera Manager Server log.
• Audits - Query and filter audit events, including logins, across clusters.
• Charts - Query for metrics of interest, display them as charts, and display personalized chart dashboards.
• Backup - Manage replication schedules and snapshot policies.
• Administration - Administer Cloudera Manager. The subpages are:
– Settings - Configure Cloudera Manager.
– Alerts - Display when alerts will be generated, configure alert recipients, and send test alert email.
– Users - Manage Cloudera Manager users and user sessions.
– Kerberos - Generate Kerberos credentials and inspect hosts.
– License - Manage Cloudera licenses.
– Language - Set the language used for the content of activity events, health events, and alert email messages.
– Peers - Connect multiple instances of Cloudera Manager.
• Parcel Icon - link to the Hosts > Parcels page.
• Running Commands Indicator - displays the number of commands currently running for all services or roles.
• Search - Supports searching for services, roles, hosts, configuration properties, and commands. You can enter a partial string, and a drop-down list displays up to sixteen matching entities.
• Support - Displays various support actions. The subcommands are:
– Send Diagnostic Data - Sends data to Cloudera Support to support troubleshooting.
– Support Portal (Cloudera Enterprise) - Displays the Cloudera Support portal.
– Mailing List (Cloudera Express) - Displays the Cloudera Manager Users list.
– Scheduled Diagnostics: Weekly - Configure the frequency of automatically collecting diagnostic data and sending it to Cloudera Support.
– The following links open the latest documentation on the Cloudera web site:
– Help
– Installation Guide
– API Documentation
– Release Notes
– About - Version number and build details of Cloudera Manager and the current date and time stamp of the
Cloudera Manager server.
• Logged-in User Menu - The currently logged-in user. The subcommands are:
– Change Password - Change the password of the currently logged in user.
– Logout
Note: You can configure the Cloudera Manager Admin Console to automatically log out a user after
a configurable period of time. See Automatic Logout on page 56.
You can also go to the Home > Status tab by clicking the Cloudera Manager logo in the top navigation bar.
Status
The Status tab contains:
• Clusters - The clusters being managed by Cloudera Manager. Each cluster is displayed either in summary form or
in full form depending on the configuration of the Administration > Settings > Other > Maximum Cluster Count
Shown In Full property. When the number of clusters exceeds the value of the property, only cluster summary
information displays.
– Summary Form - A list of links to cluster status pages. Click Customize to jump to the Administration >
Settings > Other > Maximum Cluster Count Shown In Full property.
– Full Form - A separate section for each cluster containing a link to the cluster status page and a table containing
links to the Hosts page and the status pages of the services running in the cluster.
Each service row in the table has a menu of actions that you select by clicking the actions menu, and displays an indicator for each type of outstanding issue:
Health issue - Click the indicator to display the Health Issues pop-up dialog box. By default, only Bad health test results are shown in the dialog box. To display Concerning health test results, click the Also show n concerning issue(s) link. Click a link to display the Status page containing details about the health test result.
Configuration issue - Indicates that the service has at least one configuration issue. The indicator shows the number of configuration issues at the highest severity level. If there are configuration errors, the indicator is red. If there are no errors but configuration warnings exist, the indicator is yellow. No indicator is shown if there are no configuration notifications. Click the indicator to display the Configuration Issues pop-up dialog box. By default, only notifications at the Error severity level are listed, grouped by service name. To display Warning notifications, click the Also show n warning(s) link. Click the message associated with an error or warning to go to the configuration property for which the notification was issued, where you can address the issue. See Managing Services.
Restart Needed / Refresh Needed - Indicates that at least one of a service's roles is running with a configuration that does not match the current configuration settings in Cloudera Manager. Click the indicator to display the Stale Configurations page. To bring the cluster up-to-date, click the Refresh or Restart button on the Stale Configurations page, or follow the instructions in Refreshing a Cluster, Restarting a Cluster, or Restarting Services and Instances after Configuration Changes.
Client configuration redeployment required - Indicates that the client configuration for a service should be redeployed. Click the indicator to display the Stale Configurations page. To bring the cluster up-to-date, click the Deploy Client Configuration button on the Stale Configurations page, or follow the instructions in Manually Redeploying Client Configuration Files.
– Cloudera Management Service - A table containing a link to the Cloudera Manager Service. The Cloudera Manager Service row has a menu of actions that you select by clicking the actions menu.
– Charts - A set of charts (dashboard) that summarize resource utilization (IO, CPU usage) and processing
metrics.
Click a line, stack area, scatter, or bar chart to expand it into a full-page view with a legend for the individual charted entities as well as more fine-grained axis divisions.
By default, the time scale of a dashboard is 30 minutes. To change the time scale, click a duration link at the top-right of the dashboard.
To set the dashboard type, click the dashboard menu icon and select one of the following:
• Custom - displays a custom dashboard.
• Default - displays a default dashboard.
• Reset - resets the custom dashboard to the predefined set of charts, discarding any customizations.
Note: You can configure the Cloudera Manager Admin Console to automatically log out a user after
a configurable period of time. See Automatic Logout on page 56.
Automatic Logout
For security purposes, Cloudera Manager automatically logs out a user session after 30 minutes. You can change this
session logout period.
To configure the timeout period:
1. Click Administration > Settings.
2. Click Category > Security.
3. Edit the Session Timeout property.
4. Click Save Changes to commit the changes.
When the timeout is one minute from triggering, the user sees a warning message. If the user does not click the mouse or press a key, the user is logged out of the session and a logout message appears.
Cloudera Manager API
The Cloudera Manager API is an HTTP REST API served on the same host and port as the Cloudera Manager Admin Console on page 51, and does not require an extra process or extra configuration. The API
supports HTTP Basic Authentication, accepting the same users and credentials as the Cloudera Manager Admin Console.
Resources
• Quick Start
• Cloudera Manager API tutorial
• Cloudera Manager API documentation
• Python client
• Using the Cloudera Manager API for Cluster Automation on page 61
Obtaining Configuration Files
1. Obtain the list of a service's roles:
http://cm_server_host:7180/api/v14/clusters/clusterName/services/serviceName/roles
2. Obtain the list of configuration files used by a role:
http://cm_server_host:7180/api/v14/clusters/clusterName/services/serviceName/roles/roleName/process
3. Obtain the content of any particular configuration file:
http://cm_server_host:7180/api/v14/clusters/clusterName/services/serviceName/roles/roleName/process/
configFiles/configFileName
For example:
http://cm_server_host:7180/api/v14/clusters/Cluster%201/services/OOZIE-1/roles/
OOZIE-1-OOZIE_SERVER-e121641328fcb107999f2b5fd856880d/process/configFiles/oozie-site.xml
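The same sequence can be scripted. The following is a minimal sketch in Python using the requests library; the host, credentials, cluster, service, and file names are placeholders, not values from your deployment:

# Minimal sketch: fetch a role's generated configuration file via the API.
# Host, credentials, and names below are placeholders.
import requests

base = "http://cm_server_host:7180/api/v14"
auth = ("admin_uname", "admin_pass")

# 1. List the roles of a service; the API wraps results in an "items" array.
roles = requests.get(base + "/clusters/Cluster%201/services/OOZIE-1/roles",
                     auth=auth).json()["items"]

# 2-3. Fetch a generated configuration file for the first role.
role_name = roles[0]["name"]
cfg = requests.get(base + "/clusters/Cluster%201/services/OOZIE-1/roles/"
                   + role_name + "/process/configFiles/oozie-site.xml", auth=auth)
print(cfg.text)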
To look up the API name of a configuration property (rather than its display name), retrieve the full configuration view of the service:
http://cm_server_host:7180/api/v14/clusters/Cluster%201/services/service_name/config?view=FULL
Search the results for the display name of the desired property. For example, a search for the display name HDFS Service Environment Advanced Configuration Snippet (Safety Valve) shows that the corresponding property name is hdfs_service_env_safety_valve:
{
  "name" : "hdfs_service_env_safety_valve",
  "required" : false,
  "displayName" : "HDFS Service Environment Advanced Configuration Snippet (Safety Valve)",
  "description" : "For advanced use only, key/value pairs (one on each line) to be inserted into a role's environment. Applies to configurations of all roles in this service except client configuration.",
  "relatedName" : "",
  "validationState" : "OK"
}
Similar to finding service properties, you can also find host properties. First, get the host IDs for a cluster with the URL:
http://cm_server_host:7180/api/v14/hosts
{
"hostId" : "2c2e951c-aaf2-4780-a69f-0382181f1821",
"ipAddress" : "10.30.195.116",
"hostname" : "cm_server_host",
"rackId" : "/default",
"hostUrl" :
"http://cm_server_host:7180/cmf/hostRedirect/2c2e951c-adf2-4780-a69f-0382181f1821",
"maintenanceMode" : false,
"maintenanceOwners" : [ ],
"commissionState" : "COMMISSIONED",
"numCores" : 4,
"totalPhysMemBytes" : 10371174400
}
Then obtain the host properties by including one of the returned host IDs in the URL:
http://cm_server_host:7180/api/v14/hosts/2c2e951c-adf2-4780-a69f-0382181f1821?view=FULL
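You can export the entire Cloudera Manager configuration, including all deployed services, roles, and configuration settings, as a single JSON document. A minimal sketch in Python, assuming the API's /cm/deployment endpoint (the placeholder names match the list that follows):

# Minimal sketch: export the Cloudera Manager configuration to a file.
# admin_uname, admin_pass, cm_server_host, and path_to_file are placeholders.
import requests

resp = requests.get("http://cm_server_host:7180/api/v14/cm/deployment",
                    auth=("admin_uname", "admin_pass"))
resp.raise_for_status()
with open("path_to_file", "w") as f:
    f.write(resp.text)  # the full deployment description, as JSON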
Where:
• admin_uname is a username with either the Full Administrator or Cluster Administrator role.
• admin_pass is the password for the admin_uname username.
• cm_server_host is the hostname of the Cloudera Manager server.
• path_to_file is the path to the file where you want to save the configuration.
To redact sensitive information such as passwords from the exported configuration, add the following property to the Cloudera Manager Server Java options:
-Dcom.cloudera.api.redaction=true
Important: If you configure this redaction, you cannot use an exported configuration to restore the configuration of your cluster, due to the redacted information.
Important: This feature is available only with a Cloudera Enterprise license. It is not available in
Cloudera Express. For information on Cloudera Enterprise licenses, see Managing Licenses.
Using a previously saved JSON document that contains the Cloudera Manager configuration data, you can restore that
configuration to a running cluster.
1. Using the Cloudera Manager Administration Console, stop all running services in your cluster:
a. On the Home > Status tab, click the actions menu to the right of the cluster name and select Stop.
Warning: If you do not stop the cluster before making this API call, the API call will stop all cluster
services before running the job. Any running jobs and data are lost.
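The restore itself is an HTTP PUT of the previously saved JSON document back to the deployment endpoint. A minimal sketch in Python, assuming the same /cm/deployment endpoint and its deleteCurrentDeployment parameter (the placeholder names match the list that follows):

# Minimal sketch: restore a saved Cloudera Manager configuration.
# admin_uname, admin_pass, cm_server_host, and path_to_file are placeholders.
import requests

with open("path_to_file") as f:
    saved_config = f.read()
resp = requests.put("http://cm_server_host:7180/api/v14/cm/deployment",
                    params={"deleteCurrentDeployment": "true"},
                    headers={"Content-Type": "application/json"},
                    data=saved_config,
                    auth=("admin_uname", "admin_pass"))
resp.raise_for_status()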
Where:
• admin_uname is a username with either the Full Administrator or Cluster Administrator role.
• admin_pass is the password for the admin_uname username.
• cm_server_host is the hostname of the Cloudera Manager server.
• path_to_file is the path to the file containing the JSON configuration file.
To use the Java client, add this dependency to your project's pom.xml:
<project>
<repositories>
<repository>
<id>cdh.repo</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
<name>Cloudera Repository</name>
</repository>
…
</repositories>
<dependencies>
<dependency>
<groupId>com.cloudera.api</groupId>
<artifactId>cloudera-manager-api</artifactId>
<version>4.6.2</version> <!-- Set to the version of Cloudera Manager you use
-->
</dependency>
…
</dependencies>
...
</project>
The Java client works like a proxy. It hides from the caller any details about REST, HTTP, and JSON. The entry point is a handle to the root of the API, obtained from a client builder (shown at the top of the example below). From the root, you can traverse down to all other resources. (It's called "v14" because that is the current Cloudera Manager API version, but the same builder will also return a root from an earlier version of the API.) The tree view shows some key resources and supported operations:
• RootResourceV14
– ClustersResourceV14 - host membership, start cluster
– ServicesResourceV14 - configuration, get metrics, HA, service commands
– RolesResource - add roles, get metrics, logs
– RoleConfigGroupsResource - configuration
– ParcelsResource - parcel management
// Construct a handle to the API root (host and credentials are placeholders).
ApiRootResource root = new ClouderaManagerClientBuilder()
    .withHost("cm.cloudera.com").withUsernamePassword("admin", "admin").build();
RootResourceV14 apiRoot = root.getRootV14();

// List of clusters
ApiClusterList clusters = apiRoot.getClustersResource().readClusters(DataView.SUMMARY);
for (ApiCluster cluster : clusters) {
  LOG.info("{}: {}", cluster.getName(), cluster.getVersion());
}
Python Example
You can see an example of automation with Python at the following link: Python example. The example contains
information on the requirements and steps to automate a cluster deployment.
General Questions
What are the differences between the Cloudera Express and the Cloudera Enterprise versions of Cloudera Manager?
Cloudera Express includes a free version of Cloudera Manager. The Cloudera Enterprise version of Cloudera Manager
provides additional functionality. Both the Cloudera Express and Cloudera Enterprise versions automate the installation,
configuration, and monitoring of CDH 5 on an entire cluster. See the data sheet at Cloudera Enterprise Datasheet for
a comparison of the two versions.
The Cloudera Enterprise version of Cloudera Manager is available as part of the Cloudera Enterprise subscription
offering, and requires a license. You can also choose a Cloudera Enterprise Enterprise Data Hub Edition Trial that is
valid for 60 days.
If you are not an existing Cloudera customer, contact Cloudera Sales using this form or call 866-843-7207 to obtain a
Cloudera Enterprise license. If you are already a Cloudera customer and you need to upgrade from Cloudera Express
to Cloudera Enterprise, contact Cloudera Support to obtain a license.
Warning:
• Cloudera Manager 4 and CDH 4 reached End of Maintenance (EOM) on August 9, 2015. Cloudera does not support or provide updates for Cloudera Manager 4 and CDH 4 releases.
• Cloudera Manager 3 and CDH 3 reached End of Maintenance (EOM) on June 20, 2013. Cloudera does not support or provide updates for Cloudera Manager 3 and CDH 3 releases.
Where are CDH libraries located when I distribute CDH using parcels?
With parcel software distribution, the path to the CDH libraries is /opt/cloudera/parcels/CDH/lib/ instead of
the usual /usr/lib/.
What upgrade paths are available for Cloudera Manager, and what's involved?
For instructions about upgrading, see Cloudera Upgrade.
Do worker hosts need access to the Cloudera public repositories for an install with Cloudera Manager?
You can perform an installation or upgrade using the parcel format; when using parcels, only the Cloudera Manager Server requires access to the Cloudera public repositories. Distribution of the parcels to worker hosts is done between the Cloudera Manager Server and the worker hosts. See Parcels for more information. If you want to install using the traditional packages, hosts only require access to the installation files.
For both parcels and packages, it is also possible to create local repositories that serve these files to the hosts that are
being upgraded. If you have established local repositories, no access to the Cloudera public repository is required. For
more information, see Creating and Using a Package Repository for Cloudera Manager.
Can I use the service monitoring features of Cloudera Manager without the Cloudera Management Service?
No. To understand the desired state of the system, Cloudera Manager requires the global configuration that the
Cloudera Management Service roles gather and provide. The Cloudera Manager Agent doubles as both the agent for
supervision and for monitoring.
Can I run the Cloudera Management Service and the Hadoop services on the host where the Cloudera Manager Server
is running?
Yes. This is especially common in deployments that have a small number of hosts.
Cloudera Navigator 2 Overview
Cloudera Navigator provides the following components to help you answer these questions and meet data-management
and security requirements.
• Data Management - Provides visibility into and control over the data in Hadoop datastores, and the computations
performed on that data. Hadoop administrators, data stewards, and data scientists can use Cloudera Navigator
to:
– Audit data access and verify access privileges - The goal of auditing is to capture a complete and immutable
record of all activity within a system. Cloudera Navigator auditing adds secure, real-time audit components
to key data and access frameworks. Compliance groups can use Cloudera Navigator to configure, collect, and
view audit events that show who accessed data, and how.
– Search metadata and visualize lineage - Cloudera Navigator metadata management allows DBAs, data stewards,
business analysts, and data scientists to define, search for, amend the properties of, and tag data entities
and view relationships between datasets.
– Policies - Data stewards can use Cloudera Navigator policies to define automated actions, based on data
access or on a schedule, to add metadata, create alerts, and move or purge data.
– Analytics - Hadoop administrators can use Cloudera Navigator analytics to examine data usage patterns and
create policies based on those patterns.
• Data Encryption - Data encryption and key management provide a critical layer of protection against potential
threats by malicious actors on the network or in the datacenter. Encryption and key management are also
requirements for meeting key compliance initiatives and ensuring the integrity of your enterprise data. The
following Cloudera Navigator components enable compliance groups to manage encryption:
– Cloudera Navigator Encrypt transparently encrypts and secures data at rest without requiring changes to
your applications and ensures there is minimal performance lag in the encryption or decryption process.
– Cloudera Navigator Key Trustee Server is an enterprise-grade virtual safe-deposit box that stores and manages
cryptographic keys and other security artifacts.
– Cloudera Navigator Key HSM allows Cloudera Navigator Key Trustee Server to seamlessly integrate with a
hardware security module (HSM).
You can install Cloudera Navigator data management and data encryption components independently.
Related Information
• Installing the Cloudera Navigator Data Management Component
• Upgrading the Cloudera Navigator Data Management Component
• Cloudera Navigator Data Management Component Administration
• Cloudera Data Management
• Configuring Authentication in the Cloudera Navigator Data Management Component
• Configuring TLS/SSL for the Cloudera Navigator Data Management Component
• Cloudera Navigator Data Management Component User Roles
Navigator auditing, metadata, lineage, policies, and analytics all support multi-cluster deployments managed by a
single Cloudera Manager instance. So if you have five clusters, all centrally managed by a single Cloudera Manager,
you see all this information in a single Navigator data management UI. In the metadata part of the UI, Navigator provides
technical metadata that tracks the specific cluster from which the data is derived.
You can access the Cloudera Navigator UI as follows:
1. Do one of the following:
• Enter the URL of the Navigator UI in a browser: http://Navigator_Metadata_Server_host:port/, where Navigator_Metadata_Server_host is the name of the host on which you installed the Navigator Metadata Server role and port is the port configured for the role. The default port of the Navigator Metadata Server is 7187. To change the port, follow the instructions in Configuring the Navigator Metadata Server Port.
• Select Clusters > Cloudera Management Service > Cloudera Navigator.
• Navigate from the Navigator Metadata Server role:
1. Select Clusters > Cloudera Management Service.
2. Click the Instances tab.
3. Click the Navigator Metadata Server role.
4. Click the Cloudera Navigator link.
2. Log into Cloudera Navigator UI using the credentials assigned by your administrator.
To download information about the API calls made by the Navigator UI, click Download debug file. A file named
api-data-Navigator_Metadata_Server_host-UTC timestamp.json is downloaded. For example:
{
"href": "http://Navigator Metadata Server
hostname:port/?view=detailsView&id=7f44221738670c98baf0799aa6abd330&activeView=lineage&b=ImMka",
"userAgent": ...
"windowSize": ...
},
"timestamp": 1456795776671,
"calls": [
{
"type": "POST",
"url": "/api/v6/interactive/entities?limit=0&offset=0",
"data":...,
"page": "http://Navigator Metadata Server
hostname:port/?view=resultsView&facets=%7B%22type%22%3A%5B%22database%22%5D%7D",
"timestamp": 1456795762472
},
{
"type": "GET",
"url": "/api/v3/entities?query=type%3Asource",
"status": 200,
"responseText": ...,
"page": "http://Navigator Metadata Server
hostname:port/?view=resultsView&facets=%7B%22type%22%3A%5B%22database%22%5D%7D",
"timestamp": 1456795763233
},
...
The Cloudera Navigator data management component is implemented as two roles in the Cloudera Management Service: Navigator Audit Server and Navigator Metadata Server. You can add Cloudera Navigator data management roles while installing Cloudera Manager for the first time or into an existing Cloudera Manager installation. For information on compatible Cloudera Navigator and Cloudera Manager versions, see the Product Compatibility Matrix for Cloudera Navigator.
Is Cloudera Navigator included with a Cloudera Enterprise Enterprise Data Hub Edition license?
Yes. Cloudera Navigator is included with a Cloudera Enterprise Enterprise Data Hub Edition license and can be selected
as a choice with a Cloudera Enterprise Flex Edition license.
What Cloudera Manager, CDH, and Impala releases does Cloudera Navigator 2 work with?
See Product Compatibility Matrix for Cloudera Navigator.
How are Cloudera Navigator logs different from Cloudera Manager logs?
Cloudera Navigator tracks and aggregates only accesses to the data stored in CDH services, for use in audit reports and analysis. Cloudera Manager monitors and logs all the activity performed by CDH services, which helps administrators maintain the health of the cluster. Together these logs provide better visibility into both the data access and system activity for an enterprise cluster.
Cloudera Navigator Data Encryption Overview
Warning: Encryption transforms coherent data into random, unrecognizable information for
unauthorized users. It is absolutely critical that you follow the documented procedures for encrypting
and decrypting data, and that you regularly back up the encryption keys and configuration files. Failure
to do so can result in irretrievable data loss. See Backing Up and Restoring Key Trustee Server and
Clients for more information.
Do not attempt to perform any operations that you do not understand. If you have any questions
about a procedure, contact Cloudera Support before proceeding.
Cloudera Navigator includes a turnkey encryption and key management solution for data at rest, whether data is stored
in HDFS or on the local Linux filesystem. Cloudera Navigator data encryption comprises the following components:
• Cloudera Navigator Key Trustee Server
Key Trustee Server is an enterprise-grade virtual safe-deposit box that stores and manages cryptographic keys.
With Key Trustee Server, encryption keys are separated from the encrypted data, ensuring that sensitive data is
protected in the event that unauthorized users gain access to the storage media.
• Cloudera Navigator Key HSM
Key HSM is a service that allows Key Trustee Server to integrate with a hardware security module (HSM). Key HSM
enables Key Trustee Server to use an HSM as the root of trust for cryptographic keys, taking advantage of Key
Trustee Server’s policy-based key and security asset management capabilities while satisfying existing internal
security requirements regarding treatment of cryptographic materials.
• Cloudera Navigator Encrypt
Navigator Encrypt is a client-side service that transparently encrypts data at rest without requiring changes to
your applications and with minimal performance lag in the encryption or decryption process. Advanced key
management with Key Trustee Server and process-based access controls in Navigator Encrypt enable organizations
to meet compliance regulations and ensure unauthorized parties or malicious actors never gain access to encrypted
data.
• Key Trustee KMS
For HDFS Transparent Encryption, Cloudera provides Key Trustee KMS, a customized key management server
(KMS) that uses Key Trustee Server for robust and scalable encryption key storage and management instead of
the file-based Java KeyStore used by the default Hadoop KMS.
• Cloudera Navigator HSM KMS
Also for HDFS Transparent Encryption, Navigator HSM KMS provides a customized key management server (KMS)
that uses third-party HSMs to provide the highest level of key isolation, storing key material on the HSM. When
using the Navigator HSM KMS, encryption zone key material originates on the HSM and never leaves the HSM.
While Navigator HSM KMS allows for the highest level of key isolation, it also requires some overhead for network
calls to the HSM for key generation, encryption and decryption operations.
• Cloudera Navigator HSM KMS Services and HA
Navigator HSM KMSs running on a single node fulfill the functional needs of users, but do not provide the
non-functional qualities of service necessary for production deployment (primarily key data high availability and
key data durability). You can achieve high availability (HA) of key material through the HA mechanisms of the
backing HSM. However, metadata cannot be stored on the HSM directly, so the HSM KMS provides for high
availability of key metadata via a built-in replication mechanism between the metadata stores of each KMS role
instance. This release supports a two-node topology for high availability. When deployed using this topology,
there is a durability guarantee enforced for key creation and roll such that a key create or roll operation will fail
if it cannot be successfully replicated between the two nodes.
The following summarizes where temp/spill data lives, how its keys are managed, and which encryption mechanism applies:
Temp/spill files for CDH components with native encryption (Impala, YARN, MapReduce, Flume, HBase, Accumulo) - Location: local filesystem. Key management: N/A (temporary keys are stored in memory only). Encryption: none required from Navigator (enable native temp/spill encryption for each component).
Temp/spill files for CDH components without native encryption - Location: local filesystem. Key management: Key Trustee Server. Encryption: Navigator Encrypt.
For instructions on using Navigator Encrypt to secure local filesystem data, see Cloudera Navigator Encrypt.
Key Trustee clients include Navigator Encrypt and Key Trustee KMS. Encryption keys are created by the client and
stored in Key Trustee Server.
For more details on the individual components of Cloudera Navigator data encryption, continue reading:
The most common Key Trustee Server clients are Navigator Encrypt and Key Trustee KMS.
When a Key Trustee client registers with Key Trustee Server, it generates a unique fingerprint. All client interactions
with the Key Trustee Server are authenticated with this fingerprint. You must ensure that the file containing this
fingerprint is secured with appropriate Linux file permissions. The file containing the fingerprint is
/etc/navencrypt/keytrustee/ztrustee.conf for Navigator Encrypt clients, and
/var/lib/kms-keytrustee/keytrustee/.keytrustee/keytrustee.conf for Key Trustee KMS.
Many clients can use the same Key Trustee Server to manage security objects. For example, you can have several
Navigator Encrypt clients using a Key Trustee Server, and also use the same Key Trustee Server as the backing store
for Key Trustee KMS (used in HDFS encryption).
1. A Key Trustee client (for example, Navigator Encrypt or Key Trustee KMS) sends an encrypted secret to Key Trustee
Server.
2. Key Trustee Server forwards the encrypted secret to Key HSM.
3. Key HSM generates a symmetric encryption key and sends it to the HSM over an encrypted channel.
4. The HSM generates a new key pair and encrypts the symmetric key and returns the encrypted symmetric key to
Key HSM.
5. Key HSM encrypts the original client-encrypted secret with the symmetric key, and returns the twice-encrypted
secret, along with the encrypted symmetric key, to Key Trustee Server. Key HSM discards its copy of the symmetric
key.
6. Key Trustee Server stores the twice-encrypted secret along with the encrypted symmetric key in its PostgreSQL
database.
The only way to retrieve the original encrypted secret is for Key HSM to request the HSM to decrypt the encrypted
symmetric key, which is required to decrypt the twice-encrypted secret. If the key has been revoked on the HSM, it is
not possible to retrieve the original secret.
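The envelope pattern described in these steps can be illustrated in a few lines. This is a conceptual sketch using the Python cryptography library, not Key HSM code; it only mimics the double encryption of the secret:

# Conceptual sketch of the envelope-encryption flow above (not Key HSM code).
from cryptography.fernet import Fernet

client_encrypted_secret = b"...already encrypted by the Key Trustee client..."

# Step 3: Key HSM generates a symmetric encryption key.
symmetric_key = Fernet.generate_key()

# Step 5: the client-encrypted secret is encrypted again with the symmetric
# key, producing the "twice-encrypted" secret stored by Key Trustee Server.
twice_encrypted = Fernet(symmetric_key).encrypt(client_encrypted_secret)

# Step 4 (on the HSM): a key pair that never leaves the HSM encrypts
# symmetric_key. Without the HSM decrypting that key, twice_encrypted
# cannot be recovered, which is why revoking the key on the HSM makes the
# original secret unrecoverable.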
For instructions on installing Navigator Key HSM, see Installing Cloudera Navigator Key HSM. For instructions on
configuring Navigator Key HSM, see Initializing Navigator Key HSM.
• Automatic key management: Encryption keys are stored in Key Trustee Server to separate the keys from the
encrypted data. If the encrypted data is compromised, it is useless without the encryption key.
• Transparent encryption and decryption: Protected data is encrypted and decrypted seamlessly, with minimal
performance impact and no modification to the software accessing the data.
• Process-based access controls: Processes are authorized individually to access encrypted data. If the process is
modified in any way, access is denied, preventing malicious users from using customized application binaries to
bypass the access control.
• Performance: Navigator Encrypt supports the Intel AES-NI cryptographic accelerator for enhanced performance
in the encryption and decryption process.
• Compliance: Navigator Encrypt enables you to comply with requirements for HIPAA-HITECH, PCI-DSS, FISMA, EU
Data Protection Directive, and other data security regulations.
• Multi-distribution support: Navigator Encrypt supports Debian, Ubuntu, RHEL, CentOS, and SLES.
• Simple installation: Navigator Encrypt is distributed as RPM and DEB packages, as well as SLES KMPs.
• Multiple mountpoints: You can separate data into different mountpoints, each with its own encryption key.
Navigator Encrypt can be used with many kinds of data, including (but not limited to):
• Databases
• Temporary files (YARN containers, spill files, and so on)
• Log files
• Data directories
• Configuration files
Navigator Encrypt uses dmcrypt for its underlying cryptographic operations. Navigator Encrypt uses several different
encryption keys:
• Master Key: The master key can be a single passphrase, dual passphrase, or RSA key file. The master key is stored
in Key Trustee Server and cached locally. This key is used when registering with a Key Trustee Server and when
performing administrative functions on Navigator Encrypt clients.
• Mount Encryption Key (MEK): This key is generated by Navigator Encrypt using openssl rand by default, but it
can alternatively use /dev/urandom. This key is generated when preparing a new mount point. Each mount point
has its own MEK. This key is uploaded to Key Trustee Server.
• dmcrypt Device Encryption Key (DEK): This key is not managed by Navigator Encrypt or Key Trustee Server. It is
managed locally by dmcrypt and stored in the header of the device.
For example:
"ALLOW @mydata * /usr/bin/myapp"
This rule allows the /usr/bin/myapp process to access any encrypted path (*) that was encrypted under the category @mydata.
Note: You have the option of using wildcard characters when defining process-based ACLs. The
following example shows valid wildcard definitions:
"ALLOW @* * *"
"ALLOW @* path/* /path/to/process"
Navigator Encrypt uses a kernel module that intercepts any input/output (I/O) sent to an encrypted and managed path.
The Linux module filename is navencryptfs.ko and it resides in the kernel stack, injecting filesystem hooks. It also
authenticates and authorizes processes and caches authentication results for increased performance.
Because the kernel module intercepts and does not modify I/O, it supports any filesystem (ext3, ext4, xfs, and so
on).
For example, when /usr/bin/myapp sends an open() call, it is intercepted by navencrypt-kernel-module as an open hook.
The kernel module calculates the process fingerprint. If the authentication cache already has the fingerprint, the process
is allowed to access the data. If the fingerprint is not in the cache, the fingerprint is checked against the ACL. If the ACL
grants access, the fingerprint is added to the authentication cache, and the process is permitted to access the data.
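The check amounts to a cache lookup backed by the ACL. A minimal sketch (illustrative only, not the kernel module's actual logic):

# Illustrative sketch of the fingerprint check described above.
authentication_cache = set()        # fingerprints already authorized
acl = {"fp-of-/usr/bin/myapp"}      # fingerprints granted by ACL rules

def may_access(process_fingerprint):
    if process_fingerprint in authentication_cache:
        return True                 # cached: allow immediately
    if process_fingerprint in acl:
        authentication_cache.add(process_fingerprint)  # cache for later I/O
        return True
    return False

print(may_access("fp-of-/usr/bin/myapp"))  # True; fingerprint is now cached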
When you add an ACL rule, you are prompted for the master key. If the rule is accepted, the ACL rules file is updated
as well as the navencrypt-kernel-module ACL cache.
Different aspects of Navigator Encrypt fit together as follows:
The user adds a rule to allow /usr/bin/myapp to access the encrypted data in the category @mylogs, and adds
another rule to allow /usr/bin/myapp to access encrypted data in the category @mydata. These two rules are loaded
into the navencrypt-kernel-module cache after restarting the kernel module.
The /mydata directory is encrypted under the @mydata category and /mylogs is encrypted under the @mylogs
category using dmcrypt (block device encryption).
When myapp tries to issue I/O to an encrypted directory, the kernel module calculates the fingerprint of the process
(/usr/bin/myapp) and compares it with the list of authorized fingerprints in the cache.
The master key is encrypted with a local GPG key. Before being stored in the Key Trustee Server database, it is encrypted
again with the Key Trustee Server GPG key. When the master key is needed to perform a Navigator Encrypt operation,
Key Trustee Server decrypts the stored key with its server GPG key and sends it back to the client (in this case, Navigator
Encrypt), which decrypts the deposit with the local GPG key.
All communication occurs over TLS-encrypted connections.
Cloudera Navigator Optimizer
Frequently Asked Questions About Cloudera Software
Getting Support
This section describes how to get support.
Cloudera Support
Cloudera can help you install, configure, optimize, tune, and run CDH for large scale data processing and analysis.
Cloudera supports CDH whether you run it on servers in your own datacenter or on hosted infrastructure services,
such as Amazon Web Services, Microsoft Azure, or Google Compute Engine.
If you are a Cloudera customer, you can:
• Register for an account to create a support ticket at the support site.
• Visit the Cloudera Knowledge Base.
If you are not a Cloudera customer, learn how Cloudera can help you.
Community Support
There are several vehicles for community support. You can:
• Register for the Cloudera forums.
• If you have any questions or comments about CDH, you can visit the Using the Platform forum.
• If you have any questions or comments about Cloudera Manager, you can:
– Visit the Cloudera Manager forum.
– Cloudera Express users can access the Cloudera Manager support mailing list from within the Cloudera
Manager Admin Console by selecting Support > Mailing List.
– Cloudera Enterprise customers can access the Cloudera Support Portal from within the Cloudera Manager
Admin Console, by selecting Support > Cloudera Support Portal. From there you can register for a support
account, create a support ticket, and access the Cloudera Knowledge Base.
• If you have any questions or comments about Cloudera Navigator, you can visit the Cloudera Navigator forum.
• Find more documentation for specific components by referring to External Documentation on page 35.
Report Issues
Your input is appreciated, but before filing a request:
• Search the Cloudera issue tracker, where Cloudera tracks software and documentation bugs and enhancement
requests for CDH.
• Search the CDH Manual Installation, Using the Platform, and Cloudera Manager forums.
Appendix: Apache License, Version 2.0
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their
Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against
any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated
within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under
this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution.
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You meet the following conditions:
1. You must give any other recipients of the Work or Derivative Works a copy of this License; and
2. You must cause any modified files to carry prominent notices stating that You changed the files; and
3. You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark,
and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part
of the Derivative Works; and
4. If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute
must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices
that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE
text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along
with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party
notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify
the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or
as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be
construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license
terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as
a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated
in this License.
5. Submission of Contributions.
Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the
Licensor shall be under the terms and conditions of this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement
you may have executed with Licensor regarding such Contributions.
6. Trademarks.
This License does not grant permission to use the trade names, trademarks, service marks, or product names of the
Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing
the content of the NOTICE file.
7. Disclaimer of Warranty.
Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides
its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or
FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or
redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required
by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable
to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising
as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss
of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even
if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability.
While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance
of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in
accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any
other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional
liability.
END OF TERMS AND CONDITIONS
http://www.apache.org/licenses/LICENSE-2.0