Cloudera Releases
Important Notice
Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service
names or slogans contained in this document are trademarks of Cloudera and its
suppliers or licensors, and may not be copied, imitated or used, in whole or in part,
without the prior written permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software
Foundation. All other trademarks, registered trademarks, product names and
company names or logos mentioned in this document are the property of their
respective owners. Reference to any products, services, processes or other
information, by trade name, trademark, manufacturer, supplier or otherwise does
not constitute or imply endorsement, sponsorship or recommendation thereof by
us.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced,
stored in or introduced into a retrieval system, or transmitted in any form or by any
means (electronic, mechanical, photocopying, recording, or otherwise), or for any
purpose, without the express written permission of Cloudera.
Cloudera, Inc.
1001 Page Mill Road Bldg 2
Palo Alto, CA 94304
[email protected]
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Release Information
Version: 5.4.x
Date: May 20, 2015
Table of Contents
Release Notes.............................................................................................................5
CDH 5 Release Notes...................................................................................................................................5
New Features in CDH 5............................................................................................................................................5
Incompatible Changes............................................................................................................................................56
Known Issues in CDH 5..........................................................................................................................................78
Issues Fixed in CDH 5...........................................................................................................................................110
Cloudera Manager 5 Release Notes......................................................................................................167
New Features and Changes in Cloudera Manager 5........................................................................................167
Known Issues and Workarounds in Cloudera Manager 5................................................................................181
Issues Fixed in Cloudera Manager 5...................................................................................................................189
Cloudera Navigator 2 Release Notes.....................................................................................................209
New Features and Changes in Cloudera Navigator 2.......................................................................................209
Known Issues and Workarounds in Cloudera Navigator 2..............................................................................212
Issues Fixed in Cloudera Navigator 2.................................................................................................................214
Release Notes
Note:
There is no CDH 5.2.2 release.
Important: Cloudera recommends that you use YARN (now production-ready) with CDH 5.
• MapReduce 2.0 (MRv2): CDH 5 includes MapReduce 2.0 (MRv2) running on YARN. The fundamental idea of
the YARN architecture is to split up the two primary responsibilities of the JobTracker — resource management
and job scheduling/monitoring — into separate daemons: a global ResourceManager (RM) and per-application
ApplicationMasters (AM). With MRv2, the ResourceManager (RM) and per-node NodeManagers (NM) form
the data-computation framework. The ResourceManager service effectively replaces the functions of the
JobTracker, and NodeManagers run on slave nodes instead of TaskTracker daemons. The per-application
ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from
the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. For details
of the new architecture, see Apache Hadoop NextGen MapReduce (YARN).
• MapReduce Version 1 (MRv1): For backward compatibility, CDH 5 continues to support the original MapReduce
framework (i.e. the JobTracker and TaskTrackers), but you should begin migrating to MRv2.
Note:
Cloudera does not support running MRv1 and YARN daemons on the same nodes at the same
time.
• Deprecated properties:
In Hadoop 2.0.0 and later (MRv2), a number of Hadoop and HDFS properties have been deprecated. (The
change dates from Hadoop 0.23.1, on which the Beta releases of CDH 4 were based). A list of deprecated
properties and their replacements can be found at Hadoop Deprecated Properties.
Note: All of these deprecated properties continue to work in MRv1. Conversely, the
new mapreduce.* properties listed do not work in MRv1.
<updateHandler class="solr.DirectUpdateHandler2">
<!-- Enables a transaction log, used for real-time get, durability,
and SolrCloud replica recovery. The log can grow as big as
uncommitted changes to the index, so use of a hard autoCommit
is recommended (see below).
"dir" - the target directory for transaction logs, defaults to the
solr data directory. -->
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
<int name="tlogDfsReplication">3</int>
</updateLog>
You might want to increase the replication level from the default level of 1 to some higher value such as 3.
Increasing the transaction log replication level can:
• Reduce the chance of data loss, especially when the system is otherwise configured to have single replicas
of shards. For example, having single replicas of shards is reasonable when autoAddReplicas is enabled,
but without additional transaction log replicas, the risk of data loss during a node failure would increase.
• Facilitate rolling upgrade of HDFS while Search is running. If you have multiple copies of the log, when a
node with the transaction log becomes unavailable during the rolling upgrade process, another copy of
the log can continue to collect transactions.
• Facilitate HDFS write lease recovery.
Initial testing shows no significant performance regression for common use cases.
Important:
Upgrading to CDH 5.4.0 and later from any earlier release requires an HDFS metadata upgrade.
• If you are using Cloudera Manager to upgrade CDH, see Upgrading CDH and Managed Services
Using Cloudera Manager.
– If you are running an earlier CDH 5 release and have an Enterprise License, you can perform a
rolling upgrade: see Performing a Rolling Upgrade on a CDH 5 Cluster.
• If you are not using Cloudera Manager, see Upgrading Unmanaged CDH Using the Command Line.
Be careful to follow all of the upgrade steps as instructed.
For the latest Impala features, see New Features in Impala Version 2.2.0 / CDH 5.4.0 on page 41.
Operating System Support
CDH 5.4.0 adds support for RHEL 6.6 and CentOS 6.6.
Security
The following summarizes new security capabilities in CDH 5.4.0:
• Secure Hue impersonation support for the Hue HBase application.
• Redaction of sensitive data from logs, centrally managed by Cloudera Manager, which prevents the WHERE
clause in queries from leaking sensitive data into logs and management UIs.
• Cloudera Manager support for custom Kerberos principals.
• Kerberos support for Sqoop 2.
• Kerberos and TLS/SSL support for Flume Thrift source and sink.
• Navigator SAML support (requires Cloudera Manager).
• Navigator Key Trustee can now be installed and monitored by Cloudera Manager.
• Search can be configured to use SSL.
• Search supports protecting Solr and Lily HBase Indexer metadata using ZooKeeper ACLs in a Kerberos-enabled
environment.
Apache Crunch
New HBase-related features:
• HBaseTypes.cells() was added to support serializing HBase Cell objects.
• All of the HFileUtils methods now support PCollection<C extends Cell>, which includes both
PCollection<KeyValue> and PCollection<Cell>, in their method signatures.
• HFileTarget, HBaseTarget, and HBaseSourceTarget all support any subclass of Cell as an output type.
HFileSource and HBaseSourceTarget still return KeyValue as the input type for backward compatibility
with existing Crunch pipelines.
Developers can use Cell-based APIs in the same way as KeyValue-based APIs if they are not ready to update
their code, but will probably have to change code inside DoFns because HBase 0.99 and later APIs deprecated
or removed a number of methods from the HBase 0.96 API.
Apache Flume
CDH 5.4.0 adds SSL and Kerberos support for the Thrift source and sink, and implements DatasetSink 2.0.
Apache Hadoop
HDFS
• CDH 5.4.0 implements HDFS 2.6.0.
• CDH 5.4.0 HDFS provides hot-swap capability for DataNode disk drives. You can add or replace HDFS data
volumes without shutting down the DataNode host (HDFS-1362); see Performing Disk Hot Swap for DataNodes.
• CDH 5.4.0 introduces cluster-wide redaction of sensitive data in logs and SQL queries. See Sensitive Data
Redaction.
• CDH 5.4.0 adds support for Heterogeneous Storage Policies.
MapReduce
CDH 5.4.0 implements MAPREDUCE-5785, which simplifies MapReduce job configuration. Instead of having to
set both the heap size (mapreduce.map.java.opts or mapreduce.reduce.java.opts) and the container size
(mapreduce.map.memory.mb or mapreduce.reduce.memory.mb), you can now choose to set only one of them;
the other is inferred from mapreduce.job.heap.memory-mb.ratio. If you do not specify either of them, the
container size defaults to 1 GB and the heap size is inferred.
For jobs that do not set the heap size, the JVM size increases from 200 MB to a default 820 MB. This is adequate
for most jobs, but streaming tasks might exceed the container size because the larger Java process pushes total
memory usage over the limit. This typically occurs only for tasks that relied on aggressive garbage collection to
keep the heap under 200 MB.
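As a minimal sketch, a job configuration that sets only the container sizes and lets the heap sizes be inferred
might look like the following; the property names are those given above, and the sizes shown are illustrative,
not recommendations:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value> <!-- container size for map tasks, in MB -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value> <!-- container size for reduce tasks, in MB -->
</property>
<property>
  <name>mapreduce.job.heap.memory-mb.ratio</name>
  <value>0.8</value> <!-- heap size is inferred as container size times this ratio;
       0.8 matches the 820 MB default heap described above -->
</property>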
YARN
• YARN-2990 improves application launch time by 6 seconds when using FairScheduler (with the default
Cloudera Manager settings shown in YARN (MR2 Included) Properties in CDH 5.4.0).
Apache HBase
CDH 5.4.0 implements HBase 1.0. For detailed information and instructions on how to use the new capabilities,
see New Features and Changes for HBase in CDH 5.
MultiWAL Support for HBase
CDH 5.4.0 introduces MultiWAL support for HBase region servers, allowing you to increase throughput when a
region writes to the write-ahead log (WAL).
doAs Impersonation for HBase
CDH 5.4.0 introduces doAs impersonation for the HBase Thrift server. doAs impersonation allows a client to
authenticate to HBase as any user, and re-authenticate at any time, instead of as a static user only. See Configure
doAs Impersonation for the HBase Thrift Gateway.
Read Replicas for HBase
CDH 5.4.0 introduces read replicas, along with a new timeline consistency model. This feature allows you to
balance consistency and availability on a per-read basis, and provides a measure of high availability for reads
if a RegionServer becomes unavailable. See HBase Read Replicas.
Storing Medium Objects (MOBs) in HBase
CDH 5.4.0 HBase MOB allows you to store objects up to 10 MB (medium objects, or MOBs) directly in HBase
while maintaining read and write performance. See Storing Medium Objects (MOBs) in HBase.
Apache Hive
CDH 5.4.0 implements Hive 1.1.0. New capabilities include:
• A test-only version of Hive on Spark with the following limitations:
– Parquet does not currently support vectorization; it simply ignores the setting of
hive.vectorized.execution.enabled.
– Hive on Spark does not yet support dynamic partition pruning.
– Hive on Spark does not yet support HBase. If you want to interact with HBase, Cloudera recommends
that you use Hive on MapReduce.
Important: Hive on Spark is included in CDH 5.4 but is not currently supported or recommended
for production use. If you are interested in this feature, try it out in a test environment until we
address the issues and limitations needed for production-readiness.
To deploy and test Hive on Spark in a test environment, use Cloudera Manager (see Configuring Hive on Spark).
• Support for JAR file changes without scheduled maintenance.
To implement this capability, proceed as follows:
1. Set hive.reloadable.aux.jars.path in /etc/hive/conf/hive-site.xml to the directory that
contains the JAR files.
2. Execute the reload; statement on HiveServer2 clients such as Beeline and the Hive JDBC driver.
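For step 1, the hive-site.xml entry might look like the following sketch; the directory path shown is hypothetical:

<property>
  <name>hive.reloadable.aux.jars.path</name>
  <value>/opt/local/hive/aux-jars</value> <!-- hypothetical directory containing the JAR files -->
</property>

After adding or removing JAR files in that directory, issue reload; from a client such as Beeline to pick up the
changes.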
• Beeline support for retrieving and printing query logs.
Some features in the upstream release are not yet supported for production use in CDH; these include:
• HIVE-7935 - Support dynamic service discovery for HiveServer2
• HIVE-6455 - Scalable dynamic partitioning and bucketing optimization
• HIVE-5317 - Implement insert, update, and delete in Hive with full ACID support
• HIVE-7068 - Integrate AccumuloStorageHandler
• HIVE-7090 - Support session-level temporary tables in Hive
• HIVE-7341 - Support for Table replication across HCatalog instances
• HIVE-4752 - Add support for HiveServer2 to use Thrift over HTTP
Hue
CDH 5.4.0 adds the following:
• New Oozie editor
• Performance improvements
• New Search facets
• HBase impersonation
Kite
Kite in CDH has been rebased on the 1.0 release upstream. This breaks backward compatibility with existing
APIs. The APIs are documented at http://kitesdk.org/docs/1.0.0/apidocs/index.html.
Notable changes are:
• Dataset writers that implement flush and sync now extend the Flushable and Syncable interfaces. Writers that
do not implement these operations no longer have misleading flush and sync methods.
• DatasetReaderException, DatasetWriterException, and DatasetRepositoryException have been
removed and replaced with more specific exceptions, such as IncompatibleSchemaException. Exception
classes now indicate what went wrong instead of what threw the exception.
• The partition API is no longer exposed; use the view API instead.
• kite-data-hcatalog is now kite-data-hive.
Note:
From 1.0 on, Kite will be strict about breaking compatibility and will use semantic versioning to signal
which compatibility guarantees you can expect from a release (for example, incompatible changes
require increasing the major version number). For more information, see the Hello, Kite SDK 1.0 blog
post.
Apache Oozie
• Added a Spark action, which lets you run Spark applications from Oozie workflows. See the Oozie documentation
for more details.
• The Hive2 action now collects and reports Hadoop Job IDs for MapReduce jobs launched by Hive Server 2.
• The launcher job now uses YARN uber mode for all but the Shell action; this reduces the overhead (time and
resources) of running these Oozie actions.
Apache Parquet
• The Parquet memory manager now changes the row group size if the current size is expected to cause
out-of-memory (OOM) errors because too many files are open, and prints a WARN message in the logs when it
does so. A new setting, parquet.memory.pool.ratio, controls the percentage of the JVM's heap memory
Parquet attempts to use. (A configuration sketch covering the settings in this list appears after the list.)
• To improve job startup time, footers are no longer read by default for MapReduce jobs (PARQUET-139).
Note:
To revert to the old behavior (ParquetFileReader reads in all the files to obtain the footers), set
parquet.task.side.metadata to false in the job configuration.
• The Parquet Avro object model can now read lists and maps written by Hive, Avro, and Thrift (similar capabilities
were added to Hive in CDH 5.3). This compatibility fix does not change behavior. The extra record layer wrapping
the list elements when Avro reads lists written by Hive can now be removed; to do this, set the expected
Avro schema or set parquet.avro.add-list-element-records to false.
• Avro's map representation now writes null values correctly.
• The Parquet Thrift object model can now read data written by other object models (such as Hive, Impala, or
Parquet-Avro), given a Thrift class for the data; compile a Thrift definition into an object, and supply it when
creating the job.
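As a sketch, the Parquet settings named in this list might be combined in a job configuration as follows; the
values shown are illustrative only:

<property>
  <name>parquet.memory.pool.ratio</name>
  <value>0.65</value> <!-- illustrative: fraction of the JVM heap the memory manager may use -->
</property>
<property>
  <name>parquet.task.side.metadata</name>
  <value>false</value> <!-- reverts to reading all footers up front (pre-PARQUET-139 behavior) -->
</property>
<property>
  <name>parquet.avro.add-list-element-records</name>
  <value>false</value> <!-- removes the extra record layer when Avro reads lists written by Hive -->
</property>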
Cloudera Search
• Solr metadata stored in ZooKeeper can now be protected by ZooKeeper ACLs. In a Kerberos-enabled
environment, Solr metadata stored in ZooKeeper is owned by the solr user and cannot be modified by other
users.
Note:
• The Solr principal name can be configured in Cloudera Manager. The default name is solr,
although other names can be specified.
• Collection configuration information stored under the /solr/configs znode is not affected
by this change. As a result, collection configuration behavior is unchanged.
Administrators who modify Solr ZooKeeper metadata through operations like solrctl init or solrctl
cluster --put-solrxml must now supply solrctl with a JAAS configuration using the --jaas configuration
parameter. The JAAS configuration must specify the principal, typically solr, that the solr process uses. See
Solrctl Reference for more information.
End users, who typically do not need to modify Solr metadata, are unaffected by this change.
• Lily HBase Indexer metadata stored in ZooKeeper can now be protected by ZooKeeper ACLs. In a
Kerberos-enabled environment, Lily HBase Indexer metadata stored in ZooKeeper is owned by the solr user
and cannot be modified by other users.
End users, who typically do not manage the Lily HBase Indexer, are unaffected by this change.
• The Lily HBase Indexer supports restricting access using Sentry. For more information, see Sentry integration.
• Services included with Search for CDH 5.4.0, including Solr, Key-Value Store Indexer, and Flume, now support
SSL.
• The Spark Indexer and the Lily HBase Batch Indexer support delegation tokens for mapper-only jobs. For
more information, see Spark Indexing Reference (CDH 5.2 or later only) and HBaseMapReduceIndexerTool.
• Search for CDH 5.4.0 implements SOLR-5746, which improves solr.xml file parsing. Error checking for
duplicated options or unknown option names was added. These checks can help identify mistakes made
during manual edits of the solr.xml file. User-modified solr.xml files may cause errors on startup due to
these parsing improvements.
• By default, CloudSolrServer now uses multiple threads to add documents.
To get the old, single-threaded behavior, set parallel updates to false on the CloudSolrServer instance.
Related JIRA: SOLR-4816.
• Updates are routed directly to the correct shard leader, eliminating document routing at the server. This
allows for near-linear indexing throughput scalability. Document routing requires that the solrj client
know each document's unique identifier. The unique identifiers allow the client to route the update directly
to the correct shard. For additional information, see Shards and Indexing Data in SolrCloud.
Related JIRA: SOLR-4816.
• The loadSolr morphline command supports nested documents. For more information, see Morphlines
Reference Guide.
• Navigator can be used to audit Cloudera Search activity. For more information on the Solr operations that
can be audited, see Audit Events and Audit Reports.
• Search for CDH 5.4 supports logging queries before they are executed. This allows you to identify queries
that could increase resource consumption, and to improve schemas or filters to meet your
performance requirements. To enable this feature, set the SolrCore and SolrCore.Request log level to DEBUG.
Related JIRA: SOLR-6919
• UniqFieldsUpdateProcessorFactory, which the Solr server implements, has been improved to support all of the
FieldMutatingUpdateProcessorFactory selector options. The <lst name="fields"> init param option
is deprecated. Replace this option with <arr name="fieldName"> (see the sketch at the end of this list).
If the <lst name="fields"> init param option is used, Solr logs a warning.
Related JIRA: SOLR-4249.
• Configuration information was previously available using FieldMutatingUpdateProcessorFactory
(oneOrMany or getBooleanArg). Those methods are now deprecated. The methods have been moved to
NamedList and renamed to removeConfigArgs and removeBooleanArg, respectively.
If the oneOrMany or getBooleanArg methods of FieldMutatingUpdateProcessorFactory are used, Solr
logs a warning.
Related JIRA: SOLR-5264.
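As a sketch of the SOLR-4249 change above, a solrconfig.xml processor entry might move from the deprecated
form to the replacement form as follows; the field name tags is hypothetical:

<!-- Deprecated form; Solr logs a warning: -->
<processor class="solr.UniqFieldsUpdateProcessorFactory">
  <lst name="fields">
    <str>tags</str>
  </lst>
</processor>
<!-- Replacement form, using the FieldMutatingUpdateProcessorFactory selector syntax: -->
<processor class="solr.UniqFieldsUpdateProcessorFactory">
  <arr name="fieldName">
    <str>tags</str>
  </arr>
</processor>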
Apache Spark
CDH 5.4.0 Spark is rebased on Apache Spark 1.3.0 and provides the following new capabilities:
• Spark Streaming WAL (write-ahead log) on HDFS, preventing any data loss on driver failure
• Spark external shuffle service
• Improvements in automatically setting CDH classpaths for Avro, Parquet, Flume, and Hive
• Improvements in the collection of task metrics
• Kafka connector for Spark Streaming to avoid the need for the HDFS WAL
The following is not yet supported in a production environment because of its immaturity:
• Spark SQL (which now includes DataFrames)
See also Apache Spark Known Issues on page 106 and Apache Spark Incompatible Changes on page 77.
Apache Sqoop
• Sqoop 2:
– CDH 5.4.0 implements Sqoop 2 version 1.99.5.
– Sqoop 2 supports Kerberos as of CDH 5.4.0.
– Sqoop 2 supports PostgreSQL as the repository database.
Important:
Client hosts may need a more recent version of libcrypto.so. See Apache Hadoop Known Issues
on page 80 for more information.
• S3A - S3A is a Hadoop file system implementation backed by the Simple Storage Service (S3) from Amazon Web
Services. It is similar to S3N, the other implementation of this functionality. The key difference is that S3A relies
on the officially supported AWS Java SDK for communicating with S3, while S3N uses the best-effort-supported
jets3t library to do the same. For a listing of the parameters, see HADOOP-10400.
YARN
YARN now provides a way for long-running applications to get new delegation tokens.
Apache Flume
CDH 5.3 provides a Kafka Channel (FLUME-2500).
Apache HBase
CDH 5.3 provides checkAndMutate(RowMutations), in addition to existing support for atomic checkAndPut as
well as checkAndDelete operations on individual rows (HBASE-11796).
Apache Hive
• Hive can use multiple HDFS encryption zones.
• Hive-HBase integration contains many fixes and new features such as reading HBase snapshots.
• Many Hive Parquet fixes.
• Hive Server 2 can handle multiple LDAP domains for authentication.
Hue
New Features:
• Hue is rebased on Hue 3.7.
• SAML authentication has been revamped.
• CDH 5.3 simplifies the task of configuring Hue to store data in an Oracle database by bundling the Oracle
Install Client. For instructions, see Using an External Database for Hue Using the Command Line.
Apache Oozie
• You can now update the definition and properties of an already running Coordinator. See the documentation
for more information.
• A new poll command in the Oozie client polls a Workflow Job, Coordinator Job, Coordinator Action, or Bundle
Job until it finishes. See the documentation for more information.
Apache Parquet
• PARQUET-132: Add type parameter to AvroParquetInputFormat for Spark
• PARQUET-107: Add option to disable summary metadata files
• PARQUET-64: Add support for new type annotations (date, time, timestamp, etc.)
Cloudera Search
New Features:
• Cloudera Search includes a version of Kite 0.15.0 that includes morphlines-related backports of all fixes
and features in Kite 0.17.1. Morphlines now includes functionality for partially updating documents
as well as deleting documents. Partial updates and deletes can be applied by unique ID or to documents
that match a query. For additional information on Kite, see:
– Kite repository
– Kite Release Notes
– Kite documentation
– Kite examples
• CrunchIndexerTool now sends a commit to Solr on job success.
• Added support for deleting documents stored in Solr by unique id as well as by query.
Apache Sentry (incubating)
• Sentry HDFS Plugin - Allows you to configure synchronization of Sentry privileges to HDFS ACLs for specific
HDFS directories. This simplifies the process of sharing table data between Hive or Impala and other clients
(such as MapReduce, Pig, Spark), by automatically updating the ACLs when a GRANT or REVOKE statement is
executed. It also allows all roles and privileges to be managed in a central location (by Sentry).
• Metrics - CDH 5.3 supports metrics for the Sentry service. These metrics can be reported either through JMX
or the console; configure this by setting the property sentry.service.reporter to jmx or console. A
Sentry web server, listening by default on port 51000, can expose the metrics in JSON format. Web reporting
is disabled by default; enable it by setting sentry.service.web.enable to true. You can configure the
port on which the Sentry web server listens by means of the sentry.service.web.port property.
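A minimal sentry-site.xml sketch combining these settings might look like the following; the property names
are the ones given above, and the port shown is the default:

<property>
  <name>sentry.service.reporter</name>
  <value>jmx</value> <!-- or console -->
</property>
<property>
  <name>sentry.service.web.enable</name>
  <value>true</value> <!-- web reporting is disabled by default -->
</property>
<property>
  <name>sentry.service.web.port</name>
  <value>51000</value> <!-- the default port -->
</property>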
Apache Spark
• CDH Spark has been rebased on Apache Spark 1.2.0.
• Spark Streaming can now save incoming data to a WAL (write-ahead log) on HDFS, preventing any data loss
on driver failure.
Important:
This feature is currently in Beta; Cloudera includes it in CDH Spark but does not support it.
• The YARN back end now supports dynamic allocation of executors. See
http://spark.apache.org/docs/latest/job-scheduling.html for more information.
• Native library paths (set via Spark configuration options) are correctly propagated to executors in YARN mode
(SPARK-1719).
• The Snappy codec should now work out-of-the-box on Linux distributions with older glibc versions such
as CentOS 5.
• Spark SQL now includes the Spark Thrift Server in CDH.
Important:
Spark SQL remains an experimental and unsupported feature in CDH.
See Apache Spark Incompatible Changes on page 77 and Apache Spark Known Issues on page 106 for additional
important information.
Apache Sqoop
• Sqoop 1:
– The MySQL connector now fetches on a row-by-row basis.
– The SQL Server connector now has upsert (insert or update) support (SQOOP-1403).
– The Oracle direct connector now works with index-organized tables (SQOOP-1632). To use this capability,
you must set the chunk method to PARTITION:
-Doraoop.chunk.method=PARTITION
• Sqoop 2:
– FROM/TO re-factoring is now supported (SQOOP-1367).
requires changes across a wide variety of components of CDH and Cloudera Manager in 5.2.0 and all earlier
versions. CDH 5.2.1 provides these changes for CDH 5.2.0 deployments. For more information, see the Cloudera
Security Bulletin.
Apache Hadoop Distributed Cache Vulnerability
The Distributed Cache Vulnerability allows a malicious cluster user to expose private files owned by the user
running the YARN NodeManager process. For more information, see the Cloudera Security Bulletin.
Other Fixes
CDH 5.2.1 also fixes the following issues:
• HADOOP-11243 - SSLFactory shouldn't allow SSLv3
• HADOOP-11217 - Disable SSLv3 in KMS
• HADOOP-11156 - DelegateToFileSystem should implement getFsStatus(final Path f).
• HADOOP-11176 - KMSClientProvider authentication fails when both currentUgi and loginUgi are a proxied
user
• HDFS-7235 - DataNode#transferBlock should report blocks that don't exist using reportBadBlock
• HDFS-7274 - Disable SSLv3 in HttpFS
• HDFS-7391 - Reenable SSLv2Hello in HttpFS
• HDFS-6781 - Separate HDFS commands from CommandsManual.apt.vm
• HDFS-6831 - Inconsistency between 'hdfs dfsadmin' and 'hdfs dfsadmin -help'
• HDFS-7278 - Add a command that allows sysadmins to manually trigger full block reports from a DN
• YARN-2010 - Handle app-recovery failures gracefully
• YARN-2588 - Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception.
• YARN-2566 - DefaultContainerExecutor should pick a working directory randomly
• YARN-2641 - Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
• MAPREDUCE-6147 - Support mapreduce.input.fileinputformat.split.maxsize
• HBASE-12376 - HBaseAdmin leaks ZK connections if failure starting watchers (ConnectionLossException)
• HBASE-12201 - Close the writers in the MOB sweep tool
• HBASE-12220 - Add hedgedReads and hedgedReadWins metrics
• HIVE-8693 - Separate out fair scheduler dependency from hadoop 0.23 shim
• HIVE-8634 - HiveServer2 fair scheduler queue mapping doesn't handle the secondary groups rules correctly
• HIVE-8675 - Increase thrift server protocol test coverage
• HIVE-8827 - Remove SSLv2Hello from list of disabled protocols
• HIVE-8615 - beeline csv,tsv outputformat needs backward compatibility mode
• HIVE-8627 - Compute stats on a table from impala caused the table to be corrupted
• HIVE-7764 - Support all JDBC-HiveServer2 authentication modes on a secure cluster
• HIVE-8182 - beeline fails when executing multiple-line queries with trailing spaces
• HUE-2438 - [core] Disable SSLv3 for Poodle vulnerability
• IMPALA-1361: FE Exceptions with BETWEEN predicates
• IMPALA-1397: free local expr allocations in scanner threads
• IMPALA-1400: Window function insert issue (LAG() + OVER)
• IMPALA-1401: raise MAX_PAGE_HEADER_SIZE and use scanner context to stitch together header buffer
• IMPALA-1410: accept "single character" character classes in regex functions
• IMPALA-1411: Create table as select produces incorrect results
• IMPALA-1416 - Queries fail with metastore exception after upgrade and compute stats
• OOZIE-2034 - Disable SSLv3 (POODLEbleed vulnerability)
• OOZIE-2063 - Cron syntax creates duplicate actions
• PARQUET-107 - Add option to disable summary metadata aggregation after MR jobs
• SPARK-3788 - Yarn dist cache code is not friendly to HDFS HA, Federation
• SPARK-3661 - spark.*.memory is ignored in cluster mode
• SPARK-3979 - Yarn backend's default file replication should match HDFS' default one
Important:
Upgrading to CDH 5.2.0 and later from any earlier release requires an HDFS metadata upgrade and
other steps not usually required for a minor-release upgrade. See Upgrading from an Earlier CDH 5
Release to the Latest Release for more information, and be careful to follow all of the upgrade steps
as instructed.
Important:
Installing CDH by adding a repository entails an additional step on Ubuntu Trusty, to ensure that you
get the CDH version of ZooKeeper, rather than the version that is bundled with Trusty.
Apache Avro
CDH 5.2 implements Avro version 1.7.6, with backports from 1.7.7. Important changes include:
• AVRO-1398: Increase default sync interval from 16k to 64k. There is a very small chance this could cause
an incompatibility in some cases, but you can control the interval by setting avro.mapred.sync.interval
in the MapReduce job configuration. For example, set it to 16000 to get the old behavior (see the sketch after
this list).
• AVRO-1355: Record schema should reject duplicate field names. This change rejects schemas with duplicate
field names. This could affect some applications, but if schemas have duplicate field names then they are
unlikely to work properly in any case. The workaround is to make sure a record's field names are unique
within the record.
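As a sketch of the AVRO-1398 workaround above, the job configuration entry restoring the old interval might
look like:

<property>
  <name>avro.mapred.sync.interval</name>
  <value>16000</value> <!-- bytes; restores roughly the pre-1.7.6 16k default -->
</property>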
Apache Hadoop
HDFS
CDH 5.2 provides the following new capabilities:
• HDFS Data at Rest Encryption
Note: Cloudera provides the following two solutions for data at rest encryption:
• Navigator Encrypt - is production ready and available for Cloudera customers licensed for
Cloudera Navigator. Navigator Encrypt operates at the Linux volume level, so it can encrypt
cluster data inside and outside HDFS. Talk to your Cloudera account team for more information
about this capability.
• HDFS Encryption - included in CDH 5.2.0 - operates at the HDFS folder level, enabling encryption
to be applied only to HDFS folders where needed. This feature has several known limitations.
Therefore, Cloudera does not currently support this feature in CDH 5.2 and it is not recommended
for production use. If you're interested in trying the feature out, upgrade to the latest version
of CDH 5.
HDFS now implements transparent, end-to-end encryption of data read from and written to
HDFS by creating encryption zones. An encryption zone is a directory in HDFS with all of its
contents, that is, every file and subdirectory in it, encrypted. You can use either the KMS or the
Key Trustee service to store, manage, and access encryption zone keys.
• Extended attributes: HDFS XAttrs allow extended attributes to be stored per file
(https://issues.apache.org/jira/browse/HDFS-2006).
• Authentication improvements when using an HTTP proxy server.
• A new Hadoop Metrics sink that allows writing directly to Graphite.
• Specification for Hadoop Compatible Filesystem effort.
• OfflineImageViewer to browse an fsimage via the WebHDFS API.
• Supportability improvements and bug fixes to the NFS gateway.
• Modernized web UIs (HTML5 and JavaScript) for HDFS daemons.
MapReduce
CDH 5.2 provides an optimized implementation of the mapper side of the MapReduce shuffle. The optimized
implementation may require tuning different from the original implementation, and so it is considered
experimental and is not enabled by default.
You can select this new implementation on a per-job basis by setting the job configuration value
mapreduce.job.map.output.collector.class to
org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator (see the sketch below), or use Cloudera
Manager to enable it.
Some jobs which use custom writable types or comparators may not be able to take advantage of the optimized
implementation.
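A per-job configuration sketch selecting the optimized collector might look like the following:

<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</value>
  <!-- selects the experimental optimized map-side shuffle for this job only -->
</property>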
YARN
CDH 5.2 provides the following new capabilities and improvements:
• New features and improvements in the Fair Scheduler:
– New features:
– Fair Scheduler now allows setting the fairsharePreemptionThreshold per queue (leaf and non-leaf).
This threshold is a decimal value between 0 and 1; if a queue's usage is under (preemption-threshold
* fairshare) for a configured duration, resources from other queues are preempted to satisfy this
queue's request. Set this value in fair-scheduler.xml. The default value is 0.5.
– Fair Scheduler now allows setting the fairsharePreemptionTimeout per queue (leaf and non-leaf).
For a starved queue, this timeout determines when to trigger preemption from other queues. Set this
value in fair-scheduler.xml.
– Fair Scheduler now shows the Steady Fair Share in the Web UI. The Steady Fair Share is the share of
the cluster resources a particular queue or pool would get if all existing queues had running applications.
– Improvements:
– Fair Scheduler uses Instantaneous Fair Share (fairshare that considers only active queues) for
scheduling decisions, to reduce the time to achieve the steady-state fairshare.
– The default for maxAMShare is now 0.5, meaning that only half the cluster's resources can be taken
up by Application Masters. You can change this value in fair-scheduler.xml; a sketch appears after this list.
• A new module, crunch-hive, for reading and writing Optimized Row Columnar (ORC) Files with Crunch.
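As a sketch, the per-queue preemption settings and maxAMShare described above might appear in
fair-scheduler.xml as follows. The queue name and values are illustrative, and the element names (camel case
here) should be confirmed against the Fair Scheduler documentation for your release:

<allocations>
  <queue name="analytics"> <!-- hypothetical queue -->
    <fairSharePreemptionThreshold>0.5</fairSharePreemptionThreshold> <!-- the default -->
    <fairSharePreemptionTimeout>60</fairSharePreemptionTimeout> <!-- seconds; illustrative -->
    <maxAMShare>0.5</maxAMShare> <!-- the new default described above -->
  </queue>
</allocations>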
Apache Flume
CDH 5.2 provides the following new capabilities:
• Kafka Integration: Flume can now accept data from Kafka via the KafkaSource (FLUME-2250) and push to
Kafka using the KafkaSink (FLUME-2251).
• Kite Sink can now write to Hive and HBase datasets (FLUME-2463).
• Flume agents can now be configured via ZooKeeper (experimental, FLUME-1491)
• Embedded Agents now support Interceptors (FLUME-2426)
• syslog Sources now support configuring which fields should be kept (FLUME-2438)
• File Channel replay is now much faster (FLUME-2450)
• New regular-expression search-and-replace interceptor (FLUME-2431)
• Backup checkpoints can be optionally compressed (FLUME-2401)
Hue
CDH 5.2 provides the following new capabilities:
• New application for editing Sentry roles and Privileges on databases and tables
• Search App
• Heatmap, Tree, Leaflet widgets
• Micro-analysis of fields
• Exclusion facets
• Oozie Dashboard: bulk actions, faster display
• File Browser: drag-and-drop upload, history, ACL editing
• Hive and Impala: LDAP pass-through, query expiration, SSL (Hive), new graphs
• Job Browser: YARN kill application button
Apache HBase
CDH 5.2 implements HBase 0.98.6, which represents a minor upgrade to HBase. This upgrade introduces new
features and moves some features which were previously marked as experimental to fully supported status.
For detailed information and instructions on how to use the new capabilities, see New Features and Changes
for HBase in CDH 5.
Apache Hive
CDH 5.2 introduces the following important changes in Hive.
• CDH 5.2 implements Hive 0.13, providing the following new capabilities:
– Sub-queries in the WHERE clause
– Common table expressions (CTE)
– Parquet supports timestamp
– HiveServer2 can be configured with a hiverc file that is automatically run when users connect
– Permanent UDFs
– HiveServer2 session and operation timeouts
– Beeline accepts a -i option to initialize with a SQL file
– New join syntax (implicit joins)
• As of CDH 5.2.0, you can create Avro-backed tables simply by using STORED AS AVRO in a DDL statement.
The AvroSerDe takes care of creating the appropriate Avro schema from the Hive table schema, making it
much easier to use Avro with Hive.
• Hive supports additional datatypes, as follows:
– Hive can read char and varchar datatypes written by Hive, and char and varchar datatypes written by
Impala.
– Impala can read char and varchar datatypes written by Hive and Impala.
These new types have been enabled by expanding the supported DDL, so they are backward compatible. You
can add varchar(n) columns by creating new tables with that type, or changing a string column in existing
tables to varchar.
Note:
char(n) columns are not stored in a fixed-length representation, and do not improve performance
(as they do in some other databases). Cloudera recommends that in most cases you use text or
varchar instead.
• DESCRIBE DATABASE returns additional fields: owner_name and owner_type. The command will continue
to behave as expected if you identify the field you're interested in by its (string) name, but could produce
unexpected results if you use a numeric index to identify the field(s).
Impala
Impala in CDH 5.2.0 includes major new features such as spill-to-disk for memory-intensive queries, subquery
enhancements, analytic functions, and new CHAR and VARCHAR data types. For the full feature list and more
details, see What's New in Impala on page 40.
Kite
Kite is an open source set of libraries, references, tutorials, and code samples for building data-oriented systems
and applications. For more information about Kite, see the Kite SDK Development Guide.
Kite has been rebased to version 0.15.0 in CDH 5.2.0, from the base version 0.10.0 in CDH 5.1. kite-morphlines
modules are backward-compatible, but this change breaks backward-compatibility for the kite-data API.
Kite Data
The Kite data API has had substantial updates since the version included in CDH 5.1.
Dataset URIs
Datasets are identified with a single URI, rather than a repository URI and dataset name. The dataset URI contains
all the information Kite needs to determine which implementation (Hive, HBase, or HDFS) to use for the dataset,
and includes both the dataset's name and a namespace.
The Kite API has been updated so that developers call methods in the Datasets utility class as they would use
DatasetRepository methods. The Datasets methods are recommended, and the DatasetRepository API
is deprecated.
Views
The Kite data API now allows you to select a view of the dataset by setting constraints. These constraints are
used by Kite to automatically prune unnecessary partitions and filter records.
Flume DatasetSink
The Flume DatasetSink has been updated for the kite-data API changes. It supports all previous configurations
without modification.
In addition, the DatasetSink now supports dataset URIs with the configuration option kite.dataset.uri.
Apache Mahout
Mahout jobs launched from the bin/mahout script now use the cluster's default parameters, rather than
hard-coded parameters from the library. This may change the algorithms' run-time behavior, possibly for the
better (MAHOUT-1565).
Apache Oozie
CDH 5.2 introduces the following important changes:
• A new Hive 2 Action allows Oozie to run HiveServer2 scripts. Using the Hive Action with HiveServer2 is now
deprecated; you should switch to the new Hive 2 Action as soon as possible.
• The MapReduce action can now also be configured by Java code.
This gives users the flexibility of using their own Java driver code to configure the MR job, while also getting
the advantages of the MapReduce action (instead of using the Java action). See the documentation for more
information.
• The PurgeService can now remove completed child jobs from long-running coordinator jobs.
• ALL can now be set for oozie.service.LiteWorkflowStoreService.user.retry.error.code.ext to
make Oozie retry actions automatically for every type of error (see the sketch after this list).
• All Oozie servers in an Oozie HA group now synchronize on the same randomly generated rolling secret for
signing auth tokens.
• You can now upgrade from CDH 4.x to CDH 5.2 and later with jobs in RUNNING and SUSPENDED states. (An
upgrade from CDH 4.x to a CDH 5.x release earlier than CDH 5.2.0 would still require that no jobs be in either
of those states).
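As a sketch of the retry setting above, the oozie-site.xml entry might look like:

<property>
  <name>oozie.service.LiteWorkflowStoreService.user.retry.error.code.ext</name>
  <value>ALL</value> <!-- retry actions automatically for every type of error -->
</property>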
• Search adds support for multi-threaded faceting on fields. This enables parallelizing operations, allowing
them to run more quickly on highly concurrent hardware. This is especially helpful in cases where faceting
operations apply to large datasets over many fields.
• Search adds support for distributed pivot faceting, enabling faceting on multi-shard collections.
Apache Sentry (incubating)
CDH 5.2 introduces the following changes to Sentry.
Sentry Service:
• If you are using the database-backed Sentry service, upgrading from CDH 5.1 to CDH 5.2 will require a schema
upgrade.
• Hive SQL Syntax:
– GRANT and REVOKE statements have been expanded to include WITH GRANT OPTION, thus allowing you
to delegate granting and revoking privileges.
– The SHOW GRANT ROLE command has been updated to allow non-admin users to list grants for roles that
are currently assigned to them.
– The SHOW ROLE GRANT GROUP <groupName> command has been updated to allow non-admin users
that are part of the group specified by <groupName> to list all roles assigned to this group.
Apache Spark
CDH 5.2 Spark is rebased on Apache Spark/Streaming 1.1 and provides the following new capabilities:
• Stability and performance improvements.
• New sort-based shuffle implementation (disabled by default).
• Better performance monitoring through the Spark UI.
• Support for arbitrary Hadoop InputFormats in PySpark.
• Improved YARN support with several bug fixes.
Apache Sqoop
CDH 5.2 Sqoop 1 is rebased on Sqoop 1.4.5 and includes the following changes:
• Mainframe connector added.
• Parquet support added.
There are no changes for Sqoop 2.
Note:
There is no CDH 5.1.1 release. This skip in the CDH 5.x sequence allows the CDH and CM components
of Cloudera Enterprise 5.1.2 to have consistent numbering.
• HDFS-6788
• HDFS-6825
• HUE-2211
• HUE-2223
• HUE-2232
• HIVE-5515
• HIVE-6495
• HIVE-7450
• IMPALA-1093
• IMPALA-1107
• IMPALA-1131
• IMPALA-1142
• IMPALA-1149
• MAPREDUCE-5966
• MAPREDUCE-5979
• MAPREDUCE-6012
• OOZIE-1920
• PARQUET-19
• SENTRY-363
• YARN-2273
• YARN-2274
• YARN-2313
• YARN-2352
• YARN-2359
uid 10 100 # Map the remote UID 10 to the local UID 100
gid 11 101 # Map the remote GID 11 to the local GID 101
Note:
After initially registering with the system portmap as root, the NFS Gateway drops privileges
and runs as a regular user.
– You can now pass the -size-in-bytes flag to print the size of snapshot files in bytes rather than the
default human-readable format.
– The size of each snapshot file in bytes is checked against the size reported in the manifest, and if the two
sizes differ, the tool reports the file as corrupt.
• A new -target option for ExportSnapshot allows you to specify a different name for the target cluster
from the snapshot name on the source cluster.
In addition, Cloudera has fixed some binary incompatibilities between HBase 0.96 and 0.98. As a result, the
incompatibilities introduced by HBASE-10452 and HBASE-10339 do not affect CDH 5.1 HBase, as explained
below:
• HBASE-10452 introduced a new exception and error message in setTimeStamp(), for an extremely unlikely
event where getting a TimeRange could fail because of an integer overflow. CDH 5.1 suppresses the
new exception to retain compatibility with HBase 0.96, but logs the error.
• HBASE-10339 contained code which inadvertently changed the signatures of the getFamilyMap method.
CDH 5.1 restores these signatures to those used in HBase 0.96, to retain compatibility.
Apache Hive
• Permission inheritance fixes
• Support for decimal computation, and for reading and writing decimal-format data from and to Parquet and
Avro
Hue
CDH 5.1.0 implements Hue 3.6.
New Features:
• Search App v2:
– 100% Dynamic dashboard
– Drag-and-Drop dashboard builder
– Text, Timeline, Pie, Line, Bar, Map, Filters, Grid and HTML widgets
– Solr Index creation wizard (from a file)
• Ability to view compressed Snappy, Avro and Parquet files
• Impala HA
• Close Impala and Hive sessions, queries, and commands
Apache Mahout
• CDH 5.1.0 implements Mahout 0.9.
See also Apache Mahout Incompatible Changes on page 74.
Apache Oozie
• You can now submit Sqoop jobs from the Oozie command line.
• LAST_ONLY execution mode now works correctly (OOZIE-1319).
Cloudera Search
New Features:
• A Quick Start script that automates using Search to query data from the Enron Email dataset. The script
downloads the data, expands it, moves it to HDFS, indexes it, and pushes the results live. The documentation
now also includes a companion quick start guide, which describes the tasks the script completes, as well as
customization options.
• Solrctl now has built-in support for schema-less Solr.
• Sentry-based document-level security for role-based access control of a collection. Document-level access
control associates authorization tokens with each document in the collection, enabling granting Sentry roles
access to sets of documents in a collection.
• Cloudera Search includes a version of Kite 0.10.0 that includes morphlines-related backports of all fixes
and features in Kite 0.15.0. For additional information on Kite, see:
– Kite repository
– Kite Release Notes
– Kite documentation
– Kite examples
• Support for the Parquet file format is included with this version of Kite 0.10.0.
• Inclusion of hbase-indexer-1.5.1, a new version of the Lily HBase Indexer. This version of the indexer
includes the Kite 0.10.0 version described above, with its backports of fixes and features from Kite 0.15.0.
Apache Sentry (incubating)
• CDH 5.1.0 implements Sentry 1.2. This includes a database-backed Sentry service which uses the more
traditional GRANT/REVOKE statements instead of the previous policy file approach, making it easier to
maintain and modify privileges.
• Revised authorization privilege model for Hive and Impala.
Apache Spark
• CDH 5.1.0 implements Spark 1.0.
• The spark-submit command abstracts across the variety of deployment modes that Spark supports and
takes care of assembling the classpath for you.
• Application History Server (SparkHistoryServer) improves monitoring capabilities.
• You can launch PySpark applications against YARN clusters. PySpark currently only works in YARN Client
mode.
Other improvements include:
• Streaming integration with Kerberos
• Addition of more algorithms to MLlib (sparse vector support)
• Improvements to Avro integration
• Spark SQL alpha release (new SQL engine). Spark SQL allows you to run SQL statements inside a Spark
application that manipulate and produce RDDs.
Note:
Because of its immaturity and alpha status, Cloudera does not currently offer commercial support
for Spark SQL, but bundles it with our distribution so that you can try it out.
• HDFS-6077
• HDFS-6340
• HDFS-6475
• HDFS-6510
• HDFS-6527
• HDFS-6563
• HUE-1928
• HUE-2184
• HUE-2085
• HUE-2192
• HUE-2193
• OOZIE-1621
• OOZIE-1890
• OOZIE-1907
• SOLR-5593
• SOLR-5915
• SOLR-6161
• YARN-1550
• YARN-2155
• OOZIE-1794 - java-opts and java-opt in the Java action don't always work properly in YARN
• SOLR-5608 - Frequently reproducible failures in CollectionsAPIDistributedZkTest#testDistribSearch
• YARN-1924 - STATE_STORE_OP_FAILED happens when ZKRMStateStore tries to update app(attempt) before
storing it
Enabling SSL in CDH 5: Enabling HTTPS communication in CDH 5 requires extra configuration properties to be
added to YARN (yarn-site.xml and mapred-site.xml) and HDFS (hdfs-site.xml), in addition to the existing
configuration settings described here.
{
"name" : "hadoop:service=Master,name=Master",
"modelerType" : "org.apache.hadoop.hbase.master.MXBeanImpl",
"ZookeeperQuorum" : "localhost:2181",
....
"RegionsInTransition" : [ ],
"RegionServers" : [ {
"key" : "localhost,48346,1390857257246",
"value" : {
"load" : 2,
....
CDH 5 Beta 1 and Beta 2 did not contain this list; they only displayed counts of the number of live and dead
RegionServers. As of CDH 5.0.0, this list is presented in a semicolon-separated field, as follows:
{
"name" : "Hadoop:service=HBase,name=Master,sub=Server",
"modelerType" : "Master,sub=Server",
"tag.Context" : "master",
"tag.liveRegionServers" : "localhost,56196,1391992019130",
"tag.deadRegionServers" :
"localhost,40010,1391035309673;localhost,41408,1391990380724;localhost,38682,1390950017735",
...
}
Apache Oozie
As of CDH 5.0.0, Oozie includes a glob pattern feature (OOZIE-1471), allowing you to use wildcards in moves in
the FS action. For example:
<fs name="archive-files">
<move source="hdfs://namenode/output/*"
target="hdfs://namenode/archive" />
<ok to="next"/>
<error to="fail"/>
</fs>
By default, up to 1000 files can be matched; you can change this default by means of the
oozie.action.fs.glob.max parameter.
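For example, to raise the limit, an oozie-site.xml entry might look like the following; the value shown is
illustrative:

<property>
  <name>oozie.action.fs.glob.max</name>
  <value>2000</value> <!-- illustrative; the default is 1000 -->
</property>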
Cloudera Search
• Cloudera Search includes a version of Kite 0.10.0, which includes backports of all fixes and features in Kite
0.12.0. For additional information on Kite, see:
– Kite repository
– Kite Release Notes
– Kite documentation
– Kite examples
Apache Flume
• FLUME-2294 - Added a new sink to write Kite datasets.
• FLUME-2056 - Spooling Directory Source can now pass just the name of the file in the event headers.
• FLUME-2155 - File Channel is indexed during replay to improve replay performance for faster startup.
• FLUME-2217 - Syslog Sources can optionally preserve all syslog headers in the message body.
• FLUME-2052 - Spooling Directory Source can now replace or ignore malformed characters in input files.
Apache Hadoop
HDFS
New Features/Improvements:
• As of CDH 5 Beta 2, you can upgrade HDFS with high availability (HA) enabled, if you are using Quorum-based
storage. (Quorum-based storage is the only method available in CDH 5; NFS shared storage is not supported.)
For upgrade instructions, see Upgrading from CDH 4 to CDH 5.
• HDFS-4949 - CDH 5 Beta 2 supports Configuring Centralized Cache Management in HDFS.
• As of CDH 5 Beta 2, you can configure an NFSv3 gateway that allows any NFSv3-compatible client to mount
HDFS as a file system on the client's local file system. For more information and instructions, see Configuring
an NFSv3 Gateway Using the Command Line.
• HDFS-5709 - Improve upgrade with existing files and directories named .snapshot.
Major Bug Fixes:
• HDFS-5449 - Fix WebHDFS compatibility break.
• HDFS-5671 - Fix socket leak in DFSInputStream#getBlockReader.
• HDFS-5353 - Short circuit reads fail when dfs.encrypt.data.transfer is enabled.
• HDFS-5438 - Flaws in block report processing can cause data loss.
Changed Behavior:
• As of CDH 5 Beta 2, in order for the NameNode to start up on a secure cluster, you should have the
dfs.web.authentication.kerberos.principal property defined in hdfs-site.xml. This has been
documented in the CDH 5 Security Guide. For clusters managed by Cloudera Manager, you do not need to
explicitly define this property.
• HDFS-5037 - Active NameNode should trigger its own edit log rolls.
• Clients will now retry for a configurable period when encountering a NameNode in Safe Mode.
• The default behavior of the mkdir command has changed. As of CDH 5 Beta 2, if the parent folder does not
exist, the -p switch must be explicitly specified; otherwise the command fails.
MapReduce (MRv1 and YARN)
• Fair Scheduler (in YARN and MRv1) now supports advanced configuration to automatically place applications
in queues.
• MapReduce now supports running multiple reducers in uber mode and in local job runner.
Apache HBase
• Online Schema Change is now a supported feature.
• Online Region Merge is now a supported feature.
• Namespaces: CDH 5 Beta 2 includes the namespaces feature, which enables different sets of tables to be
administered by different administrative users. All upgraded tables will live in the default namespace.
Administrators may create new namespaces and create tables within them; users with rights to a namespace
may administer permissions on the tables within that namespace.
• There have been several improvements to HBase's mean time to recovery (MTTR) in the face of Master or
RegionServer failures.
– Distributed log splitting has matured and is always activated. The option to use the older, slower splitting
mechanism no longer exists.
– Failure detection time has been improved. New notifications are now sent when RegionServers or Masters
fail, which triggers corrective action quickly.
– The Meta table has a dedicated write-ahead log, which enables faster region recovery if the
RegionServer serving meta goes down.
• The Region Balancer has been significantly updated to take more load attributes into account.
• Added TableSnapshotInputFormat and TableSnapshotScanner to perform scans over HBase table snapshots
from the client side, bypassing the HBase servers. The former configures a MapReduce job, while the latter
does a single client-side scan over snapshot files. Both can also be used with offline HBase with in-place or
exported snapshot files.
• The KeyValue API has been deprecated for applications in favor of the Cell interface. Users upgrading to
HBase 0.96 may still use KeyValue, but future upgrades may remove the class or parts of its functionality.
Users are encouraged to update their applications to use the new Cell interface.
• Currently Experimental features:
– Distributed log replay: This mechanism allows for faster recovery from RegionServer failures but has one
special case where it will violate ACID guarantees. Cloudera does not currently recommend activating this
feature.
– Bucket cache: This is an off-heap caching mechanism that uses extra RAM and block devices (such as flash
drives) to greatly increase the read caching capabilities provided by the BlockCache. Cloudera does not
currently recommend activating this feature.
– Favored nodes: This feature enables HBase to better control where its data is written to in HDFS in order
to better preserve performance after a failure. This is disabled currently because it doesn’t interact well
with the HBase Balancer or HDFS Balancer. Cloudera does not currently recommend activating this feature.
Hue
• Hue has been upgraded to version 3.5.0.
• The Impala and Hive Editors are now one-page apps. The Editor, Progress, Table list, and Results are all on the
same page.
• Result graphing for the Hive and Impala Editors.
• Editor and Dashboard for Oozie SLA, crontab and credentials.
• The Sqoop2 app supports autocomplete of database and table names/fields.
• DBQuery App: MySQL and PostgreSQL Query Editors.
• New Search feature: Graphical facets
• Integrate external Web applications in any language. See this blog post for more details.
• Create Hive tables and load quoted CSV data. Tutorial available here.
• Submit any Oozie jobs directly from HDFS. Tutorial available here
• New SAML backend enables single sign-on (SSO) with Hue.
Apache Oozie
• Oozie now supports cron-style scheduling capability.
• Oozie now supports High Availability with security.
Apache Pig
• AvroStorage rewritten for better performance, and moved from piggybank to core Pig
• ASSERT, IN, and CASE operators added
• ParquetStorage added for integration with Parquet
Cloudera Search
• The Cloudera CDK has been renamed and updated to Kite version 0.11.0. For additional information on Kite,
see:
– Kite repository
– Kite Release Notes
– Kite documentation
– Kite examples
Changed Behavior:
• HDFS-4645: Move from randomly generated block ID to sequentially generated block ID.
• HDFS-4451: HDFS balancer command returns exit code 1 on success instead of 0.
MapReduce v2 (YARN)
New Features:
• ResourceManager High Availability: YARN now allows you to use multiple ResourceManagers so that there
is no single point of failure. In-flight jobs are recovered without re-running completed tasks.
• Monitoring and enforcing memory and CPU-based resource utilization using cgroups.
• Continuous Scheduling: This feature decouples scheduling from the node heartbeats for improved performance
in large clusters.
Changed Feature:
• ResourceManager Restart: Persistent implementations of the RMStateStore (filesystem-based and
ZooKeeper-based) allow recovery of in-flight jobs.
Apache HBase
Administrative Features
ProtoBuf: All of the serialization that goes across the wire between servers, and all data written to and read
by HBase file formats, has been converted to extensible Protobuf encodings. This breaks compatibility with
previous versions but should make future extensions less likely to break compatibility in these areas. This
feature is enabled by default.
• HBASE-5305: Improve cross-version compatibility and upgradeability.
• HBASE-7898: Serializing cells over RPC.
Namespaces: Namespaces is a new feature that groups tables into different administrative domains. An admin
can be given rights to act upon only a particular namespace. This feature is enabled by default and requires
file system layout changes that must be completed during upgrade.
• HBASE-8015: Added support for namespaces.
MTTR Improvements: Mean time to recovery has greatly improved.
• HBASE-7590: “Costless” notifications from master to rs/clients.
• HBASE-7213 / HBASE-8631: New .meta suffix to separate HLog file / Recover Meta before other regions in
case of server crash.
• HBASE-7006: Distributed log replay (Caveat).
• HBASE-9116: Adds a view/edit tool for favored node mappings for regions (incomplete, likely a dot version).
Metrics: There are several new metrics and a new naming convention for metrics in HBase. This also includes
metrics for each region.
• HBASE-3614: Per region metrics.
• HBASE-4050: Rationalize metrics; Update HBase metrics framework to metrics2.
Miscellaneous:
• HBASE-7403: HBase online region merge.
• Shell improvements; the tables listing is more well-rounded.
• HBASE-5953: Expose the current state of the balancerSwitch.
• HBASE-5934: Add the ability for Performance Evaluation to set table compression.
• HBASE-6135: New Web UI.
• HBASE-8148: Allow IPC to bind to a specific address (also 0.94.7).
• HBASE-5498: Secure Bulk Load (also 0.94.5).
HBase Proxies
The REST server now supports Hadoop authentication and authorization mechanisms. The Avro gateway has
been removed while the Thrift2 proxy has made progress but is not complete. However, it has been included as
a preview feature.
REST:
• HBASE-9347: Support for specifying filter in REST server requests.
• HBASE-7803: Support caching on scan.
• HBASE-7757: Add Web UI for Thrift and REST servers.
• HBASE-5050: SPNEGO-based authentication.
• HBASE-8661: Support REST over HTTPS.
• HBASE-8662: Support for impersonation.
• HBASE-7986: [REST] Make HTablePool size configurable.
Thrift:
• HBASE-5879: Enable JMX metrics collection for the Thrift proxy.
Thrift2: Ongoing efforts to match Thrift and REST functionality. (Incomplete, only a preview feature)
Avro:
• HBASE-5948: Avro gateway removed.
Stability Features
There have been several bug fixes, test fixes and configuration default changes that greatly increase our confidence
in the stability of the 0.96.0 release. The main improvement comes from the use of a systematic fault-injection
framework.
• HBASE-7721: Atomic multi-row mutations in META.
• Integration testing.
• HBASE-7977: TableLocks.
• HBASE-7898: Many flaky tests hardened.
Performance Features
Several features have been added to improve throughput and performance characteristics of HBase and its
clients.
Warning:
Currently, the 0.95.2/CDH 5 beta 1 release suffers performance degradation, compared to CDH 4, when
more than 40 nodes are used.
Throughput:
• HBASE-4676: Prefix compression / tree encoding.
• HBASE-8334: Essential column families on by default (filtering optimization).
• HBASE-5074 / HBASE-8322: Re-enable HBase checksums by default.
• HBASE-6466: Enable multi-threaded memstore flush.
• HBASE-6783: Make short circuit read the default.
Predictable Performance:
• HBASE-5959: Added a Stochastic LoadBalancer.
• HBASE-7842: Exploring compactor.
• HBASE-7236: Add per-table/per-CF configuration via metadata.
• HBASE-8163: MemStoreChunkPool: Improvement for Java GC.
• HBASE-4391 / HBASE-6567: Mlock / memory locking improvements (less disk swap).
• HBASE-4391: Bucket cache (untested).
Miscellaneous:
• HBASE-6870: Improvement to HTable coprocessorExec scan performance.
Developer Features
These features are to aid application developers or for major changes that will enable future minor version
improvements.
• HBASE-9121: HTrace updates.
• HBASE-8375: Durability setting per table.
– HBASE-7801: Deferred sync for WAL logs (0.94.7 and later).
• HBASE-7897: Tags supported in cell interface (for future security features).
• HBASE-5937: Refactor HLog into interface (allows for new HLogs in 0.96.x).
• HBASE-4336: Modularization of POM / Multiple jars (many follow-ons, HBASE-7898).
• HBASE-8224: Publish -hadoop1 and -hadoop2 versioned jars to Maven (CDH published jars are assumed
-hadoop2).
• HBASE-9164: Move towards Cell interface in client instead of KeyValue.
• HBASE-7898: Serializing cells over RPC.
• HBASE-7725: Add ability to create custom compaction request.
Hue
New Features:
• With the Sqoop 2 application, data can be easily imported from databases into HDFS, or exported from HDFS
to databases, in a scalable manner. The Job Wizard hides the complexity of creating Sqoop jobs, and the
dashboard offers live progress and log access.
• ZooKeeper App: Navigate and browse the znode hierarchy and content of a ZooKeeper cluster. Znodes can
be added, deleted, and edited. Multiple clusters are supported, and various statistics are available for them.
• The Hue Shell application has been removed and replaced by the Pig Editor, HBase Browser and the Sqoop
1 apps.
• Python 2.6 is required.
• Beeswax daemon has been replaced by HiveServer2.
• CDH 5 Hue will only work with HiveServer2 from CDH 5. No support for impersonation.
Hue also includes the following changed features (Updated to upstream version 3.0.0):
• [HUE-897] - [core] Redesign of the overall layout
• [HUE-1521] - [core] Improve JobTracker High Availability
• [HUE-1493] - [beeswax] Replace the Beeswax server with HiveServer2
• [HUE-1474] - [core] Upgrade Django backend version from 1.2 to 1.4
• [HUE-1506] - [search] Impersonation support added
• [HUE-1475] - [core] Switch back from the Spawning web server
• [HUE-917] - Support SAML based authentication to enable single sign-on (SSO)
From master:
• [HUE-950] - [core] Improvements to the document model
• [HUE-1595] - Integrate Metastore data into Hive and Impala Query UIs
• [HUE-1275] - [metastore] Show Metastore table details
• [HUE-1622] - [core] Mini tour added to Hue home page
Apache Hive and HCatalog
New Features (Updated to upstream version 0.11.0):
• [HIVE-446] - Implement TRUNCATE for table data
• [HIVE-896] - Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive (see the sketch after this list)
• [HIVE-2693] - Add DECIMAL data type
• [HIVE-3834] - Support ALTER VIEW AS SELECT in Hive
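For illustration, a minimal sketch of the TRUNCATE, DECIMAL, and windowing additions; the sales table and its
columns are hypothetical, not part of the release:
-- Hypothetical table; DECIMAL (HIVE-2693) holds exact numeric values.
CREATE TABLE sales (region STRING, sale_date STRING, amount DECIMAL);
-- Windowing (HIVE-896): compare each sale to the previous one per region.
SELECT region, sale_date, amount,
LAG(amount) OVER (PARTITION BY region ORDER BY sale_date) AS prev_amount
FROM sales;
-- TRUNCATE (HIVE-446): remove all rows but keep the table definition.
TRUNCATE TABLE sales;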
Performance improvements (from 0.12):
• [HIVE-3764] - Support metastore version consistency check
• [HIVE-305] - Port Hadoop streaming process's counters/status reporters to Hive Transforms
• [HIVE-1402] - Add parallel ORDER BY to Hive
• [HIVE-2206] - Add a new optimizer for query correlation discovery and optimization
• [HIVE-2517] - Support GROUP BY on struct type
• [HIVE-2655] - Ability to define functions in HQL
• [HIVE-4911] - Enable QOP configuration for HiveServer2 Thrift transport
Cloudera Impala
Cloudera Impala 1.2.0 is now available as part of CDH 5. For more details on Impala, refer to the Impala
Documentation.
Llama
Llama is a system that mediates resource management between Cloudera Impala and Hadoop YARN. Llama
enables Impala to reserve, use, and release resource allocations in a Hadoop cluster. Llama is only required if
resource management is enabled in Impala.
See Managing the Impala Llama ApplicationMaster for more information.
Apache Mahout
New Features (Updated to Mahout 0.8):
• Numerous performance improvements to Vector and Matrix implementations, APIs and their iterators (see
also MAHOUT-1192, MAHOUT-1202)
• Numerous performance improvements to the recommender implementations (see also MAHOUT-1272,
MAHOUT-1035, MAHOUT-1042, MAHOUT-1151, MAHOUT-1166, MAHOUT-1167, MAHOUT-1169, MAHOUT-1205,
MAHOUT-1264)
• MAHOUT-1088: Support for biased item-based recommender.
• MAHOUT-1089: SGD matrix factorization for rating prediction with user and item biases.
• MAHOUT-1106: Support for SVD++
• MAHOUT-944: Support for converting one or more Lucene storage indexes to SequenceFiles as well as an
upgrade of the supported Lucene version to Lucene 4.3.
• MAHOUT-1154 and related: New streaming k-means implementation that offers online (and fast) clustering.
• MAHOUT-833: Make conversion to SequenceFiles Map-Reduce. 'seqdirectory' can now be run as a
MapReduce job.
• MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or
values).
• MAHOUT-884: Matrix concatenate utility; presently only concatenates two matrices.
Apache Oozie
New Features:
• Updated to Oozie 4.0.0.
• High Availability: Multiple Oozie servers can now be utilized to provide an HA Oozie service as well as provide
horizontal scalability. See upstream documentation for more details.
• HCatalog Integration: HCatalog table partitions can now be used as data dependencies in coordinators. See
upstream documentation for more details.
• SLA Monitoring: Oozie can now actively monitor SLA-sensitive jobs and send out notifications for SLA meets
and misses. SLA information is also now available through a new SLA tab in the Oozie Web UI, JMS messages,
and a REST API. See upstream documentation.
• JMS Notifications: Oozie can now publish notifications to a JMS Provider about job status changes and SLA
events. See upstream documentation.
• The FileSystem action can now use glob patterns for file paths when doing move, delete, chmod, and chgrp.
Cloudera Search
Cloudera Search 1.0.0 is now available as part of CDH 5. For more details on Search see the Search documentation.
The Cloudera Development Kit (CDK) is a set of libraries and tools that can be used with Search and other CDH
components to build jobs/systems on top of the Hadoop ecosystem. See the CDK Documentation and Release
Notes for more details.
Note: An existing dependency, Apache Tika, has been upgraded to version 1.4.
Note: The Impala 2.2.x maintenance releases now use the CDH 5.4.x numbering system rather than
increasing the Impala version numbers. Impala 2.2 and higher are not available under CDH 4.
Note: Impala 2.2.0 is available as part of CDH 5.4.0 and is not available for CDH 4. Cloudera does not
intend to release future versions of Impala for CDH 4, other than patch and maintenance releases if
required. Given the upcoming end-of-maintenance for CDH 4, Cloudera recommends that all customers
migrate to a recent CDH 5 release.
The following are the major new features in Impala 2.2.0. This major release, available as part of CDH 5.4.0,
contains improvements to performance, manageability, security, and SQL syntax.
• Several improvements to date and time features enable higher interoperability with Hive and other database
systems, provide more flexibility for handling time zones, and future-proof the handling of TIMESTAMP values:
– Startup flags for the impalad daemon enable a higher level of compatibility with TIMESTAMP values
written by Hive, and more flexibility for working with date and time data using the local time zone instead
of UTC. To enable these features, set the impalad startup flags
-use_local_tz_for_unix_timestamp_conversions=true and
-convert_legacy_hive_parquet_utc_timestamps=true.
• The SHOW FILES statement lets you view the names and sizes of the files that make up an entire table or a
specific partition (a brief sketch appears after this list). See SHOW FILES Statement for details.
• Impala can now run queries against Parquet data containing columns with composite or nested types, as
long as the query only refers to columns with scalar types.
• Performance improvements for queries that include IN() operators and involve partitioned tables.
• The new -max_log_files configuration option specifies how many log files to keep at each severity level.
The default value is 10, meaning that Impala preserves the latest 10 log files for each severity level (INFO,
WARNING, and ERROR) for each Impala-related daemon (impalad, statestored, and catalogd). Impala checks
to see if any old logs need to be removed based on the interval specified in the logbufsecs setting, every 5
seconds by default. See Rotating Impala Logs for details.
• Redaction of sensitive data from Impala log files. This feature protects details such as credit card numbers
or tax IDs from administrators who see the text of SQL statements in the course of monitoring and
troubleshooting a Hadoop cluster. See Redacting Sensitive Information from Impala Log Files for background
information for Impala users, and Sensitive Data Redaction for usage details.
• Lineage information is available for data created or queried by Impala. This feature lets you track who has
accessed data through Impala SQL statements, down to the level of specific columns, and how data has been
propagated between tables. See Viewing Lineage Information for Impala Data for background information
for Impala users, Impala Lineage Properties for usage details, and Lineage Diagrams for how to interpret the
lineage information.
• Impala tables and partitions can now be located on the Amazon Simple Storage Service (S3) filesystem, for
convenience in cases where data is already located in S3 and you prefer to query it in-place. Queries might
have lower performance than when the data files reside on HDFS, because Impala uses some HDFS-specific
optimizations. Impala can query data in S3, but cannot write to S3. Therefore, statements such as INSERT
and LOAD DATA are not available when the destination table or partition is in S3. See Using Impala to Query
the Amazon S3 Filesystem (Unsupported Preview) for details.
Important:
Impala query support for Amazon S3 is included in CDH 5.4.0, but is not currently supported or
recommended for production use. If you're interested in this feature, try it out in a test environment
until we address the issues and limitations needed for production-readiness.
• Improved support for HDFS encryption. The LOAD DATA statement now works when the source directory
and destination table are in different encryption zones.
• Additional arithmetic function mod(). See Impala Mathematical Functions for details.
• Flexibility to interpret TIMESTAMP values using the UTC time zone (the traditional Impala behavior) or using
the local time zone (for compatibility with TIMESTAMP values produced by Hive).
• Enhanced support for ETL using tools such as Flume. Impala ignores temporary files typically produced by
these tools (filenames with suffixes .copying and .tmp).
• The CPU requirement for Impala, which had become more restrictive in Impala 2.0.x and 2.1.x, has now been
relaxed.
The prerequisite for CPU architecture has been relaxed in Impala 2.2.0 and higher. From this release onward,
Impala works on CPUs that have the SSSE3 instruction set. The SSE4 instruction set is no longer required.
This relaxed requirement simplifies the upgrade planning from Impala 1.x releases, which also worked on
SSSE3-enabled processors.
• Enhanced support for CHAR and VARCHAR types in the COMPUTE STATS statement.
• The amount of memory required during setup for “spill to disk” operations is greatly reduced. This enhancement
reduces the chance of a memory-intensive join or aggregation query failing with an out-of-memory error.
• Several new conditional functions provide enhanced compatibility when porting code that uses industry
extensions. The new functions are: isfalse(), isnotfalse(), isnottrue(), istrue(), notnullvalue(),
and nullvalue(). See Impala Conditional Functions for details.
• The Impala debug web UI can now display a visual representation of the query plan. On the /queries tab,
select Details for a particular query. The Details page includes a Plan tab with a plan diagram that you can
zoom in or out on (using scroll gestures through the mouse wheel or trackpad).
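As a brief, non-authoritative sketch of a few of the SQL additions above (the logs table and its partition columns
are hypothetical):
-- List the files behind one partition of a partitioned table.
SHOW FILES IN logs PARTITION (year=2015, month=4);
-- New arithmetic and conditional functions.
SELECT mod(10, 3);                      -- returns 1
SELECT istrue(1 < 2), nullvalue(NULL);  -- returns true, true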
Note: Impala 2.1.3 is available as part of CDH 5.3.3, not under CDH 4.
Note: Impala 2.1.2 is available as part of CDH 5.3.2, not under CDH 4.
Note: Impala 2.0.4 is available as part of CDH 5.2.5, not under CDH 4.
Note: Impala 2.0.3 is available as part of CDH 5.2.4, not under CDH 4.
Note: Impala 2.0.2 is available as part of CDH 5.2.3, not under CDH 4.
• Queries with joins or aggregation functions involving high volumes of data can now use temporary work
areas on disk, reducing the chance of failure due to out-of-memory errors. When the required memory for
the intermediate result set exceeds the amount available on a particular node, the query automatically uses
a temporary work area on disk. This “spill to disk” mechanism is similar to the ORDER BY improvement from
Impala 1.4. For details, see SQL Operations that Spill to Disk.
• Subquery enhancements:
• Subqueries are now allowed in the WHERE clause, for example with the IN operator.
• The EXISTS and NOT EXISTS operators are available. They are always used in conjunction with subqueries.
• The IN and NOT IN operators can now operate on the result set from a subquery, not just a hardcoded list
of values.
• Uncorrelated subqueries let you compare against one or more values for equality, IN, and EXISTS
comparisons. For example, you might use WHERE clauses such as WHERE column = (SELECT
MAX(some_other_column) FROM table) or WHERE column IN (SELECT some_other_column FROM
table WHERE conditions).
• Correlated subqueries let you cross-reference values from the outer query block and the subquery.
• Scalar subqueries let you substitute the result of single-value aggregate functions such as MAX(), MIN(),
COUNT(), or AVG(), where you would normally use a numeric value in a WHERE clause.
For details about subqueries, see Subqueries. For information about new and improved operators, see EXISTS
Operator and IN Operator. A brief sketch of the subquery, join, and analytic enhancements appears at the
end of this list.
• Analytic functions such as RANK(), LAG(), LEAD(), and FIRST_VALUE() let you analyze sequences of rows
with flexible ordering and grouping. Existing aggregate functions such as MAX(), SUM(), and COUNT() can
also be used in an analytic context. See Impala Analytic Functions for details. See Impala Aggregate Functions
for enhancements to existing aggregate functions.
• New data types provide greater compatibility with source code from traditional database systems:
– VARCHAR is like the STRING data type, but with a maximum length. See VARCHAR Data Type (CDH 5.2 or
higher only) for details.
– CHAR is like the STRING data type, but with a precise length. Short values are padded with spaces on the
right. See CHAR Data Type (CDH 5.2 or higher only) for details.
• Security enhancements:
• Formerly, Impala was restricted to using either Kerberos or LDAP / Active Directory authentication within
a cluster. Now, Impala can freely accept either kind of authentication request, allowing you to set up some
hosts with Kerberos authentication and others with LDAP or Active Directory. See Using Multiple
Authentication Methods with Impala for details.
• GRANT statement. See GRANT Statement (CDH 5.2 or higher only) for details.
• REVOKE statement. See REVOKE Statement (CDH 5.2 or higher only) for details.
• CREATE ROLE statement. See CREATE ROLE Statement (CDH 5.2 or higher only) for details.
• DROP ROLE statement. See DROP ROLE Statement (CDH 5.2 or higher only) for details.
• SHOW ROLES and SHOW ROLE GRANT statements. See SHOW Statement for details.
• To complement the HDFS encryption feature, a new Impala configuration option,
--disk_spill_encryption secures sensitive data from being observed or tampered with when
temporarily stored on disk.
The new security-related SQL statements work along with the Sentry authorization framework (a short
sketch appears at the end of this list). See Enabling Sentry Authorization for Impala for details.
• Impala can now read text files compressed by gzip, bzip2, or Snappy. These files do not require any special
table settings to work in an Impala text table. Impala recognizes the compression type automatically based
on the file extensions .gz, .bz2, and .snappy, respectively. These types of compressed text files are intended
for convenience with existing ETL pipelines; their non-splittable nature means they are not optimal for
high-performance parallel queries. See Using gzip, bzip2, or Snappy-Compressed Text Files for details.
• Query hints can now use comment notation, /* +hint_name */ or -- +hint_name, at the same places in
the query where the hints enclosed by [ ] are recognized. This enhancement makes it easier to reuse Impala
queries on other database systems. See Hints for details.
• A new query option, QUERY_TIMEOUT_S, lets you specify a timeout period in seconds for individual queries.
The working of the --idle_query_timeout configuration option is extended. If no QUERY_TIMEOUT_S query
option is in effect, --idle_query_timeout works the same as before, setting the timeout interval. When
the QUERY_TIMEOUT_S query option is specified, its maximum value is capped by the value of the
--idle_query_timeout option.
That is, the system administrator sets the default and maximum timeout through the --idle_query_timeout
startup option, and then individual users or applications can set a lower timeout value if desired through the
QUERY_TIMEOUT_S query option. See Setting Timeout Periods for Daemons, Queries, and Sessions and
QUERY_TIMEOUT_S Query Option for details.
• New functions VAR_SAMP() and VAR_POP() are aliases for the existing VARIANCE_SAMP() and
VARIANCE_POP() functions.
• A new date and time function, DATE_PART(), provides similar functionality to EXTRACT(). You can also call
the EXTRACT() function using the SQL-99 syntax, EXTRACT(unit FROM timestamp). These enhancements
simplify the porting process for date-related code from other systems. See Impala Date and Time Functions
for details.
• New approximation features provide a fast way to get results when absolute precision is not required:
– The APPX_COUNT_DISTINCT query option lets Impala rewrite COUNT(DISTINCT) calls to use NDV() instead,
which speeds up the operation and allows multiple COUNT(DISTINCT) operations in a single query. See
APPX_COUNT_DISTINCT Query Option for details.
– The APPX_MEDIAN() aggregate function produces an estimate for the median value of a column by using
sampling. See APPX_MEDIAN Function for details.
• Impala now supports a DECODE() function. This function works as a shorthand for a CASE expression, and
improves compatibility with SQL code containing vendor extensions. See Impala Conditional Functions for
details.
• The STDDEV(), STDDEV_POP(), STDDEV_SAMP(), VARIANCE(), VARIANCE_POP(), VARIANCE_SAMP(), and
NDV() aggregate functions now all return DOUBLE results rather than STRING. Formerly, you were required
to CAST() the result to a numeric type before using it in arithmetic operations.
• The default settings for Parquet block size, and the associated PARQUET_FILE_SIZE query option, are changed.
Now, Impala writes Parquet files with a size of 256 MB and an HDFS block size of 256 MB. Previously, Impala
attempted to write Parquet files with a size of 1 GB and an HDFS block size of 1 GB. In practice, Impala used
a conservative estimate of the disk space needed for each Parquet block, leading to files that were typically
512 MB anyway. Thus, this change will make the file size more accurate if you specify a value for the
PARQUET_FILE_SIZE query option. It also reduces the amount of memory reserved during INSERT into
Parquet tables, potentially avoiding out-of-memory errors and improving scalability when inserting data
into Parquet tables.
• Anti-joins are now supported, expressed using the LEFT ANTI JOIN and RIGHT ANTI JOIN clauses. These
clauses return results from one table that have no match in the other table. You might use this type of join
in the same sorts of use cases as the NOT EXISTS and NOT IN operators. See Joins for details.
• The SET command in impala-shell has been promoted to a real SQL statement. You can now set query
options such as PARQUET_FILE_SIZE, MEM_LIMIT, and SYNC_DDL within JDBC, ODBC, or any other kind of
application that submits SQL without going through the impala-shell interpreter. See SET Statement for
details.
• The impala-shell interpreter now reads settings from an optional configuration file, named
$HOME/.impalarc by default. See impala-shell Configuration Options for details.
• The library used for regular expression parsing has changed from Boost to Google RE2. This implementation
change adds support for non-greedy matches using the .*? notation. This and other changes in the way
regular expressions are interpreted means you might need to re-test queries that use functions such as
regexp_extract() or regexp_replace(), or operators such as REGEXP or RLIKE. See Cloudera Impala
Incompatible Changes on page 65 for those details.
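The following sketch illustrates the subquery, anti-join, and analytic enhancements described above; customers
and orders are hypothetical tables:
-- Correlated EXISTS subquery in the WHERE clause.
SELECT c_name FROM customers
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = customers.c_id);
-- Anti-join: customers with no orders at all.
SELECT c.c_name FROM customers c
LEFT ANTI JOIN orders o ON c.c_id = o.cust_id;
-- Analytic function: rank each customer's orders by total.
SELECT cust_id, total,
RANK() OVER (PARTITION BY cust_id ORDER BY total DESC) AS rnk
FROM orders;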
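Similarly, a short sketch of the new Sentry-related statements and the promoted SET statement; the role, group,
and table names here are hypothetical:
CREATE ROLE analyst;
GRANT ROLE analyst TO GROUP analysts;
GRANT SELECT ON TABLE orders TO ROLE analyst;
SHOW ROLE GRANT GROUP analysts;
REVOKE SELECT ON TABLE orders FROM ROLE analyst;
-- SET is now a real SQL statement, usable through JDBC or ODBC.
SET MEM_LIMIT=2g;
SET QUERY_TIMEOUT_S=60;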
Note: Impala 1.4.4 is available as part of CDH 5.1.5, not under CDH 4.
Note: Impala 1.4.3 is available as part of CDH 5.1.4, and under CDH 4.
Note: Impala 1.4.2 is only available as part of CDH 5.1.3, not under CDH 4.
– A new built-in function, ROUND(), rounds DECIMAL values to a specified number of fractional digits.
– Several built-in aggregate functions for computing properties for statistical distributions: STDDEV(),
STDDEV_SAMP(), STDDEV_POP(), VARIANCE(), VARIANCE_SAMP(), and VARIANCE_POP().
– Several new built-in functions, such as MAX_INT(), MIN_SMALLINT(), and so on, let you conveniently
check whether data values are in an expected range. You might be able to switch a column to a smaller
type, saving memory during processing.
– New built-in functions, IS_INF() and IS_NAN(), check for the special values infinity and “not a number”.
These values could be specified as inf or nan in text data files, or be produced by certain arithmetic
expressions.
• The SHOW PARTITIONS statement displays information about the structure of a partitioned table (see the sketch after this list).
• New configuration options for the impalad daemon let you specify initial memory usage for all queries. The
initial resource requests handled by Llama and YARN can be expanded later if needed, avoiding unnecessary
over-allocation and reducing the chance of out-of-memory conditions.
• Impala can take advantage of the Llama high availability feature in CDH 5.1, for improved reliability of resource
management through YARN.
• The Impala CREATE TABLE statement now has a STORED AS AVRO clause, allowing you to create Avro tables
through Impala.
• New impalad configuration options let you fine-tune the calculations Impala makes to estimate resource
requirements for each query. These options can help avoid problems due to overconsumption due to too-low
estimates, or underutilization due to too-high estimates.
• A new SUMMARY command in the impala-shell interpreter provides a high-level summary of the work
performed at each stage of the explain plan. The summary is also included in output from the PROFILE
command.
• Performance improvements for the COMPUTE STATS statement:
– The NDV function is sped up through native code generation.
– Because the NULL count is not currently used by the Impala query planner, in Impala 1.4.0 and higher,
COMPUTE STATS does not count the NULL values for each column. (The #Nulls field of the stats table is
left as -1, signifying that the value is unknown.)
• Performance improvements for partition pruning. This feature reduces the time spent in query planning for
partitioned tables with thousands of partitions. Previously, Impala typically queried tables with up to
approximately 3000 partitions. With the performance improvement in partition pruning, Impala can now
comfortably handle tables with tens of thousands of partitions.
• The documentation provides additional guidance for planning tasks.
• The impala-shell interpreter now supports UTF-8 characters for input and output. You can control whether
impala-shell ignores invalid Unicode code points through the --strict_unicode option. (This option was
removed in Impala 2.0.)
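A few of the additions above can be sketched as follows (the logs table is hypothetical):
SELECT round(123.456, 2);                                          -- 123.46
SELECT is_inf(CAST('inf' AS DOUBLE)), is_nan(CAST('nan' AS DOUBLE));  -- true, true
SELECT max_int(), min_smallint();                                  -- range-check helpers
SHOW PARTITIONS logs;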
Note: Impala 1.3.3 is only available as part of CDH 5.0.5, not under CDH 4.
Note: Impala 1.3.2 is only available as part of CDH 5.0.4, not under CDH 4.
Note:
• The Impala 1.3.1 release is available for both CDH 4 and CDH 5. This is the first release in the 1.3.x
series for CDH 4.
• The admission control feature lets you control and prioritize the volume and resource consumption of
concurrent queries. This mechanism reduces spikes in resource usage, helping Impala to run alongside other
kinds of workloads on a busy cluster. It also provides more user-friendly conflict resolution when multiple
memory-intensive queries are submitted concurrently, avoiding resource contention that formerly resulted
in out-of-memory errors. See Admission Control and Query Queuing for details.
• Enhanced EXPLAIN plans provide more detail in an easier-to-read format. Now there are four levels of
verbosity: the EXPLAIN_LEVEL option can be set from 0 (most concise) to 3 (most verbose). See EXPLAIN
Statement for syntax and Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles for
usage information.
• The TIMESTAMP data type accepts more kinds of input string formats through the UNIX_TIMESTAMP function,
and produces more varieties of string formats through the FROM_UNIXTIME function. The documentation
now also lists more functions for date arithmetic, used for adding and subtracting INTERVAL expressions
from TIMESTAMP values. See Impala Date and Time Functions for details.
• New conditional functions, NULLIF(), NULLIFZERO(), and ZEROIFNULL(), simplify porting SQL containing
vendor extensions to Impala (sketched at the end of this list). See Impala Conditional Functions for details.
• New utility function, CURRENT_DATABASE(). See Impala Miscellaneous Functions for details.
• Integration with the YARN resource management framework. Only available in combination with CDH 5. This
feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests
to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are
available. See Integrated Resource Management with YARN for full details.
On the Impala side, this feature involves some new startup options for the impalad daemon:
– -enable_rm
– -llama_host
– -llama_port
– -llama_callback_port
– -cgroup_hierarchy_path
For details of these startup options, see Modifying Impala Startup Options.
This feature also involves several new or changed query options that you can set through the impala-shell
interpreter and apply within a specific session:
– MEM_LIMIT: the function of this existing option changes when Impala resource management is enabled.
– REQUEST_POOL: a new option. (Renamed from YARN_POOL in Impala 1.3.0.)
– V_CPU_CORES: a new option.
– RESERVATION_REQUEST_TIMEOUT: a new option.
For details of these query options, see impala-shell Query Options for Resource Management.
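A minimal sketch of the new conditional functions and the EXPLAIN_LEVEL option; the logs table is hypothetical,
and SET here is the impala-shell command:
SELECT nullifzero(0) AS a, zeroifnull(NULL) AS b, nullif(1, 1) AS c;  -- NULL, 0, NULL
-- In impala-shell, raise plan verbosity from 0 (most concise) to 3 (most verbose):
SET EXPLAIN_LEVEL=3;
EXPLAIN SELECT count(*) FROM logs;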
Note: Impala 1.2.4 works with CDH 4. It is primarily a bug fix release for Impala 1.2.3, plus some
performance enhancements for the catalog server to minimize startup and DDL wait times for Impala
deployments with large numbers of databases, tables, and partitions.
• On Impala startup, the metadata loading and synchronization mechanism has been improved and optimized,
to give more responsiveness when starting Impala on a system with a large number of databases, tables,
or partitions. The initial metadata loading happens in the background, allowing queries to be run before the
entire process is finished. When a query refers to a table whose metadata is not yet loaded, the query waits
until the metadata for that table is loaded, and the load operation for that table is prioritized to happen first.
• Formerly, if you created a new table in Hive, you had to issue the INVALIDATE METADATA statement (with
no table name) which was an expensive operation that reloaded metadata for all tables. Impala did not
recognize the name of the Hive-created table, so you could not do INVALIDATE METADATA new_table to
get the metadata for just that one table. Now, when you issue INVALIDATE METADATA table_name, Impala
checks to see if that name represents a table created in Hive, and if so, recognizes the new table and loads
its metadata (see the example after this list). Additionally, if the new table is in a database that was newly
created in Hive, Impala also recognizes the new database.
• If you issue INVALIDATE METADATA table_name and the table has been dropped through Hive, Impala will
recognize that the table no longer exists.
• New startup options let you control the parallelism of the metadata loading during startup for the catalogd
daemon:
– --load_catalog_in_background makes Impala load and cache metadata using background threads
after startup. It is true by default. Previously, a system with a large number of databases, tables, or
partitions could be unresponsive or even time out during startup.
– --num_metadata_loading_threads determines how much parallelism Impala devotes to loading
metadata in the background. The default is 16. You might increase this value for systems with huge
numbers of databases, tables, or partitions. You might lower this value for busy systems that are
CPU-constrained due to jobs from components other than Impala.
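For example, after creating a table through the Hive shell, the metadata for just that table can now be loaded
from Impala (new_table is a hypothetical name):
INVALIDATE METADATA new_table;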
Note: Impala 1.2.3 works with CDH 4 and with CDH 5 beta 2. The resource management feature
requires CDH 5 beta.
Impala 1.2.3 contains exactly the same feature set as Impala 1.2.2. Its only difference is one additional fix for
compatibility with Parquet files generated outside of Impala by components such as Hive, Pig, or MapReduce.
If you are upgrading from Impala 1.2.1 or earlier, see New Features in Impala Version 1.2.2 on page 50 for the
latest added features.
New Features in Impala Version 1.2.2
Note: Impala 1.2.2 works with CDH 4. Its feature set is a superset of features in the Impala 1.2.0 beta,
with the exception of resource management, which relies on CDH 5.
Impala 1.2.2 includes new features for performance, security, and flexibility. The major enhancements over 1.2.1
are performance related, primarily for join queries.
New user-visible features include:
• Join order optimizations. This highly valuable feature automatically distributes and parallelizes the work for
a join query to minimize disk I/O and network traffic. The automatic optimization reduces the need to use
query hints or to rewrite join queries with the tables in a specific order based on size or cardinality. The new
COMPUTE STATS statement gathers statistical information about each table that is crucial for enabling the
join optimizations. See Performance Considerations for Join Queries for details.
• COMPUTE STATS statement to collect both table statistics and column statistics with a single statement.
Intended to be more comprehensive, efficient, and reliable than the corresponding Hive ANALYZE TABLE
statement, which collects statistics in multiple phases through MapReduce jobs. These statistics are important
for query planning for join queries, queries on partitioned tables, and other types of data-intensive operations.
For optimal planning of join queries, you need to collect statistics for each table involved in the join (see the
sketch at the end of this list). See COMPUTE STATS Statement for details.
• Reordering of tables in a join query can be overridden by the STRAIGHT_JOIN operator, allowing you to
fine-tune the planning of the join query if necessary, by using the original technique of ordering the joined
tables in descending order of size. See Overriding Join Reordering with STRAIGHT_JOIN for details.
• The CROSS JOIN clause in the SELECT statement allows Cartesian products in queries, that is, joins without
an equality comparison between columns of the two tables. Because such queries must be carefully checked
to avoid accidental overconsumption of memory, you must use the CROSS JOIN operator to explicitly select
this kind of join. See Cross Joins and Cartesian Products with the CROSS JOIN Operator for examples.
• The ALTER TABLE statement has new clauses that let you fine-tune table statistics. You can use this
technique as a less-expensive way to update specific statistics, in case the statistics become stale, or to
experiment with the effects of different data distributions on query planning.
• LDAP username/password authentication in JDBC/ODBC. See Enabling LDAP Authentication for Impala for
details.
• GROUP_CONCAT() aggregate function to concatenate column values across all rows of a result set.
• The INSERT statement now accepts hints, [SHUFFLE] and [NOSHUFFLE], to influence the way work is
redistributed during INSERT...SELECT operations. The hints are primarily useful for inserting into partitioned
Parquet tables, where using the [SHUFFLE] hint can avoid problems due to memory consumption and
simultaneous open files in HDFS, by collecting all the new data for each partition on a specific node.
• Several built-in functions and operators are now overloaded for more numeric data types, to reduce the
requirement to use CAST() for type coercion in INSERT statements. For example, the expression 2+2 in an
INSERT statement formerly produced a BIGINT result, requiring a CAST() to be stored in an INT variable.
Now, addition, subtraction, and multiplication only produce a result that is one step “bigger” than their
arguments, and numeric and conditional functions can return SMALLINT, FLOAT, and other smaller types
rather than always BIGINT or DOUBLE.
• New fnv_hash() built-in function for constructing hashed values. See Impala Mathematical Functions for
details.
• The clause STORED AS PARQUET is accepted as an equivalent for STORED AS PARQUETFILE. This more
concise form is recommended for new code.
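A brief sketch of the statistics and join-control features above, using hypothetical customers and orders tables:
-- Gather table and column statistics to enable the join optimizations.
COMPUTE STATS orders;
-- Override automatic join reordering when necessary.
SELECT STRAIGHT_JOIN c.c_name, o.total
FROM orders o JOIN customers c ON o.cust_id = c.c_id;
-- Cartesian products must be requested explicitly.
SELECT * FROM dim_colors CROSS JOIN dim_sizes;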
Because Impala 1.2.2 builds on a number of features introduced in 1.2.1, if you are upgrading from an older 1.1.x
release straight to 1.2.2, also review New Features in Impala Version 1.2.1 on page 51 to see features such as
the SHOW TABLE STATS and SHOW COLUMN STATS statements, and user-defined functions (UDFs).
New Features in Impala Version 1.2.1
Note: Impala 1.2.1 works with CDH 4. Its feature set is a superset of features in the Impala 1.2.0 beta,
with the exception of resource management, which relies on CDH 5.
Impala 1.2.1 includes new features for security, performance, and flexibility.
New user-visible features include:
• SHOW TABLE STATS table_name and SHOW COLUMN STATS table_name statements, to verify that statistics
are available and to see the values used during query planning (sketched at the end of this list).
• CREATE TABLE AS SELECT syntax, to create a new table and transfer data into it in a single operation.
• OFFSET clause, for use with the ORDER BY and LIMIT clauses to produce “paged” result sets such as items
1-10, then 11-20, and so on.
• NULLS FIRST and NULLS LAST clauses to ensure consistent placement of NULL values in ORDER BY queries.
• New built-in functions: least(), greatest(), initcap().
• New aggregate function: ndv(), a fast alternative to COUNT(DISTINCT col) returning an approximate result.
• The LIMIT clause can now accept a numeric expression as an argument, rather than only a literal constant.
• The SHOW CREATE TABLE statement displays the end result of all the CREATE TABLE and ALTER TABLE
statements for a particular table. You can use the output to produce a simplified setup script for a schema.
• The --idle_query_timeout and --idle_session_timeout options for impalad control the time intervals
after which idle queries are cancelled, and idle sessions expire. See Setting Timeout Periods for Daemons,
Queries, and Sessions for details.
• User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important
when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into
Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run
high-performance functions written in C++, or you can reuse existing Hive functions written in Java.
You create UDFs through the CREATE FUNCTION statement and drop them through the DROP FUNCTION
statement (a minimal example appears at the end of this list). See Impala User-Defined Functions (UDFs)
for instructions about coding, building, and deploying UDFs, and CREATE FUNCTION Statement and DROP
FUNCTION Statement for related SQL syntax.
• A new service automatically propagates changes to table data and metadata made by one Impala node,
sending the new or updated metadata to all the other Impala nodes. The automatic synchronization
mechanism eliminates the need to use the INVALIDATE METADATA and REFRESH statements after issuing
Impala statements such as CREATE TABLE, ALTER TABLE, DROP TABLE, INSERT, and LOAD DATA.
For even more precise synchronization, you can enable the SYNC_DDL query option before issuing a DDL,
INSERT, or LOAD DATA statement. This option causes the statement to wait, returning only after the catalog
service has broadcast the applicable changes to all Impala nodes in the cluster.
Note:
Because the catalog service only monitors operations performed through Impala, INVALIDATE
METADATA and REFRESH are still needed on the Impala side after creating new tables or loading
data through the Hive shell or by manipulating data files directly in HDFS. Because the catalog
service broadcasts the result of the REFRESH and INVALIDATE METADATA statements to all Impala
nodes, when you do need to use those statements, you can do so a single time rather than on
every Impala node.
This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.
• CREATE TABLE ... AS SELECT syntax, to create a table and copy data into it in a single operation. See
CREATE TABLE Statement for details.
• The CREATE TABLE and ALTER TABLE statements have new clauses TBLPROPERTIES and WITH
SERDEPROPERTIES. The TBLPROPERTIES clause lets you associate arbitrary items of metadata with a particular
table as key-value pairs. The WITH SERDEPROPERTIES clause lets you specify the serializer/deserializer
(SerDes) classes that read and write data for a table; although Impala does not make use of these properties,
sometimes particular values are needed for Hive compatibility. See CREATE TABLE Statement and ALTER
TABLE Statement for details.
• Impersonation support lets you authorize certain OS users associated with applications (for example, hue),
to submit requests using the credentials of other users. Only available in combination with CDH 5. See
Configuring Per-User Access for Hue for details.
• Enhancements to EXPLAIN output. In particular, when you enable the new EXPLAIN_LEVEL query option,
the EXPLAIN and PROFILE statements produce more verbose output showing estimated resource requirements
and whether table and column statistics are available for the applicable tables and columns. See EXPLAIN
Statement for details.
• SHOW CREATE TABLE summarizes the effects of the original CREATE TABLE statement and any subsequent
ALTER TABLE statements, giving you a CREATE TABLE statement that will re-create the current structure
and layout for a table.
• The LIMIT clause for queries now accepts an arithmetic expression, in addition to numeric literals.
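A short sketch of the statistics and result-paging features above; the table and column names are hypothetical:
-- Verify that statistics are available for query planning.
SHOW TABLE STATS orders;
SHOW COLUMN STATS orders;
-- "Paged" results with consistent placement of NULLs.
SELECT c_name FROM customers
ORDER BY c_name NULLS LAST
LIMIT 10 OFFSET 10;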
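And a minimal, hypothetical UDF registration; the library path, symbol, and function name are illustrative only:
-- Register a C++ scalar UDF from a shared library already in HDFS.
CREATE FUNCTION my_lower(STRING) RETURNS STRING
LOCATION '/user/impala/udfs/libmyudfs.so' SYMBOL='MyLower';
SELECT my_lower(c_name) FROM customers;
DROP FUNCTION my_lower(STRING);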
Note: The Impala 1.2.0 beta release only works in combination with the beta version of CDH 5. The
Impala 1.2.0 software is bundled together with the CDH 5 beta 1 download.
The Impala 1.2.0 beta includes new features for security, performance, and flexibility.
New user-visible features include:
• User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important
when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into
Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run
high-performance functions written in C++, or you can reuse existing Hive functions written in Java.
You create UDFs through the CREATE FUNCTION statement and drop them through the DROP FUNCTION
statement. See Impala User-Defined Functions (UDFs) for instructions about coding, building, and deploying
UDFs, and CREATE FUNCTION Statement and DROP FUNCTION Statement for related SQL syntax.
• A new service automatically propagates changes to table data and metadata made by one Impala node,
sending the new or updated metadata to all the other Impala nodes. The automatic synchronization
mechanism eliminates the need to use the INVALIDATE METADATA and REFRESH statements after issuing
Impala statements such as CREATE TABLE, ALTER TABLE, DROP TABLE, INSERT, and LOAD DATA.
Note:
Because this service only monitors operations performed through Impala, INVALIDATE METADATA
and REFRESH are still needed on the Impala side after creating new tables or loading data through
the Hive shell or by manipulating data files directly in HDFS. Because the catalog service broadcasts
the result of the REFRESH and INVALIDATE METADATA statements to all Impala nodes, when you
do need to use those statements, you can do so a single time rather than on every Impala node.
This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.
• Integration with the YARN resource management framework. Only available in combination with CDH 5. This
feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests
to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are
available. See Integrated Resource Management with YARN for full details.
On the Impala side, this feature involves some new startup options for the impalad daemon:
– -enable_rm
– -llama_host
– -llama_port
– -llama_callback_port
– -cgroup_hierarchy_path
For details of these startup options, see Modifying Impala Startup Options.
This feature also involves several new or changed query options that you can set through the impala-shell
interpreter and apply within a specific session:
– MEM_LIMIT: the function of this existing option changes when Impala resource management is enabled.
– YARN_POOL: a new option. (Renamed to REQUEST_POOL in Impala 1.3.0.)
– V_CPU_CORES: a new option.
– RESERVATION_REQUEST_TIMEOUT: a new option.
For details of these query options, see impala-shell Query Options for Resource Management.
• CREATE TABLE ... AS SELECT syntax, to create a table and copy data into it in a single operation (sketched
at the end of this list). See CREATE TABLE Statement for details.
• The CREATE TABLE and ALTER TABLE statements have a new TBLPROPERTIES clause that lets you associate
arbitrary items of metadata with a particular table as key-value pairs. See CREATE TABLE Statement and
ALTER TABLE Statement for details.
• Impersonation support lets you authorize certain OS users associated with applications (for example, hue),
to submit requests using the credentials of other users. Only available in combination with CDH 5. See
Configuring Per-User Access for Hue for details.
• Enhancements to EXPLAIN output. In particular, when you enable the new EXPLAIN_LEVEL query option,
the EXPLAIN and PROFILE statements produce more verbose output showing estimated resource requirements
and whether table and column statistics are available for the applicable tables and columns. See EXPLAIN
Statement for details.
• Parquet data files generated by Impala 1.1.1 are now compatible with the Parquet support in Hive. See
Cloudera Impala Incompatible Changes on page 65 for the procedure to update older Impala-created Parquet
files to be compatible with the Hive Parquet support.
• Additional improvements to stability and resource utilization for Impala queries.
• Additional enhancements for compatibility with existing file formats.
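For example, assuming a hypothetical orders table, the CTAS and TBLPROPERTIES features combine as follows:
-- Create a table and copy data into it in a single operation.
CREATE TABLE big_orders AS SELECT * FROM orders WHERE total > 1000;
-- Attach arbitrary key-value metadata to the new table.
ALTER TABLE big_orders SET TBLPROPERTIES ('created.by' = 'nightly_etl');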
New Features in Impala Version 1.1
Impala 1.1 includes new features for security, performance, and usability.
New user-visible features include:
• Extensive new security features, built on top of the Sentry open source project. Impala now supports
fine-grained authorization based on roles. A policy file determines which privileges on which schema objects
(servers, databases, tables, and HDFS paths) are available to users based on their membership in groups. By
assigning privileges for views, you can control access to table data at the column level. For details, see
Overview of Impala Security.
• Impala 1.1 works with Cloudera Manager 4.6 or higher. To use Cloudera Manager to manage authorization
for the Impala web UI (the web pages served from port 25000 by default), use Cloudera Manager 4.6.2 or
higher.
• Impala can now create, alter, drop, and query views. Views provide a flexible way to set up simple aliases for
complex queries; hide query details from applications and users; and simplify maintenance as you rename
or reorganize databases, tables, and columns. See the overview section Views and the statements CREATE
VIEW Statement, ALTER VIEW Statement, and DROP VIEW Statement.
• Performance is improved through a number of automatic optimizations. Resource consumption is also
reduced for Impala queries. These improvements apply broadly across all kinds of workloads and file formats.
The major areas of performance enhancement include:
– Improved disk and thread scheduling, which applies to all queries.
– Improved hash join and aggregation performance, which applies to queries with large build tables or a
large number of groups.
– Dictionary encoding with Parquet, which applies to Parquet tables with short string columns.
– Improved performance on systems with SSDs, which applies to all queries and file formats.
• Some new built-in functions are implemented: translate() to substitute characters within strings, user() to
check the login ID of the connected user.
• The new WITH clause for SELECT statements lets you simplify complicated queries in a way similar to creating
a view. The effects of the WITH clause last only for the duration of one query, unlike views, which are persistent
schema objects that can be used by multiple sessions or applications (see the sketch at the end of this list).
See WITH Clause.
• An enhancement to DESCRIBE statement, DESCRIBE FORMATTED table_name, displays more detailed
information about the table. This information includes the file format, location, delimiter, ownership, external
or internal, creation and access times, and partitions. The information is returned as a result set that can be
interpreted and used by a management or monitoring application. See DESCRIBE Statement.
• You can now insert a subset of columns for a table, with other columns being left as all NULL values. Or you
can specify the columns in any order in the destination table, rather than having to match the order of the
corresponding columns in the source query or VALUES clause. This feature is known as “column permutation”.
See INSERT Statement.
• The new LOAD DATA statement lets you load data into a table directly from an HDFS data file. This technique
lets you minimize the number of steps in your ETL process, and provides more flexibility. For example, you
can bring data into an Impala table in one step. Formerly, you might have created an external table where
the data files are not entirely under your control, or copied the data files to Impala data directories manually,
or loaded the original data into one table and then used the INSERT statement to copy it to a new table with
a different file format, partitioning scheme, and so on. See LOAD DATA Statement.
• Improvements to Impala-HBase integration:
– New query options for HBase performance: HBASE_CACHE_BLOCKS and HBASE_CACHING.
– Support for binary data types in HBase tables. See Supported Data Types for HBase Columns for details.
• You can issue REFRESH as a SQL statement through any of the programming interfaces that Impala supports.
REFRESH formerly had to be issued as a command through the impala-shell interpreter, and was not
available through a JDBC or ODBC API call. As part of this change, the functionality of the REFRESH statement
is divided between two statements. In Impala 1.1, REFRESH requires a table name argument and immediately
reloads the metadata; the new INVALIDATE METADATA statement works the same as the Impala 1.0 REFRESH
did: the table name argument is optional, and the metadata for one or all tables is marked as stale, but not
actually reloaded until the table is queried. When you create a new table in the Hive shell or through a different
Impala node, you must enter INVALIDATE METADATA with no table parameter before you can see the new
table in impala-shell. See REFRESH Statement and INVALIDATE METADATA Statement.
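A compact sketch of the view, WITH clause, and LOAD DATA features above; the tables, paths, and names are
hypothetical:
-- Views set up simple aliases for complex queries.
CREATE VIEW big_orders_v AS SELECT * FROM orders WHERE total > 1000;
-- The WITH clause has a similar effect for the duration of one query.
WITH t AS (SELECT cust_id, sum(total) AS spend FROM orders GROUP BY cust_id)
SELECT * FROM t WHERE spend > 5000;
-- Load files already in HDFS directly into a table.
LOAD DATA INPATH '/user/etl/staging/orders' INTO TABLE orders;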
New Features in Impala Version 1.0.1
The primary enhancements in Impala 1.0.1 are internal, for compatibility with the new Cloudera Manager 4.6
release. Try out the new Impala Query Monitoring feature in Cloudera Manager 4.6, which requires Impala 1.0.1.
New user-visible features include:
• The VALUES clause lets you INSERT one or more rows using literals, function return values, or other expressions
(see the sketch after this list). For performance and scalability, you should still use INSERT ... SELECT for
bringing large quantities of data into an Impala table. The VALUES clause is a convenient way to set up small
tables, particularly for initial testing of SQL features that do not require large amounts of data. See VALUES
Clause for details.
• The -B and -o options of the impala-shell command can turn query results into delimited text and store
them in an output file. The plain-text results are useful for use with other Hadoop components or Unix tools.
In benchmark tests, it is also faster to produce plain rather than pretty-printed results, and to write to a file
rather than to the screen, giving a more accurate picture of the actual query time.
• Several bug fixes. See Issues Fixed in the 1.0.1 Release on page 158 for details.
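For instance, a minimal sketch of the VALUES clause for setting up a small test table (the names are hypothetical):
CREATE TABLE t1 (id INT, name STRING);
INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'c');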
New Features in Impala Version 1.0
This version has multiple performance improvements and adds the following functionality:
• Several bug fixes. See Issues Fixed in the 1.0 GA Release on page 160.
• ALTER TABLE statement.
• Hints to allow specifying a particular join strategy.
• REFRESH for a single table.
• Dynamic resource management, allowing high concurrency for Impala queries.
New Features in Version 0.7 of the Cloudera Impala Beta Release
This version has multiple performance improvements and adds the following functionality:
• Several bug fixes. See Issues Fixed in Version 0.7 of the Beta Release on page 162.
• Support for the Parquet file format. For more information on file formats, see How Impala Works with Hadoop
File Formats.
• Added support for Avro.
• Support for the memory limits. For more information, see the example on modifying memory limits in
Modifying Impala Startup Options.
• Bigger and faster joins through the addition of partitioned joins to the already supported broadcast joins.
• Fully distributed aggregations.
• Fully distributed top-n computation.
• Support for creating and altering tables.
• Support for GROUP BY with floats and doubles.
In this version, both CDH 4.1 and 4.2 are supported, but due to the performance improvements added, we highly
recommend using CDH 4.2 or higher to get the full benefit. If you are using Cloudera Manager, version 4.5 is
required.
New Features in Version 0.6 of the Cloudera Impala Beta Release
• Several bug fixes. See Issues Fixed in Version 0.6 of the Beta Release on page 163.
• Added support for Impala on SUSE and Debian/Ubuntu. Impala is now supported on:
-default_query_options='key=value;key=value'
Incompatible Changes
Important:
For changes in operating-system support, and other major requirements, see CDH 5 Requirements
and Supported Versions.
• Scala's Iterable has been replaced by TraversableOnce inside Scrunch flatMap functions in order to
support functions that return Iterators.
CDH 5.4.0 introduces new HBase APIs, which will probably require some changes to Crunch code developed
against HBase 0.96 APIs. For more information, see the section on Apache Crunch on page 7 under "What's
New in CDH 5.4.0".
Important: There is no separate tarball for MRv1. Instead, the MRv1 binaries, examples, etc., are
delivered in the Hadoop tarball itself. The scripts for running MRv1 are in the bin-mapreduce1
directory in the tarball, and the MRv1 examples are in the examples-mapreduce1 directory. You need
to do some additional configuration; follow the directions below.
Note: In the steps that follow, install_dir is the name of the directory into which you extracted
the files.
ln -s install_dir/bin-mapreduce1 install_dir/share/hadoop/mapreduce1/bin
ln -s install_dir/etc/hadoop-mapreduce1 install_dir/share/hadoop/mapreduce1/conf
4. Set the HADOOP_HOME and HADOOP_CONF_DIR environment variables in your execution environment as follows:
$ export HADOOP_HOME=install_dir/share/hadoop/mapreduce1
$ export HADOOP_CONF_DIR=$HADOOP_HOME/conf
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<optional>true</optional>
</dependency>
Now, when building against CDH 5 you will need to add a dependency for the hbase-client JAR. The hbase
module continues to exist as a convenient top-level wrapper for existing clients, and it pulls in all the
sub-modules automatically. But it is only a simple wrapper, so its repository directory will carry no actual
jars.
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
If your code uses the HBase minicluster, you can pull in the hbase-testing-util dependency:
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-testing-util</artifactId>
<version>${cdh.hbase.version}</version>
</dependency>
If you need to obtain all HBase JARs required to build a project, copy them from the CDH installation directory
(typically /usr/lib/hbase for an RPM install, or /opt/cloudera/parcels/CDH/lib/hbase if you install
using Parcels), or from the CDH 5 HBase tarballs. However, for building client applications, Cloudera
recommends using build tools such as Maven, rather than manually referencing JARs.
• CDH 5 introduces support for addressing cells with an empty column qualifier (a string of 0 bytes in length),
but not all edge services handle that scenario correctly. In some cases, attempting to address a cell at [
rowkey, fam ] results in interaction with the entire column family, rather than the empty column qualifier.
Users of the HBase Shell, MapReduce, REST, and Thrift must use family (with the ":" omitted) rather
than family: to interact with an entire column family. Including the ":" is interpreted as an interaction
with the empty qualifier in the family column family.
• API Removals
• HBASE-7315/HBASE-7263 - Row lock user API has been removed.
• HBASE-6706 - Removed total order partitioner.
• The behavior of the MUST_PASS_ALL filter changed between CDH 4 and CDH 5. In CDH 4, a FilterList with
the default MUST_PASS_ALL operator returns all rows (it does not filter the results). In CDH 5, no results are
returned when the FilterList is empty with the MUST_PASS_ALL operator. To continue using the CDH 4
behavior, modify your code to use the scan.setLoadColumnFamiliesOnDemand(false); method.
• API changes: see New Features and Changes for HBase in CDH 5. CDH reverted API changes in HBase 1.0
which broke compatibility with HBase in CDH 5.0, 5.1, 5.2, and 5.3. If you have written applications using
Apache HBase 1.0 APIs, you may need to modify these applications to run in CDH 5.4.
Differences between CDH 5.4 HBase 1.0 and Apache HBase 1.0:
• CDH 5.4.0 keeps commons-math at version 2.1 to maintain compatibility with earlier CDH releases, whereas
Apache HBase 1.0 uses commons-math 2.2.
• CDH 5.4.0 keeps Netty at version 3 to maintain compatibility with earlier CDH releases, whereas Apache
HBase 1.0 uses Netty 4.
• Starting with CDH 5.2, you can specify a global default number of versions, which will be applied to all newly
created tables where the number of versions is not otherwise specified, by setting
hbase.column.max.version to the desired number of versions in hbase-site.xml.
• HBase in CDH 5.2 differs from Apache HBase 0.98.6 in that CDH does not include HBASE-11546, which
provides ZooKeeper-less region assignment. CDH omits this feature because it is an incompatible change
that prevents an upgraded cluster from being rolled back to a previous version.
Developer Interface Changes
• HBase 0.98.5 removed ClientSmallScanner from the public API. HBase in CDH 5.2 restores the constructor
to maintain backward compatibility, but in future releases of HBase, this class will no longer be public. You
should change your code to use the Scan.setSmall(true) method instead.
You can set the hbase.hconnection.threads.max property in hbase-site.xml to control the pool size
or you can pass an ExecutorService to HConnectionManager.createConnection().
Warning:
CDH 5 Beta 1 and Beta 2 are not intended for production use, and have been superseded by official
releases in the CDH 5 family.
The HBase client from CDH 5 Beta 1 is not wire compatible with CDH 5 Beta 2 because of changes introduced
in HBASE-9612. As a consequence, CDH 5 Beta 1 users will not be able to execute a rolling upgrade to CDH 5
Beta 2 (or later). This patch unifies the way the HBase clients make requests and simplifies the internals, but
breaks wire compatibility. Developers may need to recompile applications built upon the CDH 5 Beta 1 API.
As of CDH 5 Beta 1 (HBase 0.95), the value of hbase.regionserver.checksum.verify defaults to true; in
earlier releases the default was false.
API Removals
• See API Differences between CDH 4.5 and CDH 5 Beta 2.
Compatibility between CDH Beta and Apache HBase Releases
• Apache HBase 0.95.2 is not wire compatible with CDH 5 Beta 1 HBase 0.95.2.
• Apache HBase 0.96.x should be wire compatible with CDH 5 Beta 2 HBase 0.96.1.1.
Note: As of CDH 5, HCatalog is part of Apache Hive; incompatible changes in HCatalog are included
below.
Metastore schema upgrade: CDH 5.2.0 includes Hive version 0.13.1. Upgrading from an earlier Hive version to
Hive 0.13.1 or later requires a metastore schema upgrade.
Warning:
You must upgrade the metastore schema before starting the new version of Hive. Failure to do so
may result in metastore corruption. See Upgrading Hive.
CDH 5 includes a new offline tool called schematool; Cloudera recommends you use this tool to upgrade your
metastore schema. See Upgrade the Metastore Schema for more information.
Hive upgrade: Upgrading Hive from CDH 4 to CDH 5, or from an earlier CDH 5.x release to CDH 5.2 or later, requires
several manual steps. Follow the upgrade guide closely. See Upgrading Hive.
Incompatible changes between CDH 4 and CDH 5:
• The CDH 4 JDBC client is not compatible with CDH 5 HiveServer2. JDBC applications connecting to the CDH 5
HiveServer2 will require the CDH 5 JDBC client driver.
• JDBC applications will require the newer CDH 5 JDBC packages in order to connect to HiveServer2. You do not
need to recompile applications for this change.
• Because of security and concurrency issues, the original Hive server (HiveServer1) and the Hive command-line
interface (CLI) are deprecated in current versions of CDH 5 and will be removed in a future release. Cloudera
strongly encourages you to migrate to HiveServer2 and Beeline as soon as possible.
• CDH 5 Hue will not work with HiveServer2 from CDH 4.
• The npath function has been removed.
• Cloudera recommends that custom ObjectInspectors created for use with custom SerDes have a no-argument
constructor in addition to their normal constructors, for serialization purposes. See HIVE-5380 for more
details.
• The SerDe interface has changed, which requires custom SerDe modules to be reworked.
• The decimal data type format has changed as of CDH 5 Beta 2 and is not compatible with CDH 4.
• From CDH 5 Beta 2 onwards, the Parquet SerDe is part of the Hive package. The SerDe class name has
changed as a result. However, there is a wrapper class for backward compatibility, so any existing Hive tables
created with the Parquet SerDe will continue to work with CDH 5 Beta 2 and later Hive versions.
Incompatible changes between any earlier CDH version and CDH 5.4.x:
• CDH 5.2.0 and later clients cannot communicate with CDH 5.1.x and earlier servers. This means that you
must upgrade the server before the clients.
• As of CDH 5.2.0, DESCRIBE DATABASE returns additional fields: owner_name and owner_type. The command
will continue to behave as expected if you identify the field you're interested in by its (string) name, but could
produce unexpected results if you use a numeric index to identify the field(s).
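For example, a sketch with a hypothetical database name:

DESCRIBE DATABASE mydb;
-- CDH 5.2.0 and later append owner_name and owner_type to the result;
-- read fields by their string names rather than by numeric position.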
• CDH 5.2.0 implements HIVE-6248, which includes some backward-incompatible changes to the HCatalog
API.
• The CDH 5.2 Hive JDBC driver is not wire-compatible with the CDH 5.1 version of HiveServer2. Make sure you
upgrade Hive clients and all other Hive hosts in tandem: the server first, and then the clients.
• HiveServer 1 is deprecated as of CDH 5.3, and will be removed in a future release of CDH. Users of HiveServer
1 should upgrade to HiveServer 2 as soon as possible.
• org.apache.hcatalog is deprecated as of CDH 5.3. All client-facing classes were moved from
org.apache.hcatalog to org.apache.hive.hcatalog as of CDH 5.0 and the deprecated classes in
org.apache.hcatalog will be removed altogether in a future release. If you are still using
org.apache.hcatalog, you should move to org.apache.hive.hcatalog immediately.
• Date partition columns: as of Hive version 13, implemented in CDH 5.2, Hive validates the format of dates in
partition columns, if they are stored as dates. A partition column with a date in invalid form can neither be
used nor dropped once you upgrade to CDH 5.2 or higher. To avoid this problem, do one of the following:
– Fix any invalid dates before you upgrade. Hive expects dates in partition columns to be in the form
YYYY-MM-DD.
– Store dates in partition columns as strings or integers.
You can use the following SQL query to find any partition-column values stored as dates:
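A rough sketch of such a query, assuming a MySQL-backed Hive metastore with the default schema; the table
and column names below are assumptions to verify against your installation, not the documented query:

-- Sketch only: list date-typed partition values so invalid ones can be fixed.
SELECT t.TBL_NAME, pk.PKEY_NAME, pkv.PART_KEY_VAL
FROM PARTITION_KEY_VALS pkv
JOIN PARTITIONS p ON pkv.PART_ID = p.PART_ID
JOIN TBLS t ON p.TBL_ID = t.TBL_ID
JOIN PARTITION_KEYS pk ON t.TBL_ID = pk.TBL_ID AND pkv.INTEGER_IDX = pk.INTEGER_IDX
WHERE pk.PKEY_TYPE = 'date';
-- Add: AND pkv.PART_KEY_VAL NOT REGEXP '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' to flag
-- values that are not in YYYY-MM-DD form.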
• Decimal precision and scale: As of CDH 5.4, Hive support for decimal precision and scale changes as follows
(see the sketch after this list):
1. When decimal is used as a type, it means decimal(10, 0) rather than a precision of 38 with a variable
scale.
2. When Hive is unable to determine the precision and scale of a decimal type (for example, in the case of a
non-generic user-defined function (UDF) that has an evaluate() method that returns decimal), a
precision and scale of (38, 18) is assumed. In previous versions, a precision of 38 and a variable scale
were assumed. Cloudera recommends you develop generic UDFs instead, and specify exact precision and
scale.
3. When a decimal value is assigned or cast to a different decimal type, rounding is used to handle cases in
which the precision of the value is greater than that of the target decimal type, as long as the integer
portion of the value can be preserved. In previous versions, if the value's precision was greater than 38
(the only allowed precision for the decimal type), the value was set to null, regardless of whether the
integer portion could be preserved.
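A minimal sketch of the first two points, with hypothetical table names:

-- CDH 5.4 and later: a bare decimal now means decimal(10, 0).
CREATE TABLE d1 (amount DECIMAL);        -- equivalent to DECIMAL(10, 0)
CREATE TABLE d2 (amount DECIMAL(38, 6)); -- specify precision and scale explicitly
-- A non-generic UDF whose evaluate() returns decimal is now treated as DECIMAL(38, 18).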
[desktop]
...
# Comma-separated list of Django middleware classes to use.
# See https://docs.djangoproject.com/en/1.4/ref/middleware/ for more details on middlewares in Django.
middleware=desktop.auth.backend.LdapSynchronizationBackend,desktop.auth.backend.my_middleware
...
• HUE-1658 [oozie] Hue depends on OOZIE-1306 which is in CDH 5 Beta 2 but has not been included in any
other release yet. Set the following backward compatibility flag to false to use the old frequency number/unit
representation instead of the new crontab.
enable_cron_scheduling = false
• Hue 3.0.0 was a major revision of Hue. The user interface changed significantly.
• CDH 5 Hue will only work with the default system Python version of the operating system it is being installed
on. For example, on RHEL/CentOS 6 you will need Python 2.6 to start Hue.
Note: RHEL 5 and CentOS 5 users will have to download Python 2.6 from the EPEL repository.
• The Beeswax daemon has been replaced by HiveServer2. Hue should therefore point to a running HiveServer2.
This change involves removing the Beeswaxd code entirely and the following major updates to the [beeswax]
section of the Hue configuration file, hue.ini.
[beeswax]
# Host where Hive server Thrift daemon is running.
# If Kerberos security is enabled, use fully-qualified domain name (FQDN).
## hive_server_host=<FQDN of Hive Server>
• Search bind authentication is now used by default instead of direct bind. To revert to the previous settings,
use the new search_bind_authentication configuration property.
[desktop]
[[ldap]]
search_bind_authentication=false
• The Hue Shell app has been removed completely. This includes removing both the Shell app code and the
[shell] section from hue.ini.
• YARN should be used by default.
Note: The Impala 2.2.x maintenance releases now use the CDH 5.4.x numbering system rather than
increasing the Impala version numbers. Impala 2.2 and higher are not available under CDH 4.
Note: Impala 2.2.0 is available as part of CDH 5.4.0 and is not available for CDH 4. Cloudera does not
intend to release future versions of Impala for CDH 4, apart from patch and maintenance releases if
required. Given the upcoming end of maintenance for CDH 4, Cloudera recommends that all customers
migrate to a recent CDH 5 release.
Changes to Prerequisites
The prerequisite for CPU architecture has been relaxed in Impala 2.2.0 and higher. From this release onward,
Impala works on CPUs that have the SSSE3 instruction set. The SSE4 instruction set is no longer required. This
relaxed requirement simplifies the upgrade planning from Impala 1.x releases, which also worked on
SSSE3-enabled processors.
Incompatible Changes Introduced in Impala 2.1.3 / CDH 5.3.3
No incompatible changes.
Note: Impala 2.1.3 is available as part of CDH 5.3.3, not under CDH 4.
Note: Impala 2.1.2 is available as part of CDH 5.3.2, not under CDH 4.
Changes to Prerequisites
Currently, Impala 2.1.x does not function on CPUs without the SSE4.1 instruction set. This minimum CPU
requirement is higher than in previous versions, which relied on the older SSSE3 instruction set. Check the CPU
level of the hosts in your cluster before upgrading to Impala 2.1.x or CDH 5.3.x.
Note: Impala 2.0.4 is available as part of CDH 5.2.5, not under CDH 4.
Note: Impala 2.0.3 is available as part of CDH 5.2.4, not under CDH 4.
Note: Impala 2.0.2 is available as part of CDH 5.2.3, not under CDH 4.
Changes to Prerequisites
Currently, Impala 2.0.x does not function on CPUs without the SSE4.1 instruction set. This minimum CPU
requirement is higher than in previous versions, which relied on the older SSSE3 instruction set. Check the CPU
level of the hosts in your cluster before upgrading to Impala 2.0.x or CDH 5.2.x.
• By default, . does not match newline. This behavior can be overridden in the regex itself using the s flag.
• \Z is not supported.
• < and > for start of word and end of word are not supported.
• Lookahead and lookbehind are not supported.
• Shorthand notation for character classes, such as \d for digit, is not recognized. (This restriction is lifted in
Impala 2.0.1, which restores the shorthand notation.)
export LC_CTYPE=en_US.UTF-8
This change is unlikely to affect memory usage while writing Parquet files, because Impala does not pre-allocate
the memory needed to hold the entire Parquet block.
Incompatible Changes Introduced in Impala 1.4.4 / CDH 5.1.5
No incompatible changes.
Note: Impala 1.4.4 is available as part of CDH 5.1.5, not under CDH 4.
Note: Impala 1.4.3 is available as part of CDH 5.1.4, and under CDH 4.
Note: Impala 1.4.2 is only available as part of CDH 5.1.3, not under CDH 4.
The following were formerly reserved keywords, but are no longer reserved:
– COUNT
– GROUP_CONCAT
– NDV
– SUM
• The fix for issue IMPALA-973 changes the behavior of the INVALIDATE METADATA statement regarding
nonexistent tables. In Impala 1.4.0 and higher, the statement returns an error if the specified table is not in
the metastore database at all. It completes successfully if the specified table is in the metastore database
but not yet recognized by Impala, for example if the table was created through Hive. Formerly, you could
issue this statement for a completely nonexistent table, with no error.
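A sketch of the Impala 1.4.0 behavior, with hypothetical table names:

INVALIDATE METADATA table_created_through_hive;  -- succeeds once the table exists in the metastore
INVALIDATE METADATA no_such_table;               -- now returns an error instead of succeeding silently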
Note: Impala 1.3.3 is only available as part of CDH 5.0.5, not under CDH 4.
Note: Impala 1.3.2 is only available as part of CDH 5.0.4, not under CDH 4.
• The result set for the SHOW FUNCTIONS statement includes a new first column, with the data type of the
return value.
Note: Although the DECIMAL keyword is a reserved word, currently Impala does not support
DECIMAL as a data type for columns.
• The query option named YARN_POOL during the CDH 5 beta period is now named REQUEST_POOL to reflect
its broader use with the Impala admission control feature.
• There are some changes to the list of reserved words.
– The names of aggregate functions are no longer reserved words, so you can have databases, tables,
columns, or other objects named AVG, MIN, and so on without any name conflicts.
– The internal function names DISTINCTPC and DISTINCTPCSA are no longer reserved words, although
DISTINCT is still a reserved word.
Hive. Loading the metadata for only this one table is faster and involves less network overhead. Therefore, you
might revisit your setup DDL scripts to add the table name to INVALIDATE METADATA statements, in cases
where you create and populate the tables through Hive before querying them through Impala.
Incompatible Changes Introduced in Impala 1.2.3
Because the feature set of Impala 1.2.3 is identical to Impala 1.2.2, there are no new incompatible changes. See
Incompatible Changes Introduced in Impala 1.2.2 on page 71 if you are upgrading from Impala 1.2.1 or 1.1.x.
Incompatible Changes Introduced in Impala 1.2.2
The following changes to SQL syntax and semantics in Impala 1.2.2 could require updates to your SQL code, or
schema objects such as tables or views:
• With the addition of the CROSS JOIN keyword, you might need to rewrite any queries that refer to a table
named CROSS or use the name CROSS as a table alias:
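For example, a sketch with hypothetical tables, where CROSS was previously used as an alias:

-- Legal before 1.2.2, now a syntax conflict: SELECT * FROM t1 CROSS, t2 WHERE CROSS.id = t2.id;
SELECT * FROM t1 c, t2 WHERE c.id = t2.id;  -- rename the alias
SELECT * FROM t1 CROSS JOIN t2;             -- the new keyword in use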
• Formerly, a DROP DATABASE statement in Impala would not remove the top-level HDFS directory for that
database. The DROP DATABASE has been enhanced to remove that directory. (You still need to drop all the
tables inside the database first; this change only applies to the top-level directory for the entire database.)
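A sketch with a hypothetical database name:

DROP TABLE mydb.t1;  -- drop all tables in the database first
DROP DATABASE mydb;  -- now also removes the database's top-level HDFS directory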
• The keyword PARQUET is introduced as a synonym for PARQUETFILE in the CREATE TABLE and ALTER TABLE
statements, because that is the common name for the file format. (As opposed to SequenceFile and RCFile
where the “File” suffix is part of the name.) Documentation examples have been changed to prefer the new
shorter keyword. The PARQUETFILE keyword is still available for backward compatibility with older Impala
versions.
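For example, with hypothetical table names:

CREATE TABLE pq1 (id INT) STORED AS PARQUET;      -- preferred spelling from 1.2.2 onward
CREATE TABLE pq2 (id INT) STORED AS PARQUETFILE;  -- still accepted for backward compatibility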
• New overloads are available for several operators and built-in functions, allowing you to insert their result
values into smaller numeric columns such as INT, SMALLINT, TINYINT, and FLOAT without using a CAST()
call. If you remove the CAST() calls from INSERT statements, those statements might not work with earlier
versions of Impala.
Because many users are likely to upgrade straight from Impala 1.x to Impala 1.2.2, also read Incompatible
Changes Introduced in Impala 1.2.1 on page 71 for things to note about upgrading to Impala 1.2.x in general.
In a Cloudera Manager environment, the catalog service is not recognized or managed by Cloudera Manager
versions prior to 4.8. Cloudera Manager 4.8 and higher require the catalog service to be present for Impala.
Therefore, if you upgrade to Cloudera Manager 4.8 or higher, you must also upgrade Impala to 1.2.1 or higher.
Likewise, if you upgrade Impala to 1.2.1 or higher, you must also upgrade Cloudera Manager to 4.8 or higher.
Incompatible Changes Introduced in Impala 1.2.1
The following changes to SQL syntax and semantics in Impala 1.2.1 could require updates to your SQL code, or
schema objects such as tables or views:
• In Impala 1.2.1 and higher, all NULL values come at the end of the result set for ORDER BY ... ASC queries,
and at the beginning of the result set for ORDER BY ... DESC queries. In effect, NULL is considered greater
than all other values for sorting purposes. The original Impala behavior always put NULL values at the end,
even for ORDER BY ... DESC queries. The new behavior in Impala 1.2.1 makes Impala more compatible
with other popular database systems. In Impala 1.2.1 and higher, you can override or specify the sorting
behavior for NULL by adding the clause NULLS FIRST or NULLS LAST at the end of the ORDER BY clause.
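A sketch with a hypothetical table:

SELECT name FROM t1 ORDER BY name ASC;              -- NULL values now sort last
SELECT name FROM t1 ORDER BY name DESC;             -- NULL values now sort first
SELECT name FROM t1 ORDER BY name DESC NULLS LAST;  -- override to restore the old placement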
Impala 1.2.1 goes along with CDH 4.5 and Cloudera Manager 4.8. If you used the beta version Impala 1.2.0 that
came with the beta of CDH 5, Impala 1.2.1 includes all the features of Impala 1.2.0 except for resource
management, which relies on the YARN framework from CDH 5.
The new catalogd service might require changes to any user-written scripts that stop, start, or restart Impala
services, install or upgrade Impala packages, or issue REFRESH or INVALIDATE METADATA statements:
• See Impala Installation, Upgrading Impala and Starting Impala, for usage information for the catalogd
daemon.
• The REFRESH and INVALIDATE METADATA statements are no longer needed when the CREATE TABLE, INSERT,
or other table-changing or data-changing operation is performed through Impala. These statements are still
needed if such operations are done through Hive or by manipulating data files directly in HDFS, but in those
cases the statements only need to be issued on one Impala node rather than on all nodes. See REFRESH
Statement and INVALIDATE METADATA Statement for the latest usage information for those statements.
• See The Impala Catalog Service for background information on the catalogd service.
In a Cloudera Manager environment, the catalog service is not recognized or managed by Cloudera Manager
versions prior to 4.8. Cloudera Manager 4.8 and higher require the catalog service to be present for Impala.
Therefore, if you upgrade to Cloudera Manager 4.8 or higher, you must also upgrade Impala to 1.2.1 or higher.
Likewise, if you upgrade Impala to 1.2.1 or higher, you must also upgrade Cloudera Manager to 4.8 or higher.
Incompatible Changes Introduced in Impala 1.2.0 (Beta)
There are no incompatible changes to SQL syntax in Impala 1.2.0 (beta).
Because Impala 1.2.0 is bundled with the CDH 5 beta download and depends on specific levels of Apache Hadoop
components supplied with CDH 5, you can only install it in combination with the CDH 5 beta.
The new catalogd service might require changes to any user-written scripts that stop, start, or restart Impala
services, install or upgrade Impala packages, or issue REFRESH or INVALIDATE METADATA statements:
• See Impala Installation, Upgrading Impala and Starting Impala, for usage information for the catalogd
daemon.
• The REFRESH and INVALIDATE METADATA statements are no longer needed when the CREATE TABLE, INSERT,
or other table-changing or data-changing operation is performed through Impala. These statements are still
needed if such operations are done through Hive or by manipulating data files directly in HDFS, but in those
cases the statements only need to be issued on one Impala node rather than on all nodes. See REFRESH
Statement and INVALIDATE METADATA Statement for the latest usage information for those statements.
• See The Impala Catalog Service for background information on the catalogd service.
The new resource management feature interacts with both YARN and Llama services, which are available in
CDH 5. These services are set up for you automatically in a Cloudera Manager (CM) environment. For information
about setting up the YARN and Llama services, see the instructions for YARN and Llama in the CDH 5
Documentation.
Incompatible Changes Introduced in Impala 1.1.1
There are no incompatible changes in Impala 1.1.1.
Previously, it was not possible to create Parquet data through Impala and reuse that table within Hive. Now
that Parquet support is available for Hive 10, reusing existing Impala Parquet data files in Hive requires updating
the table metadata. Use the following command if you are already running Impala 1.1.1:
ALTER TABLE table_name SET FILEFORMAT PARQUETFILE;
If you are running a level of Impala that is older than 1.1.1, do the metadata update through Hive:
ALTER TABLE table_name SET FILEFORMAT
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
Impala 1.1.1 and higher can reuse Parquet data files created by Hive, without any action required.
As usual, make sure to upgrade the impala-lzo-cdh4 package to the latest level at the same time as you
upgrade the Impala server.
Incompatible Change Introduced in Cloudera Impala 1.1
• The REFRESH statement now requires a table name; in Impala 1.0, the table name was optional. This syntax
change is part of the internal rework to make REFRESH a true Impala SQL statement so that it can be called
through the JDBC and ODBC APIs. REFRESH now reloads the metadata immediately, rather than marking it
for update the next time any affected table is accessed. The previous behavior, where omitting the table
name caused a refresh of the entire Impala metadata catalog, is available through the new INVALIDATE
METADATA statement. INVALIDATE METADATA can be specified with a table name to affect a single table, or
without a table name to affect the entire metadata catalog; the relevant metadata is reloaded the next time
it is requested during the processing for a SQL statement. See REFRESH Statement and INVALIDATE
METADATA Statement for the latest details about these statements.
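A sketch of the new usage, with a hypothetical table name:

REFRESH t1;              -- Impala 1.1 and higher: the table name is required
INVALIDATE METADATA t1;  -- mark one table's metadata for reload on next access
INVALIDATE METADATA;     -- the equivalent of the old no-argument REFRESH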
• As of CDH 5.1, the Reserve API requires you to provide the reservationId as part of the request. In previous
releases, the reservationId was auto-generated and returned to the user.
The TLlamaAMReservationRequest has an additional field called reservation_id which needs to be
initialized to a UUID value. If you do not set this field, the request will result in an error with the error code
set to ErrorCode.RESERVATION_NO_ID_PROVIDED.
• The Expand API now requires you to provide the expansionId as part of the request. In previous releases,
the expansionId was auto-generated and returned to the user.
The TLlamaAMReservationExpansionRequest has an additional field called expansion_id which needs
to be initialized to a UUID value. If you do not set this field, the request will result in an error with the error
code set to ErrorCode.EXPANSION_NO_EXPANSION_ID_PROVIDED.
• MAHOUT-1362:
The examples/bin/build-reuters.sh script has been removed.
Some SSVD support code, such as UpperTriangularMatrix, has been moved to mahout-math.
• MAHOUT-1363:
Scala-related math code has been moved into an org.apache.mahout.math.scalabindings sub-package.
As a result of this change, compilation against the Solr 4.10.3 libraries may fail. To avoid this issue,
make the relevant source code changes, such as removing catch clauses for MalformedURLException,
and then recompile the application.
Related JIRA: SOLR-5555
• The SolrJ client JavaBinCodec serializes unknown objects differently
Starting with Search for CDH 5.4.0, Search moves from Solr 4.4 to Solr 4.10.3. With Solr 4.4, JavaBinCodec
serialized unknown Java objects as obj.toString(). In Solr 4.10.0, JavaBinCodec serializes unknown Java
objects as obj.getClass().getName() + ':' + obj.toString().
As a result, the same objects may produce different results when serialized with CDH 5.4 and later compared
with objects serialized with CDH 5.3 and earlier.
• Parsing using schema.xml creates an init error when <dynamicField/> declarations include default or
required attributes
In previous releases, these attributes were ignored. If init errors occur when upgrading with an existing
schema.xml, remove the default or required attributes. After removing these attributes, Search functions
as it did before upgrading.
Related JIRA: SOLR-5227.
• Indexing documents with terms that exceed Lucene's MAX_TERM_LENGTH registers errors
In previous releases, terms that exceeded the length limit were silently ignored. To make Search function as
it did in previous releases, silently ignoring longer terms, use solr.LengthFilterFactory in all of your
Analyzers.
Related JIRA: LUCENE-5472.
• The fieldType configuration docValuesFormat="Disk" is no longer supported
If your schema.xml contains fieldTypes using docValuesFormat="Disk", modify the file to remove the
docValuesFormat attribute and optimize your index to rewrite to the default codec. Make these changes
before upgrading to CDH 5.4.
Related JIRA: LUCENE-5761.
• UpdateRequestExt has been removed.
Use UpdateRequest instead.
Related JIRA: SOLR-4816.
• Parsing schema.xml registers errors when multiple values exist where only a single value is permitted.
With previous releases, when multiple values existed where only a single value was permitted, one value
was silently chosen. In CDH 5.4, if multiple values exist where only a single value is supported, configuration
parsing fails. The extra values must be removed.
Related JIRAs: SOLR-4953, SOLR-5108.
Incompatible changes between Cloudera Search for CDH 5.2 and Cloudera Search for CDH 5.3
Some packaging changes were made that have consequences for CrunchIndexerTool start-up scripts. If those
startup scripts include the following line:
Incompatible changes between Cloudera Search for CDH 5 beta 2 and older versions of Cloudera Search:
The following incompatible changes occurred between Cloudera Search for CDH 5 beta 2 and older versions of
Cloudera Search including both earlier versions of Cloudera Search for CDH 5 and Cloudera Search 1.x:
• Supported values for the --reducers option of the MapReduceIndexer tool change with the release of
Search for CDH 5 beta 2. To use one reducer per output shard, 0 is used in Search 1.x and Search for CDH 5
beta 1. With the release of Search for CDH 5 beta 2, -2 is used for one reducer per output shard. Because of
this change, commands using --reducers 0 that were written for previous Search releases do not continue
to work in the same way after upgrading to Search for CDH 5 beta 2. After upgrading to Search for CDH 5
beta 2, using --reducers 0 results in an exception stating that zero is an illegal value.
spark.master=spark://MASTER_IP:MASTER_PORT
where MASTER_IP is the IP address of the host the Spark master is running on and MASTER_PORT is the
port.
This setting means that all jobs will run in standalone mode by default; you can override the default on the
command line.
• The CDH 5.1 release of Spark includes changes that will enable Spark to avoid breaking compatibility in the
future. As a result, most applications will require a recompile to run against Spark 1.0, and some will require
changes in source code. The details are as follows:
• There are two changes in the core Scala API:
• The cogroup and groupByKey operators now return Iterators over their values instead of Seqs. This
change means that the set of values corresponding to a particular key need not all reside in memory
at the same time.
• SparkContext.jarOfClass now returns Option[String] instead of Seq[String].
• Spark’s Java APIs have been updated to accommodate Java 8 lambdas. See Migrating from pre-1.0 Versions
of Spark for more information.
Note:
CDH 5.1 does not support Java 8, which is supported as of CDH 5.3.
• If you have uploaded the Spark assembly JAR file to HDFS, you must upload the new version of the file each
time you upgrade Spark to a new minor CDH release (for example, any CDH 5.2.x, 5.3.x or 5.4 release, including
5.2.0, 5.3.0, and 5.4.0). You may also need to modify the configured path for the file; see the next bullet below.
• As of CDH 5.2, the configured paths for spark.eventLog.dir, spark.history.fs.logDirectory, and the
SPARK_JAR environment variable have changed in a way that may not be backward-compatible. By default,
those paths now refer to the local filesystem. To make sure everything works as before, modify the paths
as follows:
– For HDFS, if this is not a federated cluster, prepend hdfs: to the path.
– For HDFS in a federated cluster, prepend viewfs: to the path.
Alternatively, you can prepend the value of fs.defaultFS, set in core-site.xml in the HDFS configuration.
• The following changes introduced in CDH 5.2 may affect existing applications:
– The default for I/O compression is now Snappy (changed from LZF).
– PySpark now performs external spilling during aggregations.
• As of CDH 5.2, the following Spark-related artifacts are no longer published as part of the Cloudera repository:
– spark-assembly: The spark-assembly jar is used internally by Spark distributions when executing Spark
applications and should not be referenced directly. Instead, projects should add dependencies for those
parts of the Spark project that are being used, for example, spark-core.
– spark-yarn
– spark-tools
– spark-examples
– spark-repl
• Spark 1.2, on which CDH 5.3 is based, does not expose a transitive dependency on the Guava library. As a
result, projects that use Guava but don't explicitly add it as a dependency will need to be modified: the
dependency must be added to the project and also packaged with the job.
• The CDH 5.3 version of Spark 1.2 differs from the Apache Spark 1.2 release in using Akka version 2.2.3, the
version used by Spark 1.1 and CDH 5.2. Apache Spark 1.2 uses Akka version 2.3.4.
• The CDH 5.4 version of Spark 1.3 differs from the Apache Spark 1.3 release in using Akka version 2.2.3, the
version used by Spark 1.1 and CDH 5.2. Apache Spark 1.3 uses Akka version 2.3.4.
Important: For best practices, and solutions to known performance problems, see Improving
Performance.
What to do:
To avoid the problem: Do not upgrade to CDH 5.4.1; upgrade to CDH 5.4.2 instead.
If you experience the problem: If you have already started an upgrade and seen it fail, contact Cloudera Support.
This problem involves no risk of data loss, and manual recovery is possible.
If you have already completed an upgrade to CDH 5.4.1, or are installing a new cluster: In this case you are not
affected and can continue to run CDH 5.4.1.
— No in-place upgrade to CDH 5 from CDH 4
Cloudera fully supports upgrade from Cloudera Enterprise 4 and CDH 4 to Cloudera Enterprise 5. Upgrade requires
uninstalling the CDH 4 packages before installing CDH 5 packages. See the CDH 5 upgrade documentation for
instructions.
— Upgrading to CDH 5.4 or later requires an HDFS upgrade
Upgrading to CDH 5.4.0 or later from an earlier CDH 5 release requires an HDFS upgrade, and upgrading from a
release earlier than CDH 5.2.0 requires additional steps. See Upgrading from an Earlier CDH 5 Release to the
Latest Release for further information. See also What's New in CDH 5.4.0 on page 7.
— Upgrading from CDH 4 requires an HDFS upgrade
Upgrading from CDH 4 requires an HDFS upgrade. See Upgrading from CDH 4 to CDH 5 for further information.
See also What's New in CDH 5.4.0 on page 7.
— CDH 5 requires JDK 1.7
JDK 1.6 is not supported on any CDH 5 release, but before CDH 5.4.0, CDH libraries were compatible with
JDK 1.6. As of CDH 5.4.0, CDH libraries are no longer compatible with JDK 1.6, and applications using CDH
libraries must use JDK 1.7.
In addition, you must upgrade your cluster to a supported version of JDK 1.7 before upgrading to CDH 5. See
Upgrading to Oracle JDK 1.7 before Upgrading to CDH 5 for instructions.
— Extra step needed on Ubuntu Trusty if you add the Cloudera repository
If you install or upgrade CDH on Ubuntu Trusty using the command line, and add the Cloudera repository yourself
(rather than using the "1-click Install" method) you need to perform an additional step to ensure that you get
the CDH version of ZooKeeper, rather than the version that is bundled with Trusty. See Steps to Install CDH 5
Manually.
— No upgrade directly from CDH 3 to CDH 5
You must upgrade to CDH 4, then to CDH 5. See the CDH 4 documentation for instructions on upgrading from
CDH 3 to CDH 4.
— Upgrading hadoop-kms from 5.2.x and 5.3.x releases fails on SLES
Upgrading hadoop-kms fails on SLES when you try to upgrade an existing version from 5.2.x releases earlier
than 5.2.4, and from 5.3.x releases earlier than 5.3.2. For details and troubleshooting instructions, see
Troubleshooting: upgrading hadoop-kms from 5.2.x and 5.3.x releases on SLES.
After upgrading from a release earlier than CDH 4.6, you may see reports of corrupted files
Some older versions of CDH do not handle DataNodes with a large number of blocks correctly. The problem
exists on versions 4.6, 4.7, 4.8, 5.0, and 5.1. The symptom is that the NameNode Web UI and the fsck command
incorrectly report missing blocks, even when those blocks are present.
The cause of the problem is that if the DataNode attempts to send a block report that is larger than the maximum
RPC buffer size, the NameNode rejects the report. This prevents the NameNode from becoming aware of the
blocks on the affected DataNodes. The maximum buffer size is controlled by the ipc.maximum.data.length
property, which defaults to 64 MB.
This problem does not exist in CDH 4.5 and earlier because there is no maximum RPC buffer size in these versions.
Starting in CDH 5.2, DataNodes send individual block reports for each storage volume, which mitigates the
problem.
Bug: HADOOP-9676
Severity: Medium
Workaround: Immediately after upgrading, increase the value of ipc.maximum.data.length; Cloudera
recommends doubling the default value, from 64 MB to 128 MB:
<property>
<name>ipc.maximum.data.length</name>
<value>134217728</value>
</property>
HDFS
— Upgrade Requires an HDFS Upgrade
Upgrading from any release earlier than CDH 5.2.0 to CDH 5.2.0 or later requires an HDFS Upgrade.
— Optimizing HDFS Encryption at Rest Requires Newer openssl Library on Some Systems
CDH 5.3 implements the Advanced Encryption Standard New Instructions (AES-NI), which provide substantial
performance improvements. To get these improvements, you need a recent version of libcrypto.so on HDFS
and MapReduce client hosts -- that is, any host from which you originate HDFS or MapReduce requests. Many
OS versions have an older version of the library that does not support AES-NI.
See HDFS Data At Rest Encryption in the Encryption section of the Cloudera Security guide for instructions for
obtaining the right version.
— Other HDFS Encryption Known Issues
— Solr, Oozie and HttpFS fail when KMS and SSL are enabled using self-signed certificates
When the KMS service is added and SSL is enabled, Solr, Oozie and HttpFS are not automatically configured to
trust the KMS's self-signed certificate and you might see the following error.
Severity: Medium
Workaround: You must explicitly load the relevant truststore with the KMS certificate to allow these services
to communicate with the KMS.
Solr, Oozie: Add the following arguments to their environment safety valve so as to load the truststore with the
required KMS certificate.
CATALINA_OPTS="-Djavax.net.ssl.trustStore=/etc/path-to-truststore.jks
-Djavax.net.ssl.trustStorePassword=<password>"
HttpFS: Add the following arguments to the Java Configuration Options for HttpFS property.
-Djavax.net.ssl.trustStore=/etc/path-to-truststore.jks
-Djavax.net.ssl.trustStorePassword=<password>
— If you install CDH using packages, HDFS NFS gateway works out of the box only on RHEL-compatible systems
Because of a bug in native versions of portmap/rpcbind, the HDFS NFS gateway does not work out of the box
on SLES, Ubuntu, or Debian systems if you install CDH from the command-line, using packages. It does work on
supported versions of RHEL-compatible systems on which rpcbind-0.2.0-10.el6 or later is installed, and it
does work if you use Cloudera Manager to install CDH, or if you start the gateway as root.
Bug: 731542 (Red Hat), 823364 (SLES), 594880 (Debian)
Severity: High
Workarounds and caveats:
• On Red Hat and similar systems, make sure rpcbind-0.2.0-10.el6 or later is installed.
• On SLES, Debian, and Ubuntu systems, do one of the following:
– Install CDH using Cloudera Manager; or
– As of CDH 5.1, start the NFS gateway as root; or
– Start the NFS gateway without using packages; or
– You can use the gateway by running rpcbind in insecure mode, using the -i option, but keep in mind
that this allows anyone from a remote host to bind to the portmap.
— HDFS does not currently provide ACL support for the HDFS gateway
Bug: HDFS-6949
— No error when changing permission to 777 on .snapshot directory
Snapshots are read-only; running chmod 777 on the .snapshot directory does not change this, but does not
produce an error (though other illegal operations do).
Bug: HDFS-4981
Severity: Low
Workaround: None
— Snapshot operations are not supported by ViewFileSystem
Bug: None
Severity: Low
Workaround: None
Bug: HDFS-7489
Severity: Low
Workaround: Disable the directory scanner by setting dfs.datanode.directoryscan.interval to -1.
— The active NameNode will not accept an fsimage sent from the standby during rolling upgrade
The result is that the NameNodes fail to checkpoint until the upgrade is finalized.
Note:
Rolling upgrade is supported only for clusters managed by Cloudera Manager; you cannot do a rolling
upgrade in a command-line-only deployment.
Bug: HDFS-7185
Severity: Medium
Workaround: None.
— On a DataNode with a large number of blocks, the block report may exceed the maximum RPC buffer size
Bug: None
Workaround: Increase the value ipc.maximum.data.length in hdfs-site.xml:
<property>
<name>ipc.maximum.data.length</name>
<value>268435456</value>
</property>
MapReduce, YARN
Unsupported Features
The following features are not currently supported:
• FileSystemRMStateStore: Cloudera recommends you use ZKRMStateStore (ZooKeeper-based implementation)
to store the ResourceManager's internal state for recovery on restart or failover. Cloudera does not support
the use of FileSystemRMStateStore in production.
• ApplicationTimelineServer (also known as Application History Server): Cloudera does not support
ApplicationTimelineServer v1. ApplicationTimelineServer v2 is under development and Cloudera does not
currently support it.
• Scheduler Reservations: Scheduler reservations are currently at an experimental stage, and Cloudera does
not support their use in production.
• Scheduler node-labels: Node-labels are currently experimental with CapacityScheduler. Cloudera does not
support their use in production.
— Starting an unmanaged ApplicationMaster may fail
Starting a custom Unmanaged ApplicationMaster may fail due to a race in getting the necessary tokens.
Bug: YARN-1577
Severity: Low
Workaround: Try to get the tokens again; the custom unmanaged ApplicationMaster should be able to fetch the
necessary tokens and start successfully.
— Job movement between queues does not persist across ResourceManager restart
CDH 5 adds the capability to move a submitted application to a different scheduler queue. This queue placement
is not persisted across ResourceManager restart or failover, which resumes the application in the original queue.
Bug: YARN-1558
Severity: Medium
— Hadoop Pipes may not be usable in an MRv1 Hadoop installation done through tarballs
Under MRv1, MapReduce's C++ interface, Hadoop Pipes, may not be usable with a Hadoop installation done
through tarballs unless you build the C++ code on the operating system you are using.
Bug: None
Severity: Medium
Workaround: Build the C++ code on the operating system you are using. The C++ code is present under src/c++
in the tarball.
— Task-completed percentage may be reported as slightly under 100% in the web UI, even when all of a job's
tasks have successfully completed.
Bug: None
Severity: Low
Workaround: None
— Spurious warning in MRv1 jobs
The mapreduce.client.genericoptionsparser.used property is not correctly checked by JobClient and
this leads to a spurious warning.
Bug: None
Severity: Low
Workaround: MapReduce jobs using GenericOptionsParser or implementing Tool can remove the warning
by setting this property to true.
— Oozie workflows will not be recovered in the event of a JobTracker failover on a secure cluster
Delegation tokens created by clients (via JobClient#getDelegationToken()) do not persist when the JobTracker
fails over. This limitation means that Oozie workflows will not be recovered successfully in the event of a failover
on a secure cluster.
Bug: None
Severity: Medium
Workaround: Re-submit the workflow.
— Encrypted shuffle in MRv2 does not work if used with LinuxContainerExecutor and encrypted web UIs.
In MRv2, if the LinuxContainerExecutor is used (usually as part of Kerberos security), and hadoop.ssl.enabled
is set to true (See Configuring Encrypted Shuffle, Encrypted Web UIs, and Encrypted HDFS Transport), then the
encrypted shuffle does not work and the submitted job fails.
Bug: MAPREDUCE-4669
Severity: Medium
Workaround: Use encrypted shuffle with Kerberos security without encrypted web UIs, or use encrypted shuffle
with encrypted web UIs without Kerberos security.
— Link from ResourceManager to Application Master does not work when the Web UI over HTTPS feature is
enabled.
In MRv2 (YARN), if hadoop.ssl.enabled is set to true (use HTTPS for web UIs), then the link from the
ResourceManager to the running MapReduce Application Master fails with an HTTP Error 500 because of a PKIX
exception.
A job can still be run successfully, and, when it finishes, the link to the job history does work.
Bug: YARN-113
Severity: Low
Workaround: Don't use encrypted web UIs.
— Hadoop client JARs don't provide all the classes needed for clean compilation of client code
The compile does succeed, but you may see warnings as in the following example:
Note: This means that the example at the bottom of the page on managing Hadoop API dependencies
(see "Using the CDH 4 Maven Repository" under CDH Version and Packaging Information) will produce
a similar warning.
Bug:
Severity: Low
Workaround: None
— The ulimits setting in /etc/security/limits.conf is applied to the wrong user if security is enabled.
Bug: https://issues.apache.org/jira/browse/DAEMON-192
Severity: Low
Anticipated Resolution: None
Workaround: To increase the ulimits applied to DataNodes, you must change the ulimit settings for the root
user, not the hdfs user.
— Must set yarn.resourcemanager.scheduler.address to a routable host:port when submitting a job from the
ResourceManager
When you submit a job from the ResourceManager, yarn.resourcemanager.scheduler.address must be
set to a real, routable address, not the wildcard 0.0.0.0.
Bug: None
Severity: Low
Workaround: Set the address, in the form host:port, either in the client-side configuration, or on the command
line when you submit the job.
— Amazon S3 copy may time out
The Amazon S3 filesystem does not support renaming files, and performs a copy operation instead. If the file
to be moved is very large, the operation can time out because S3 does not report progress to the TaskTracker
during the operation.
Bug: MAPREDUCE-972
Severity: Low
Workaround: Use -Dmapred.task.timeout=15000000 to increase the MR task timeout.
Task Controller Changed from DefaultTaskController to LinuxTaskController
In CDH 5, the MapReduce task controller is changed from DefaultTaskController to LinuxTaskController.
The new task controller has different directory ownership requirements which can cause jobs to fail. You can
switch back to DefaultTaskController by adding the following to the MapReduce Advanced Configuration
Snippet if you use Cloudera Manager, or directly to mapred-default.xml otherwise.
<property>
<name>mapreduce.tasktracker.taskcontroller</name>
<value>org.apache.hadoop.mapred.DefaultTaskController</value>
</property>
As of CDH 5.4.0, hadoop-test.jar has been renamed to hadoop-test-mr1.jar. This JAR file contains the
mrbench, TestDFSIO, and nnbench tests.
Bug: None
Workaround: None.
Workaround: None, but see Checksums in the HBase section of the Cloudera Installation and Upgrade guide.
Must explicitly add permissions for owner users before upgrading from 4.1.x
In CDH 4.1.x, an HBase table could have an owner. The owner user had full administrative permissions on the
table (RWXCA). These permissions were implicit (that is, they were not stored explicitly in the HBase acl table),
but the code checked them when determining if a user could perform an operation.
The owner construct was removed as of CDH 4.2.0, and the code now relies exclusively on entries in the acl
table. Since table owners do not have an entry in this table, their permissions are removed on upgrade from
CDH 4.1.x to CDH 4.2.0 or later.
Bug: None
Severity: Medium
Anticipated Resolution: None; use workaround
Workaround: Add permissions for owner users before upgrading from CDH 4.1.x. You can automate the task
of making the owner users' implicit permissions explicit, using code similar to the following. (Note that this
snippet is intended only to give you an idea of how to proceed; it may not compile and run as it stands.)
PERMISSIONS = 'RWXCA'
tables.each do |t|
  table_name = t.getNameAsString
  owner = t.getOwnerString
  LOG.warn("Granting " + owner + " with " + PERMISSIONS + " for table " + table_name)
  # The original snippet is truncated at this call; the remaining arguments below are an
  # assumed completion and must be checked against the UserPermission API in your release.
  user_permission = UserPermission.new(owner.to_java_bytes, table_name.to_java_bytes,
                                       PERMISSIONS.to_java_bytes)
end
Bug: None
Severity: Medium
Anticipated Resolution: None; use workaround
Workaround: If you find you are getting too many splits, either go back to the old split policy or increase the
hbase.hregion.memstore.flush.size.
— In a cluster where the HBase directory in HDFS is encrypted, an IOException can occur if the BulkLoad staging
directory is not in the same encryption zone as the HBase root directory.
If you have encrypted the HBase root directory (hbase.rootdir) and you attempt a BulkLoad where the staging
directory is in a different encryption zone from the HBase root directory, you may encounter errors such as:
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
/tmp/output/f/5237a8430561409bb641507f0c531448 can't be moved into an encryption zone.
Bug: None
Anticipated Resolution: None; use workaround
Severity: Medium
Workaround: Configure hbase.bulkload.staging.dir to point to a location within the same encryption zone
as the HBase root directory.
— In a non-secure cluster, MapReduce over HBase does not properly handle splits in the BulkLoad case
You may see errors because of:
• missing permissions on the directory that contains the files to bulk load
• missing ACL rights for the table/families
Bug: None
Anticipated Resolution: None; use workaround
Severity: Medium
Workaround: In a non-secure cluster, execute BulkLoad as the hbase user.
Note: For important information about configuration that is required for BulkLoad in a secure cluster
as of CDH 4.3, see the Apache HBase Incompatible Changes on page 59 subsection under Incompatible
Changes in these Release Notes.
— Pluggable compaction and scan policies via coprocessors (HBASE-6427) not supported
Cloudera does not provide support for user-provided custom coprocessors.
Bug: HBASE-6427
Severity: Low
Workaround: None
— Custom constraints coprocessors (HBASE-4605) not supported
The constraints coprocessor feature provides a framework for constraints and requires you to add your own
custom code. Cloudera does not support user-provided custom code, and hence does not support this feature.
Bug: HBASE-4605
Severity: Low
Workaround: None
— Pluggable split key policy (HBASE-5304) not supported
Cloudera supports the two split policies that are supplied and tested: ConstantSizeSplitPolicy and
PrefixSplitKeyPolicy. The code also provides a mechanism for custom policies that are specified by adding
a class name to the HTableDescriptor. Custom code added via this mechanism must be provided by the user.
Cloudera does not support user-provided custom code, and hence does not support this feature.
Bug: HBASE-5304
Severity: Low
Workaround: None
— HBase may not tolerate HDFS root directory changes
While HBase is running, do not stop the HDFS instance running under it and restart it again with a different root
directory for HBase.
Bug: None
Severity: Medium
Workaround: None
— AccessController postOperation problems in asynchronous operations
When security and Access Control are enabled, the following problems occur:
• If a Delete Table fails for a reason other than missing permissions, the access rights are removed but the
table may still exist and may be used again.
• If hbaseAdmin.modifyTable() is used to delete column families, the rights are not removed from the Access
Control List (ACL) table. The postOperation is implemented only for postDeleteColumn().
• If Create Table fails, full rights for that table persist for the user who attempted to create it. If another
user later succeeds in creating the table, the user who made the failed attempt still has the full rights.
Bug: HBASE-6992
Severity: Medium
Workaround: None
— Native library not included in tarballs
The native library that enables Region Server page pinning on Linux is not included in tarballs. This could impair
performance if you install HBase from tarballs.
Bug: None
Severity: Low
Workaround: None
Note: As of CDH 5, HCatalog is part of Apache Hive; HCatalog known issues are included below.
— Hive upgrade from CDH 5.0.5 fails on Debian 7.0 if a Sentry 5.0.x release is installed
Upgrading Hive from CDH 5.0.5 to CDH 5.4, 5.3, or 5.2 fails if a Sentry version later than 5.0.4 and earlier
than 5.1.0 is installed. You will see an error such as the following:
: error processing
/var/cache/apt/archives/hive_0.13.1+cdh5.2.0+221-1.cdh5.2.0.p0.32~precise-cdh5.2.0_all.deb
Important: Hive on Spark is included in CDH 5.4 but is not currently supported or recommended for
production use. If you are interested in this feature, try it out in a test environment until we address
the issues and limitations needed for production readiness.
— Hive creates an invalid table if you specify more than one partition with alter table
Hive (in all known versions from 0.7) allows you to configure multiple partitions with a single alter table
command, but the configuration it creates is invalid for both Hive and Impala.
Bug: None
Severity: Medium
Resolution: Use workaround.
Workaround:
Correct results can be obtained by configuring each partition with its own alter table command in either Hive
or Impala, as shown in the sketch below.
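A minimal sketch of that workaround, using a hypothetical table t1 with partition column p:

-- Instead of adding several partitions in one alter table statement, issue one per partition:
ALTER TABLE t1 ADD PARTITION (p = 1);
ALTER TABLE t1 ADD PARTITION (p = 2);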
at org.datanucleus.jdo.JDOQuery.execute(JDOQuery.java:252)
at org.apache.hadoop.hive.metastore.ObjectStore.getTables(ObjectStore.java:759)
... 28 more
Caused by: org.postgresql.util.PSQLException: ERROR: invalid escape string
Hint: Escape string must be empty or one character.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2096)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1829)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:510)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:386)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:271)
at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
at org.datanucleus.store.rdbms.SQLController.executeStatementQuery(SQLController.java:457)
at org.datanucleus.store.rdbms.query.legacy.SQLEvaluator.evaluate(SQLEvaluator.java:123)
at org.datanucleus.store.rdbms.query.legacy.JDOQLQuery.performExecute(JDOQLQuery.java:288)
at org.datanucleus.store.query.Query.executeQuery(Query.java:1657)
at org.datanucleus.store.rdbms.query.legacy.JDOQLQuery.executeQuery(JDOQLQuery.java:245)
at org.datanucleus.store.query.Query.executeWithArray(Query.java:1499)
at org.datanucleus.jdo.JDOQuery.execute(JDOQuery.java:243)
... 29 more
— Queries spawned from MapReduce jobs in MRv1 fail if mapreduce.framework.name is set to yarn
Queries spawned from MapReduce jobs fail in MRv1 with a null pointer exception (NPE) if
/etc/mapred/conf/mapred-site.xml has the following:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Bug: None
Severity: High
Resolution: Use workaround
Workaround: Remove the mapreduce.framework.name property from mapred-site.xml.
— Commands run against an Oracle-backed Metastore may fail
Commands run against an Oracle-backed Metastore fail with error:
This error may occur if the metastore is run on top of an Oracle database with the configuration property
datanucleus.validateColumns set to true.
Bug: None
Severity: Low
Workaround: Set datanucleus.validateColumns=false in the hive-site.xml configuration file.
— Hive, Pig, and Sqoop 1 fail in MRv1 tarball installation because /usr/bin/hbase sets HADOOP_MAPRED_HOME
to MR2
This problem affects tarball installations only.
Bug: None
Severity: High
Resolution: Use workaround.
Workaround: If you are using MRv1, edit the following line in /etc/default/hadoop from
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
to
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
Bug: None
Severity: Low
Resolution: No software fix planned; use the workaround.
Workaround: Modify /etc/hive/conf/hive-site.xml to allow the temporary directory and warehouse directory
to use the same ViewFS mount table. For example, if the warehouse directory is /user/hive/warehouse, add
the following property to /etc/hive/conf/hive-site.xml so both directories use the ViewFS mount table
for /user.
<property>
<name>hive.exec.scratchdir</name>
<value>/user/${user.name}/tmp</value>
</property>
— Cannot create archive partitions with external HAR (Hadoop Archive) tables
ALTER TABLE ... ARCHIVE PARTITION is not supported on external tables.
Bug: None
Severity: Low
Workaround: None
— Setting hive.optimize.skewjoin to true causes long running queries to fail
Bug: None
Severity: Low
Workaround: None
— JDBC - executeUpdate does not return the number of rows modified
Contrary to the documentation, the executeUpdate method always returns zero.
Severity: Low
Workaround: None
— Hive Auth (Grant/Revoke/Show Grant) statements do not support fully qualified table names (default.tab1)
Bug: None
Severity: Low
Workaround: Switch to the database before granting privileges on the table.
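For example, a sketch using the table from the title above and a hypothetical role name:
USE default;
GRANT SELECT ON TABLE tab1 TO ROLE analyst_role;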
— Object types Server and URI are not supported in "SHOW GRANT ROLE roleName on OBJECT objectName"
Bug: None
Severity: Low
Workaround: Use SHOW GRANT ROLE roleName to list all privileges granted to the role.
Hue Known Issues
— Configuring more than one NT domain does not work in CDH 5.4.0
Trying to add users and groups using the multi-NT domain feature
(http://gethue.com/hadoop-tutorial-make-hadoop-more-accessible-by-integrating-multiple-ldap-servers/)
produces an error.
Bug: HUE-2665
Workaround: None.
— Migrations to MySQL fail if multiple Hue users have the same name with different capitalization
Bug: None
Severity: Medium
Workaround: None.
— Hue hangs or fails because the SQLite database is overloaded, returning "database is locked"
Hue can hang or fail because the SQLite database is overloaded, returning a database is locked error.
Bug: None
Severity: Medium
Workaround: Do one of the following:
• Increase the timeout setting in [desktop][[database]] in the Hue configuration file, OR
• Use a different database.
— Importing Hue data to MySQL can cause columns to be truncated on import
Importing Hue data to MySQL can cause columns to be truncated on import, displaying Warning: Data
truncated for column 'name' at row 1
Bug: None
Severity: Medium
Workaround: In the /etc/my.cnf file, configure the database operation to fail rather than truncate data:
[mysqld]
sql_mode=STRICT_ALL_TABLES
Resolution: The underlying cause is the issue HIVE-8648 that affects the metastore in Hive 0.13. The workaround
is only needed until the fix for this issue is incorporated into a CDH release.
ORDER BY rand() does not work.
Because the value for rand() is computed early in a query, using an ORDER BY expression involving a call to
rand() does not actually randomize the results.
Bug: IMPALA-397
Severity: High
Impala BE cannot parse Avro schema that contains a trailing semi-colon
If an Avro table has a schema definition with a trailing semicolon, Impala encounters an error when the table is
queried.
Bug: IMPALA-1024
Severity: High
Process mem limit does not account for the JVM's memory usage
Some memory allocated by the JVM used internally by Impala is not counted against the memory limit for the
impalad daemon.
Bug: IMPALA-691
Severity: High
Workaround: To monitor overall memory usage, use the top command, or add the memory figures in the Impala
web UI /memz tab to JVM memory usage shown on the /metrics tab.
Impala Parser issue when using fully qualified table names that start with a number.
A fully qualified table name starting with a number could cause a parsing error. In a name such as db.571_market,
the decimal point followed by digits is interpreted as a floating-point number.
Bug: IMPALA-941
Severity: High
Workaround: Surround each part of the fully qualified name with backticks (``).
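For example, using the table name from the description above:
SELECT * FROM `db`.`571_market`;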
CatalogServer should not require HBase to be up to reload its metadata
If HBase is unavailable during Impala startup or after an INVALIDATE METADATA statement, the catalogd
daemon could go into an error loop, making Impala unresponsive.
Bug: IMPALA-788
Severity: High
Workaround: For systems not managed by Cloudera Manager, add the following settings to
/etc/impala/conf/hbase-site.xml:
<property>
<name>hbase.client.retries.number</name>
<value>3</value>
</property>
<property>
<name>hbase.rpc.timeout</name>
<value>3000</value>
</property>
Currently, Cloudera Manager does not have an Impala-only override for HBase settings, so any HBase configuration
change you make through Cloudera Manager would take effect for all HBase applications. Therefore, this change
is not recommended on systems managed by Cloudera Manager.
Kerberos tickets must be renewable
In a Kerberos environment, the impalad daemon might not start if Kerberos tickets are not renewable.
Workaround: Configure your KDC to allow tickets to be renewed, and configure krb5.conf to request renewable
tickets.
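A minimal sketch of the relevant krb5.conf settings (the lifetimes are illustrative, and your KDC must also
permit renewable tickets, for example through the principals' maxrenewlife attribute):
[libdefaults]
ticket_lifetime = 24h
renew_lifetime = 7d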
Avro Scanner fails to parse some schemas
Querying certain Avro tables could cause a crash or return no rows, even though Impala could DESCRIBE the
table.
Bug: IMPALA-635
Severity: High
Workaround: Swap the order of the fields in the schema specification. For example, ["null", "string"]
instead of ["string", "null"].
Resolution: Disallowing this syntax agrees with the Avro specification, so such a schema may still cause an error
even after the crashing issue is resolved.
beeswax_meta_server_only=9004
OR
• If you are using CDH 4.1.1 and you want to install Hue and Impala on the same host, change the code in this
file:
/usr/share/hue/apps/beeswax/src/beeswax/management/commands/beeswax_server.py
from
str(beeswax.conf.BEESWAX_SERVER_PORT.get()),
to
'8004',
Note:
If you used Cloudera Manager to install Impala, refer to the Cloudera Manager release notes for
information about using an equivalent workaround by specifying the
beeswax_meta_server_only=9004 configuration value in the options field for Hue. In Cloudera
Manager 4, these fields are labelled Safety Valve; in Cloudera Manager 5, they are called Advanced
Configuration Snippet.
Bug: None
Severity: Medium
Workaround: None; if necessary, reduce the number of partitions in the table.
— parquet-thrift cannot read Parquet data written by Hive
parquet-thrift cannot read Parquet data written by Hive, and parquet-avro will show an additional record
level in lists named array_element.
Bug: PARQUET-113
Severity: Medium
Workaround: None; arrays written by parquet-avro or parquet-thrift cannot currently be read by
parquet-hive.
To enable Solr ZooKeeper ACLs while retaining the existing cluster's Solr state, manually modify the existing
znode's ACL information. For example, using zookeeper-client, run the command setAcl [path]
sasl:solr:cdrwa,world:anyone:r. This grants the solr user ownership of the specified path. Run this
command for /solr and every znode under /solr except for the configuration znodes under and including
/solr/configs.
To enable Lily HBase Indexer while retaining the existing HBase-Indexer state, manually modify the existing
znode's ACL information. For example, using zookeeper-client, run the command setAcl
[path] sasl:hbase:cdrwa,world:anyone:r. This grants the hbase user ownership of every znode under
/ngdata (inclusive of /ngdata).
Note: This operation is not recursive, so creating a simple script may be helpful.
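For example, a minimal sketch of such a script (the ZooKeeper host and the znode list are illustrative; enumerate
the actual children of /solr in your deployment, skipping /solr/configs and everything under it):
for znode in /solr /solr/aliases.json /solr/clusterstate.json /solr/live_nodes; do
  zookeeper-client -server zk01.example.com:2181 setAcl $znode sasl:solr:cdrwa,world:anyone:r
done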
— Solr, Oozie and HttpFS fail when KMS and SSL are enabled using self-signed certificates
When the KMS service is added and SSL is enabled, Solr, Oozie and HttpFS are not automatically configured to
trust the KMS's self-signed certificate and you might see the following error.
Severity: Medium
Workaround: You must explicitly load the relevant truststore with the KMS certificate to allow these services
to communicate with the KMS.
Solr, Oozie: Add the following arguments to their environment safety valve so as to load the truststore with the
required KMS certificate.
CATALINA_OPTS="-Djavax.net.ssl.trustStore=/etc/path-to-truststore.jks
-Djavax.net.ssl.trustStorePassword=<password>"
HttpFS: Add the following arguments to the Java Configuration Options for HttpFS property.
-Djavax.net.ssl.trustStore=/etc/path-to-truststore.jks
-Djavax.net.ssl.trustStorePassword=<password>
— CrunchIndexerTool which includes Spark indexer requires specific input file format specifications
If the --input-file-format option is specified with CrunchIndexerTool then its argument must be text, avro,
or avroParquet, rather than a fully qualified class name.
— Previously deleted empty shards may reappear after restarting the leader host
It is possible to be in the process of deleting a collection when hosts are shut down. In such a case, when hosts
are restarted, some shards from the deleted collection may still exist, but be empty.
Workaround: To delete these empty shards, manually delete the folder matching the shard. On the hosts on
which the shards exist, remove folders under /var/lib/solr/ that match the collection and shard. For example,
if you had an empty shard 1 and empty shard 2 in a collection called MyCollection, you might delete all folders
matching /var/lib/solr/MyCollection{1,2}_replica*/.
— The quickstart.sh file does not validate ZooKeeper and the NameNode on some operating systems
The quickstart.sh file uses the timeout function to determine if ZooKeeper and the NameNode are available.
To ensure this check can be completed as intended, quickstart.sh determines whether the operating system on
which the script is running supports timeout. If the script detects that the operating system does not support
timeout, the script continues without checking if the NameNode and ZooKeeper are available. If your environment
is configured properly or you are using an operating system that supports timeout, this issue does not apply.
Workaround: This issue only occurs on some operating systems. If timeout is not available, a warning is displayed,
but the quickstart continues; final validation is always done by the MapReduce jobs and Solr commands
that are run by the quickstart.
— Field value class guessing and automatic schema field addition are not supported with either the
MapReduceIndexerTool or the HBaseMapReduceIndexerTool
The MapReduceIndexerTool and the HBaseMapReduceIndexerTool can be used with a Managed Schema created
via NRT indexing of documents or via the Solr Schema API. However, neither tool supports adding fields
automatically to the schema during ingest.
— The “Browse” and “Spell” Request Handlers are not enabled in schemaless mode
The “Browse” and “Spell” Request Handlers require certain fields be present in the schema. Since those fields
cannot be guaranteed to exist in a Schemaless setup, the “Browse” and “Spell” Request Handlers are not enabled
by default.
Workaround: If you require the “Browse” and “Spell” Request Handlers, add them to the solrconfig.xml
configuration file. Generate a non-schemaless configuration to see the usual settings and modify the required
fields to fit your schema.
— Using Solr with Sentry may consume more memory than required
The Sentry-enabled solrconfig.xml.secure configuration file does not enable the HDFS global block cache.
This does not cause correctness issues, but it can greatly increase the amount of memory that Solr requires.
Workaround: Enable the HDFS global block cache by adding the following line to solrconfig.xml.secure under
the directoryFactory element:
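The configuration line itself is not shown above. A sketch of the likely entry, assuming the standard
HdfsDirectoryFactory parameter name solr.hdfs.blockcache.global, is:
<bool name="solr.hdfs.blockcache.global">true</bool>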
— Solr fails to start when Trusted Realms are added for Solr in Cloudera Manager
Cloudera Manager generates name rules with spaces as a result of entries in the Trusted Realms. Solr does not
accept name rules with spaces, so Solr fails to start.
Workaround: Do not use the Trusted Realm field for Solr in Cloudera Manager. To write your own name rule
mapping, add an environment variable SOLR_AUTHENTICATION_KERBEROS_NAME_RULES with the mapping. See
the Cloudera Manager Security Guide for more information.
at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
at com.ngdata.hbaseindexer.mr.HBaseMapReduceIndexerTool.run(HBaseMapReduceIndexerTool.java:124)
at com.ngdata.hbaseindexer.mr.HBaseMapReduceIndexerTool.run(HBaseMapReduceIndexerTool.java:64)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.ngdata.hbaseindexer.mr.HBaseMapReduceIndexerTool.main(HBaseMapReduceIndexerTool.java:51)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
— Users with insufficient Solr permissions may receive a "Page Loading" message from the Solr Web Admin UI
Users who are not authorized to use the Solr Admin UI are not given a page explaining that access is denied;
instead they receive a web page that never finishes loading.
Workaround: None
— Some configurations for Lily HBase Indexers cannot be modified after initial creation.
Newly created Lily HBase Indexers define their configuration using the properties in
/etc/hbase-solr/conf/hbase-indexer-site.xml. Therefore, if the properties in the
hbase-indexer-site.xml file are incorrectly defined, new indexers do not work properly. Even after correcting
the contents of hbase-indexer-site.xml and restarting the indexer service, old, incorrect content persists.
This continues to create non-functioning indexers.
Workaround:
Warning: This workaround involves completing destructive operations that delete all of your other
Lily HBase Indexers.
$ /usr/lib/zookeeper/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] rmr /ngdata
After restarting the client services, ZooKeeper is updated with the correct information stored on the updated
clients.
— Users with update access to the administrative collection can elevate their access.
Users are granted access to collections. Access to several collections can be simplified by aliasing a set of
collections. Creating an alias requires update access to the administrative collection. Any user with update
access to the administrative collection is granted query access to all collections in the resulting alias. This is
true even if the user with update access to the administrative collection otherwise would be unable to query
the other collections that have been aliased.
Workaround: None. Mitigate the risk by limiting the users with update access to the administrative collection.
Note:
CDH defaults to hash-based shuffle.
Bug: SPARK-3948
— Spark does not automatically pick up hive-site.xml
When you run Spark SQL on YARN, the client hive-site.xml is not picked up automatically by
spark-submit.
Bug: SPARK-2669
Severity: Low
Workaround: Do one of the following, depending on which mode you are running in:
• If you are running in yarn-client mode, set HADOOP_CONF_DIR to /etc/hive/conf/ (or the directory where
your hive-site.xml is located).
• If you are running in yarn-cluster mode, the easiest approach is to add
--files=/etc/hive/conf/hive-site.xml (or the path for your hive-site.xml) to your spark-submit
script, as in the sketch below.
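For example, a minimal sketch (the application class and JAR are hypothetical):
spark-submit --master yarn-cluster \
  --files /etc/hive/conf/hive-site.xml \
  --class com.example.MyApp myapp.jar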
— Streaming incompatibility between Spark 1.2 and 1.3
Applications built as a JAR with dependencies (a "fat JAR") must be built for the specific version of Spark running
on the cluster.
Bug: None
Workaround: Rebuild the JAR with the Spark dependencies in pom.xml pointing to the specific version of Spark
running on the target cluster.
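For example, a sketch of a Maven dependency pinned to the cluster's Spark build (the version string assumes the
target cluster runs Spark 1.3.0 from CDH 5.4.0):
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.3.0-cdh5.4.0</version>
<scope>provided</scope>
</dependency>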
— Spark does not support Scala 2.11
CDH does not currently support Spark on Scala 2.11 because the Scala 2.11 build is binary-incompatible and not
yet full-featured.
— Spark Sink requires spark-assembly.jar in Flume classpath
In CDH 5.4.0, Flume requires spark-assembly.jar in the Flume classpath to use the Spark Sink. Without it,
the sink fails with a dependency issue.
Bug: SPARK-7038
Workaround: Use the Spark Sink from CDH 5.3.x with Spark from CDH 5.4, or add spark-assembly.jar to your
FLUME_CLASSPATH.
Job with id 1 and name job1 (Enabled: true, Created by null at 5/13/15 3:05 PM, Updated
by null at 5/13/15 6:04 PM)
Throttling resources
Extractors:
Loaders:
From link: 1
From database configuration
Schema name: schema1
Table name: tab1
Table SQL statement:
Table column names: col1
Partition column name:
Null value allowed for the partition column: false
Boundary query:
Incremental read
Check column:
Last value:
To link: 2
To database configuration
Schema name: schema2
Table name: tab2
Table SQL statement:
Table column names: col2
Stage table name:
Should clear stage table:
Bug: None
Workaround: Before upgrading, make sure no jobs have source and destination links that point to the same
connector.
Workaround: Delete the *Epoch files if this situation occurs — the version 3.4 server will recreate them as in
case 1) above.
Note: For links to the detailed change lists that describe the bug fixes and improvements to all of
the CDH 5 projects, including bug-fix reports for the corresponding upstream Apache projects, see
the packaging section of CDH Version and Packaging Information.
• HIVE-10476 - Hive query should fail when it fails to initialize a session in SetSparkReducerParallelism
• HIVE-10434 - Cancel connection when remote Spark driver process has failed
• HIVE-10473 - Spark client is recreated even spark configuration is not changed
• HIVE-10291 - Hive on Spark job configuration needs to be logged
• HIVE-10143 - HS2 fails to clean up Spark client state on timeout
• HIVE-10073 - Runtime exception when querying HBase with Spark
• HUE-2723 - [hive] Listing table information in non default DB fails
• HUE-2722 - [hive] Query returns wrong number of rows when HiveServer2 returns data not encoded properly
• HUE-2713 - [oozie] Deleting a Fork of Fork can break the workflow
• HUE-2717 - [oozie] Coordinator editor does not save non-default schedules
• HUE-2716 - [pig] Scripts fail on hcat auth with org.apache.hive.hcatalog.pig.HCatLoader()
• HUE-2707 - [hive] Allow sample of data on partitioned tables in strict mode
• HUE-2720 - [oozie] Intermittent 500s when trying to view oozie workflow history v1
• HUE-2712 - [oozie] Creating a fork can error
• HUE-2710 - [search] Heatmap select on yelp example errors
• HUE-2686 - [impala] Explain button is erroring
• HUE-2671 - [core] sync_groups_on_login doesn't work with NT Domain
• IMPALA-1519/IMPALA-1946 - Fix wrapping of exprs via a TupleIsNullPredicate with analytics.
• IMPALA-1900 - Assign predicates below analytic functions with a compatible partition by clause for partition
pruning.
• IMPALA-1919 - When out_batch->AtCapacity(), avoid calling ProcessBatch in right joins.
• IMPALA-1960 - Illegal reference to non-materialized tuple when query has an empty select-project-join
block.
• IMPALA-1969 - OpenSSL init must not be called concurrently.
• IMPALA-1973 - Fixing crash when uninitialized, empty row is added in HdfsTextScanner due to missing
newline at the end of file.
• OOZIE-2218 - META-INF directories in the war file have 777 permissions
• OOZIE-2170 - Oozie should automatically set configs to make Spark jobs show up in the Spark History Server
• SENTRY-699 - Memory leak when running Sentry with HiveServer2
• SENTRY-703 - Calls to add_partition fail when passed a Partition object with a null location
• SENTRY-696 - Improve Metastoreplugin Cache Initialization time
• SENTRY-683 - HDFS service client should ensure the kerberos ticket is valid before new service connection
• SOLR-7478 - UpdateLog#close shuts down it's executor with interrupts before running close, possibly
preventing a clean close.
• SOLR-7437 - Make HDFS transaction log replication factor configurable.
• SOLR-7338/SOLR-6583 - A reloaded core will never register itself as active after a ZK session expiration.
• SPARK-7281 - No option for AM native library path in yarn-client mode.
• SPARK-6087 - Provide actionable exception if Kryo buffer is not large enough
• SPARK-6868 - Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
• SPARK-6506 - python support in yarn cluster mode requires SPARK_HOME to be set
• SPARK-6650 - ExecutorAllocationManager never stops
• SPARK-6578 - Outbound channel in network library is not thread-safe, can lead to fetch failures
• SQOOP-2343 - AsyncSqlRecordWriter stuck if any exception is thrown out in its close method
• SQOOP-2286 - Ensure Sqoop generates valid avro column names
• SQOOP-2283 - Support usage of --exec and --password-alias
• SQOOP-2281 - Set overwrite on kite dataset
• SQOOP-2282 - Add validation check for --hive-import and --append
• SQOOP-2257 - Import Parquet data into a hive table with --hive-overwrite option does not work
• ZOOKEEPER-2146 - BinaryInputArchive readString should check length before allocating memory
• ZOOKEEPER-2149 - Log client address when socket connection established
— Index table names are returned as database_name.index_table_name rather than index_table_name
Bug: HIVE-10108
Workaround: None
— HiveServer2 has an unexpected Derby metastore directory in secure clusters
Bug: HIVE-10093
Workaround: None; ignore the Derby database.
Apache Oozie
— Spark jobs run from the Spark action don't show up in the Spark History Server or properly link to it from the Spark AM
Bug: OOZIE-2170
Severity: Low
Workaround: Specify these configuration properties in the spark-opts element of your Spark action in the
workflow.xml file:
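The properties themselves are not shown above. A sketch of the expected spark-opts content, assuming the
standard Spark history-server properties and default ports, is:
<spark-opts>
  --conf spark.yarn.historyServer.address=http://SPH:18088
  --conf spark.eventLog.dir=hdfs://NN:8020/user/spark/applicationHistory
  --conf spark.eventLog.enabled=true
</spark-opts>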
where SPH is the hostname of the Spark History Server and NN is the hostname of the NameNode. You can
also find these values in /etc/spark/conf/spark-defaults.conf on the gateway host when Spark is installed
from Cloudera Manager.
With Search 5.4 for CDH, these tags are no longer required for definitions to be included. The tags are still
supported, so either style may be used.
Bug: SOLR-5228
Apache Sentry (incubating)
— INSERT OVERWRITE LOCAL fails if you use only the Linux pathname
Bug: None
Severity: Low
Workaround: Prefix the path of the local file with file:// when using INSERT OVERWRITE LOCAL.
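For example, a sketch with a hypothetical local path and table name:
INSERT OVERWRITE LOCAL DIRECTORY 'file:///tmp/hive_export' SELECT * FROM tab1;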
— INSERT OVERWRITE and CREATE EXTERNAL commands fail because of HDFS URI permissions
When you use Sentry to secure Hive, and use HDFS URIs in a HiveQL statement, the query will fail with an HDFS
permissions error unless you specify the NameNode and port.
Bug: None
Severity: Low
Workaround: Specify the NameNode and port, where applicable, in the URI; for example specify
hdfs://nn-uri:port/user/warehouse/hive/tab rather than simply /user/warehouse/hive/tab. In a
high-availability deployment, specify the value of fs.defaultFS.
— After upgrade from a release earlier than CDH 5.2.0, storage IDs may no longer be unique
As of CDH 5.2, each storage volume on a DataNode should have its own unique storageID, but in clusters
upgraded from CDH 4, or CDH 5 releases earlier than CDH 5.2.0, each volume on a given DataNode shares the
same storageID, because the HDFS upgrade does not properly update the IDs to reflect the new naming scheme.
This causes problems with load balancing. The problem affects only clusters upgraded from CDH 5.1.x and earlier
to CDH 5.2 or later. Clusters that are new as of CDH 5.2.0 or later do not have the problem.
Bug: HDFS-7575
Severity: Medium
Workaround: Upgrade to a later or patched version of CDH.
<property>
<name>hadoop.kms.authentication.token.validity</name>
<value>SOME VERY HIGH NUMBER</value>
</property>
• You can switch the KMS signature secret provider to the string secret provider by adding the following
property to the kms-site.xml Safety Valve:
<property>
<name>hadoop.kms.authentication.signature.secret</name>
<value>SOME VERY SECRET STRING</value>
</property>
• SOLR-7141 - RecoveryStrategy: Raise time that we wait for any updates from the leader before they saw
the recovery state to have finished.
• SQOOP-1764 - Numeric Overflow when getting extent map
• IMPALA-1658: Add compatibility flag for Hive-Parquet-Timestamps
• IMPALA-1794: Fix infinite loop opening/closing file w/ invalid metadata
• IMPALA-1801: external-data-source-executor leaking global jni refs
Workaround: Use Impala instead; Impala handles Parquet schema evolution correctly.
• MAPREDUCE-6169 - MergeQueue should release reference to the current item from key and value at the
end of the iteration to save memory.
• HBASE-11794 - StripeStoreFlusher causes NullPointerException
• HBASE-12077 - FilterLists create many ArrayList$Itr objects per row.
• HBASE-12386 - Replication gets stuck following a transient zookeeper error to remote peer cluster
• HBASE-11979 - Compaction progress reporting is wrong
• HBASE-12529 - Use ThreadLocalRandom for RandomQueueBalancer
• HBASE-12445 - hbase is removing all remaining cells immediately after the cell marked with marker =
KeyValue.Type.DeleteColumn via PUT
• HBASE-12460 - Moving Chore to hbase-common module.
• HBASE-12366 - Add login code to HBase Canary tool.
• HBASE-12447 - Add support for setTimeRange for RowCounter and CellCounter
• HIVE-9330 - DummyTxnManager will throw NPE if WriteEntity writeType has not been set
• HIVE-9199 - Excessive exclusive lock used in some DDLs with DummyTxnManager
• HIVE-6835 - Reading of partitioned Avro data fails if partition schema does not match table schema
• HIVE-6978 - beeline always exits with 0 status, should exit with non-zero status on error
• HIVE-8891 - Another possible cause to NucleusObjectNotFoundException from drops/rollback
• HIVE-8874 - Error Accessing HBase from Hive via Oozie on Kerberos 5.0.1 cluster
• HIVE-8916 - Handle user@domain username under LDAP authentication
• HIVE-8889 - JDBC Driver ResultSet.getXXXXXX(String columnLabel) methods Broken
• HIVE-9445 - Revert HIVE-5700 - enforce single date format for partition column storage
• HIVE-5454 - HCatalog runs a partition listing with an empty filter
• HIVE-8784 - Querying partition does not work with JDO enabled against PostgreSQL
• HUE-2484 - [beeswax] Configure support for Hive Server2 LDAP authentication
• HUE-2102 - [oozie] Workflow with credentials can't be used with Coordinator
• HUE-2152 - [pig] Credentials support in editor
• HUE-2472 - [impala] Stabilize result retrieval
• HUE-2406 - [search] New dashboard page has a margin problem
• HUE-2373 - [search] Heatmap can break
• HUE-2395 - [search] Broken widget in Solr Apache logs example
• HUE-2414 - [search] Timeline chart breaks when there's no extraSeries defined
• HUE-2342 - [impala] SSL encryption
• HUE-2426 - [pig] Dashboard gives a 500 error
• HUE-2430 - [pig] Progress bars of running scripts not updated on Dashboard
• HUE-2411 - [useradmin] Lazy load user and group list in permission sharing popup
• HUE-2398 - [fb] Drag and Drop hover message should not appear when elements originating in DOM are
dragged
• HUE-2401 - [search] Visually report selected and excluded values for ranges too
• HUE-2389 - [impala] Expand results table after the results are added to datatables
• HUE-2360 - [sentry] Sometimes Groups are not loaded we see the input box instead
• IMPALA-1453 - Fix many bugs with HS2 FETCH_FIRST
• IMPALA-1623 - unix_timestamp() does not return correct time
• IMPALA-1606 - Impala does not always give short name to Llama
• IMPALA-1475 - accept unmangled native UDF symbols
• OOZIE-2102 - Streaming actions are broken cause of incorrect method signature
• PARQUET-145 - InternalParquetRecordReader.close() should not throw an exception if initialization has
failed
• PARQUET-140 - Allow clients to control the GenericData object that is used to read Avro records
• PIG-4330 - Regression test for PIG-3584 - AvroStorage does not correctly translate arrays of strings
• PIG-3584 - AvroStorage does not correctly translate arrays of strings
• SOLR-5515 - NPE when getting stats on date field with empty result on solrcloud
Published Known Issues Fixed
As a result of the above fixes, the following issues, previously published as Known Issues in CDH 5 on page 78,
are also fixed.
— Upgrading a PostgreSQL Hive Metastore from Hive 0.12 to Hive 0.13 may result in a corrupt metastore
HIVE-5700 introduced a serious bug into the Hive Metastore upgrade scripts. This bug affects users who have
a PostgreSQL Hive Metastore and have at least one table which is partitioned by date and the value is stored
as a date type (not string).
Bug: HIVE-5700
Severity: High
Workaround: None. Do not upgrade your PostgreSQL metastore to version 0.13 if you satisfy the condition
stated above.
DataNodes may become unresponsive to block creation requests
DataNodes may become unresponsive to block creation requests from clients when the directory scanner is
running.
Bug: HDFS-7489
Severity: Low
Workaround: Disable the directory scanner by setting dfs.datanode.directoryscan.interval to -1.
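For example, add the following to hdfs-site.xml:
<property>
<name>dfs.datanode.directoryscan.interval</name>
<value>-1</value>
</property>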
Apache Oozie
Using cron-like syntax for Coordinator frequencies can result in duplicate actions
One out of every throttle actions will be a duplicate. For example, if the throttle is set to 5, every fifth action
will be a duplicate.
Bug: OOZIE-2063
Severity: High
Workaround: If possible, use the older syntax to specify an equivalent frequency.
With Search for CDH 5.2 and later, Kerberos name rules are followed.
Bug: None.
Severity: Medium.
Workaround: None.
Apache HBase
— Sending a large amount of invalid data to the Thrift service can cause it to crash
Bug: HBASE-11052.
Severity: High
Workaround: None. This is a longstanding problem, not a new issue in CDH 5.1.
— The metric ageOfLastShippedOp never decreases
This can cause it to appear as though the cluster is in an inconsistent state even when there is no problem.
Bug: HBASE-11143.
Severity: High
Workaround: None.
Apache Oozie
— Oozie HA does not work properly with HCatalog integration or SLA notifications
This issue appears when you are using HCatalog as a data dependency in a coordinator; using HCatalog from
an action (for example, Pig) works correctly.
Bug: OOZIE-1492
Severity: Medium
Workaround: None
Apache Oozie
— When Oozie is configured to use MRv1 and SSL, YARN / MRv2 libraries are erroneously included in the
classpath instead
This problem causes much of the configured Oozie functionality to be unusable.
Bug: None
Severity: Medium
Workaround: Use a different configuration (non-SSL or YARN), if possible.
Found 21 items
ls: Invalid value for webhdfs parameter "op": No enum const class
org.apache.hadoop.hdfs.web.resources.GetOpParam.Op.GETACLSTATUS
Bug: HDFS-6326
Severity: Medium
Workaround: None; note that this is fixed as of CDH 5.0.2.
Apache HBase
— Endless Compaction Loop
If an empty HFile whose max timestamp is past its TTL (time-to-live) is selected for compaction, it is compacted
into another empty HFile, which is selected for compaction, creating an endless compaction loop.
Bug: HBASE-10371
Severity: Medium
Workaround: None
Bug: HIVE-6375
Severity: Medium
Workaround: Follow up a CREATE TABLE query with an INSERT OVERWRITE TABLE SELECT * query, as in
the example below.
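A sketch with hypothetical table names (the column list is illustrative):
CREATE TABLE new_tbl (col1 INT, col2 STRING);
INSERT OVERWRITE TABLE new_tbl SELECT * FROM old_tbl;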
Apache Oozie
— The oozie-workflow-0.4.5 schema has been removed
Workflows using schema 0.4.5 will no longer be accepted by Oozie because this schema definition version has
been removed.
Bug: OOZIE-1768
Severity: Low
Workaround: Use schema 0.5. It's backwards compatible with 0.4.5, so updating the workflow is as simple as
changing the schema version number.
— Cannot browse filesystem via NameNode Web UI if any directory has the sticky bit set
When listing any directory that contains an entry with the sticky bit permission set (for example, /tmp is often
set this way), nothing appears where the list of files or directories should be.
Bug: HDFS-5921
Severity: Low
Workaround: Use the Hue File Browser.
— Appending to a file that has been snapshotted previously will append to the snapshotted file as well
If you append content to a file that exists in a snapshot, the file in the snapshot will have the same content
appended to it, invalidating the original snapshot.
Bug: See also HDFS-5343
Severity: High
Workaround: None
MapReduce
— In MRv2 (YARN), the JobHistory Server has no information about a job if the ApplicationMaster fails while the
job is running
Bug: None
Severity: Medium
Workaround: None.
Apache HBase
— An empty rowkey is treated as the first row of a table
An empty rowkey is allowed in HBase, but it was treated as the first row of the table, even if it was not in fact
the first row. Also, multiple rows with empty rowkeys caused issues.
Bug: HBASE-3170
Severity: High
Workaround: Do not use empty rowkeys.
Apache Hive
— Hive queries that combine multiple splits and query large tables fail on YARN
Hive queries that scan large tables or perform map-side joins may fail with the following exception when the
query is run using YARN:
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:425)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
Bug: MAPREDUCE-5186
Severity: High
Workaround: Set mapreduce.job.max.split.locations to a high value such as 100.
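For example, add the following to mapred-site.xml:
<property>
<name>mapreduce.job.max.split.locations</name>
<value>100</value>
</property>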
— Files in Avro tables no longer have .avro extension
As of CDH 4.3.0 Hive no longer creates files in Avro tables with the .avro extension by default. This does not
cause any problems in Hive, but could affect downstream components such as Pig, MapReduce, or Sqoop 1 that
expect files with the .avro extension.
Bug: None
Severity: Low
Workaround: Manually set the extension to .avro before using a query that inserts data into your Avro table.
Use the following set statement:
set hive.output.file.extension=".avro";
Apache Oozie
— Oozie does not work seamlessly with ResourceManager HA
Oozie workflows are not recovered on ResourceManager failover when ResourceManager HA is enabled. Furthermore,
users cannot specify the clusterId for JobTracker to work against either ResourceManager.
Bug: None
Severity: Medium
Workaround: On non-secure clusters, users are required to specify either of the ResourceManagers' host:port.
For secure clusters, users are required to specify the Active ResourceManager's host:port.
— When using Oozie HA with security enabled, some znodes have world ACLs
Oozie High Availability with security enabled will still work, but a malicious user or program can alter znodes
used by Oozie for locking, possibly causing Oozie to be unable to finish processing certain jobs.
Bug: OOZIE-1608
Severity: Medium
Workaround: None
— Oozie and Sqoop 2 may need additional configuration to work with YARN
In CDH 5, MRv2 (YARN) MapReduce 2.0 is recommended over the Hadoop 0.20-based MRv1. The default
configuration may not reflect this in Oozie and Sqoop 2 in CDH 5 Beta 2, however, unless you are using Cloudera
Manager.
Bug: None
Severity: Low
Workaround: Check the value of CATALINA_BASE in /etc/oozie/conf/oozie-env.sh (if you are running an
Oozie server) and /etc/default/sqoop2-server (if you are using a Sqoop 2 server). You should also ensure
that CATALINA_BASE is correctly set in your environment if you are invoking /usr/bin/sqoop2-server directly
instead of using the service init scripts. For Oozie, CATALINA_BASE should be set to
/usr/lib/oozie/oozie-server for YARN, or /usr/lib/oozie/oozie-server-0.20 for MRv1. For Sqoop 2,
CATALINA_BASE should be set to /usr/lib/sqoop2/sqoop-server for YARN, or
/usr/lib/sqoop2/sqoop-server-0.20 for MRv1.
Cloudera Search
— Creating cores using the web UI with default values causes the system to become unresponsive
You can use the Solr Server web UI to create new cores. If you click Create Core without making any changes to
the default attributes, the server may become unresponsive. Checking the log for the server shows a repeated
error that begins:
Bug: SOLR-5813
Severity: Medium
Workaround: To avoid this issue, do not create cores without first updating values for the new core in the web
UI. For example, you might enter a new name for the core to be created.
If you created a core with default settings and are seeing this error, you can address the problem by finding
which node is having problems and removing that node. Find the problematic node by using a tool that can
inspect ZooKeeper, such as the Solr Admin UI. Using such a tool, examine items in the ZooKeeper queue, reviewing
the properties for the item. The problematic node will have an item in its queue with the property collection="".
Remove the node with the item with the collection="" property using a ZooKeeper management tool. For
example, you can remove nodes using the ZooKeeper command line tool or recent versions of HUE.
Workaround: Set keep.failed.task.files to true, which will sidestep the memory leak but require job staging
directories to be cleaned out manually.
Hue
— Running a Hive Beeswax metastore on the same host as the Hue server will result in Simple Authentication
and Security Layer (SASL) authentication failures on a Kerberos-enabled cluster
Bug: None
Severity: Medium
Workaround: The simple workaround is to run the metastore server remotely on a different host and make sure
all Hive and Hue configurations properly refer to it. A more complex workaround is to adjust network configurations
to ensure that reverse DNS properly resolves the host's address to its fully qualified-domain name (FQDN) rather
than localhost.
— The Pig shell does not work when NameNode uses a wildcard address
The Pig shell does not work from Hue if you use a wildcard for the NameNode's RPC or HTTP bind address. For
example, dfs.namenode.http-address must be a real, routable address and port, not 0.0.0.0:<port>.
Bug: HUE-1060
Severity: Medium
Workaround: Use a real, routable address and port, not 0.0.0.0:<port>, for the NameNode; or use the Pig application
directly, rather than from Hue.
Apache Sqoop
— Oozie and Sqoop 2 may need additional configuration to work with YARN
In CDH 5, MRv2 (YARN) MapReduce 2.0 is recommended over the Hadoop 0.20-based MRv1. The default
configuration may not reflect this in Oozie and Sqoop 2 in CDH 5 Beta 2, however, unless you are using Cloudera
Manager.
Bug: None
Severity: Low
Workaround: Check the value of CATALINA_BASE in /etc/oozie/conf/oozie-env.sh (if you are running an
Oozie server) and /etc/default/sqoop2-server (if you are using a Sqoop 2 server). You should also ensure
that CATALINA_BASE is correctly set in your environment if you are invoking /usr/bin/sqoop2-server directly
instead of using the service init scripts. For Oozie, CATALINA_BASE should be set to
/usr/lib/oozie/oozie-server for YARN, or /usr/lib/oozie/oozie-server-0.20 for MRv1. For Sqoop 2,
CATALINA_BASE should be set to /usr/lib/sqoop2/sqoop-server for YARN, or
/usr/lib/sqoop2/sqoop-server-0.20 on MRv1.
Note: The Impala 2.2.x maintenance releases now use the CDH 5.4.x numbering system rather than
increasing the Impala version numbers. Impala 2.2 and higher are not available under CDH 4.
For the full list of fixed issues, see the CDH 5.4.1 release notes.
Issues Fixed in the 2.2.0 Release / CDH 5.4.0
This section lists the most frequently encountered customer issues fixed in Impala 2.2.0.
For the full list of fixed issues in Impala 2.2.0, including over 40 critical issues, see this report in the JIRA system.
Note: Impala 2.2.0 is available as part of CDH 5.4.0 and is not available for CDH 4. Cloudera does not
intend to release future versions of Impala for CDH 4 beyond patch and maintenance releases, if
required. Given the upcoming end of maintenance for CDH 4, Cloudera recommends that all customers
migrate to a recent CDH 5 release.
Altering a column's type causes column stats to stop sticking for that column
When the type of a column was changed in either Hive or Impala through ALTER TABLE CHANGE COLUMN, the
metastore database did not correctly propagate that change to the table that contains the column statistics.
The statistics (particularly the NDV) for that column were permanently reset and could not be changed by Impala's
COMPUTE STATS command. The underlying cause is a Hive bug (HIVE-9866).
Bug: IMPALA-1607
Severity: Major
Resolution: Resolved by incorporating the fix for HIVE-9866.
Workaround: On systems without the corresponding Hive fix, change the column back to its original type. The
stats reappear and you can recompute or drop them.
Impala may leak or use too many file descriptors
If a file was truncated in HDFS without a corresponding REFRESH in Impala, Impala could allocate memory for
file descriptors and not free that memory.
Bug: IMPALA-1854
Severity: High
Spurious stale block locality messages
Impala could issue messages stating the block locality metadata was stale, when the metadata was actually
fine. The internal “remote bytes read” counter was not being reset properly. This issue did not cause an actual
slowdown in query execution, but the spurious error could result in unnecessary debugging work and unnecessary
use of the INVALIDATE METADATA statement.
Bug: IMPALA-1712
Severity: High
DROP TABLE fails after COMPUTE STATS and ALTER TABLE RENAME to a different database.
When a table was moved from one database to another, the column statistics were not pointed to the new
database. This could result in lower performance for queries due to unavailable statistics, and also an inability
to drop the table.
Bug: IMPALA-1711
Severity: High
IMPALA-1556 causes memory leak with secure connections
impalad daemons could experience a memory leak on clusters using Kerberos authentication, with memory
usage growing as more data is transferred across the secure channel, either to the client program or between
Impala nodes. The same issue affected LDAP-secured clusters to a lesser degree, because the LDAP security
only covers data transferred back to client programs.
Bug: IMPALA-1674
Severity: High
unix_timestamp() does not return correct time
The unix_timestamp() function could return an incorrect value (a constant value of 1).
Bug: IMPALA-1623
Severity: High
Impala incorrectly handles text data missing a newline on the last line
Some queries did not recognize the final line of a text data file if the line did not end with a newline character.
This could lead to inconsistent results, such as a different number of rows for SELECT COUNT(*) as opposed
to SELECT *.
Bug: IMPALA-1476
Severity: High
Impala's ACL check did not consider all group ACLs, only the first one.
If the HDFS user ID associated with the impalad process had read or write access in HDFS based on group
membership, Impala statements could still fail with HDFS permission errors if that group was not the first listed
group for that user ID.
Bug: IMPALA-1805
Severity: High
Fix infinite loop opening or closing file with invalid metadata
Truncating a file in HDFS, after Impala had cached the file metadata, could produce a hang when Impala queried
a table containing that file.
Bug: IMPALA-1794
Severity: High
Cannot write Parquet files when values are larger than 64KB
Impala could sometimes fail to INSERT into a Parquet table if a column value such as a STRING was larger than
64 KB.
Bug: IMPALA-1705
Severity: High
Impala Will Not Run on Certain Intel CPUs
This fix relaxes the CPU requirement for Impala. Now only the SSSE3 instruction set is required. Formerly, SSE4.1
instructions were generated, making Impala refuse to start on some older CPUs.
Bug: IMPALA-1646
Severity: High
Note: Impala 2.1.3 is available as part of CDH 5.3.3, not under CDH 4.
Note: Impala 2.1.2 is available as part of CDH 5.3.2, not under CDH 4.
Impala incorrectly handles double numbers with more than 19 significant decimal digits
When a floating-point value was read from a text file and interpreted as a FLOAT or DOUBLE value, it could be
incorrectly interpreted if it included more than 19 significant digits.
Bug: IMPALA-1622
Severity: High
unix_timestamp() does not return correct time
The unix_timestamp() function could return an incorrect value (a constant value of 1).
Bug: IMPALA-1623
Severity: High
Row Count Mismatch: Partition pruning with NULL
A query against a partitioned table could return incorrect results if the WHERE clause compared the partition key
to NULL using operators such as = or !=.
Bug: IMPALA-1535
Severity: High
Fetch column stats in bulk using new (Hive 0.13) HMS APIs
The performance of the COMPUTE STATS statement and queries was improved, particularly for wide tables.
Bug: IMPALA-1120
Severity: High
Issues Fixed in the 2.1.1 Release / CDH 5.3.1
This section lists the most significant issues fixed in Impala 2.1.1.
For the full list of fixed issues in Impala 2.1.1, see this report in the JIRA system.
IMPALA-1556 causes memory leak with secure connections
impalad daemons could experience a memory leak on clusters using Kerberos authentication, with memory
usage growing as more data is transferred across the secure channel, either to the client program or between
Impala nodes. The same issue affected LDAP-secured clusters to a lesser degree, because the LDAP security
only covers data transferred back to client programs.
Bug: IMPALA-1674
Severity: High
Note: Impala 2.0.3 is available as part of CDH 5.2.4, not under CDH 4.
Note: Impala 2.0.2 is available as part of CDH 5.2.3, not under CDH 4.
Bug: IMPALA-1475
Severity: High
Issues Fixed in the 2.0.1 Release / CDH 5.2.1
This section lists the most significant issues fixed in Impala 2.0.1.
For the full list of fixed issues in Impala 2.0.1, see this report in the JIRA system.
Queries fail with metastore exception after upgrade and compute stats
After running the COMPUTE STATS statement on an Impala table, subsequent queries on that table could fail
with the exception message Failed to load metadata for table: default.stats_test.
Bug: IMPALA-1416
Severity: High
Workaround: Upgrading to CDH 5.2.1, or another level of CDH that includes the fix for HIVE-8627, prevents the
problem from affecting future COMPUTE STATS statements. On affected levels of CDH, or for Impala tables that
have become inaccessible, the workaround is to disable the hive.metastore.try.direct.sql setting in the
Hive metastore hive-site.xml file and issue the INVALIDATE METADATA statement for the affected table.
You do not need to rerun the COMPUTE STATS statement for the table.
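For example, the hive-site.xml entry to disable the setting is:
<property>
<name>hive.metastore.try.direct.sql</name>
<value>false</value>
</property>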
Issues Fixed in the 2.0.0 Release / CDH 5.2.0
This section lists the most significant issues fixed in Impala 2.0.0.
For the full list of fixed issues in Impala 2.0.0, see this report in the JIRA system.
Join Hint is dropped when used inside a view
Hints specified within a view query did not take effect when the view was queried, leading to slow performance.
As part of this fix, Impala now supports hints embedded within comments.
Bug: IMPALA-995"
Severity: High
WHERE condition ignored in simple query with RIGHT JOIN
Potential wrong results for some types of queries.
Bug: IMPALA-1101"
Severity: High
Query with self joined table may produce incorrect results
Potential wrong results for some types of queries.
Bug: IMPALA-1102"
Severity: High
Incorrect plan after reordering predicates (inner join following outer join)
Potential wrong results for some types of queries.
Bug: IMPALA-1118"
Severity: High
Combining fragments with compatible data partitions can lead to incorrect results due to type incompatibilities
(missing casts).
Potential wrong results for some types of queries.
Bug: IMPALA-1123"
Severity: High
Bug: IMPALA-1121"
Severity: High
Allow creating Avro tables without column definitions. Allow COMPUTE STATS to always work on Impala-created
Avro tables.
Hive-created Avro tables with columns specified by a JSON file or literal could produce errors when queried in
Impala, and could not be used with the COMPUTE STATS statement. Now you can create such tables in Impala
to avoid such errors.
Bug: IMPALA-1104"
Severity: High
Ensure all webserver output is escaped
The Impala debug web UI did not properly encode all output.
Bug: IMPALA-1133"
Severity: High
Queries with union in inline view have empty resource requests
Certain queries could run without obeying the limits imposed by resource management.
Bug: IMPALA-1236"
Severity: High
Impala does not employ ACLs when checking path permissions for LOAD and INSERT
Certain INSERT and LOAD DATA statements could fail unnecessarily, if the target directories in HDFS had restrictive
HDFS permissions, but those permissions were overridden by HDFS extended ACLs.
Bug: IMPALA-1279"
Severity: High
Impala does not map principals to lowercase, affecting Sentry authorization
In a Kerberos environment, the principal name was not mapped to lowercase, causing issues when a user logged
in with an uppercase principal name and Sentry authorization was enabled.
Bug: IMPALA-1334"
Severity: High
Issues Fixed in the 1.4.4 Release / CDH 5.1.5
For the list of fixed issues, see Known Issues Fixed in CDH 5.1.5 in the CDH 5 Release Notes.
Note: Impala 1.4.4 is available as part of CDH 5.1.5, not under CDH 4.
Note: Impala 1.4.3 is available as part of CDH 5.1.4, and under CDH 4.
Note: Impala 1.4.1 is only available as part of CDH 5.1.2, not under CDH 4.
boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >
Severity: High
Impalad uses wrong string format when writing logs
Impala log files could contain internal error messages due to a problem formatting certain strings. The messages
consisted of a Java call stack starting with:
Severity: High
Update HS2 client API.
A downlevel version of the HiveServer2 API could cause difficulty retrieving the precision and scale of a DECIMAL
value.
Bug: IMPALA-1107
Severity: High
Impalad catalog updates can fail with error: "IllegalArgumentException: fromKey out of range" at
com.cloudera.impala.catalog.CatalogDeltaLog
The error in the title could occur following a DDL statement. This issue was discovered during internal testing
and has not been reported in customer environments.
Bug: IMPALA-1093
Severity: High
"Total" time counter does not capture all the network transmit time
The time for some network operations was not counted in the report of total time for a query, making it difficult
to diagnose network-related performance issues.
Bug: IMPALA-1131
Severity: High
Impala will crash when reading certain Avro files containing bytes data
Certain Avro fields for byte data could cause Impala to be unable to read an Avro data file, even if the field was
not part of the Impala table definition. With this fix, Impala can now read these Avro data files, although Impala
queries cannot refer to the “bytes” fields.
Bug: IMPALA-1149
Severity: High
Support specifying a custom AuthorizationProvider in Impala
The --authorization_policy_provider_class option for impalad was added back. This option specifies a
custom AuthorizationProvider class rather than the default HadoopGroupAuthorizationProvider. It had
been used for internal testing, then removed in Impala 1.4.0, but it was considered useful by some customers.
Bug: IMPALA-1142
Severity: High
Issues Fixed in the 1.4.0 Release / CDH 5.1.0
This section lists the most significant issues fixed in Impala 1.4.0.
For the full list of fixed issues in Impala 1.4.0, see this report in the JIRA system.
Failed DCHECK in disk-io-mgr-reader-context.cc:174
The serious error in the title could occur, with the supplemental message:
The issue was due to the use of HDFS caching with data files accessed by Impala. Support for HDFS caching in
Impala was introduced in Impala 1.4.0 for CDH 5.1.0. The fix for this issue was backported to Impala 1.3.x, and
is the only change in Impala 1.3.2 for CDH 5.0.4.
Bug: IMPALA-1019
Severity: High
Workaround: On CDH 5.0.x, upgrade to CDH 5.0.4 with Impala 1.3.2, where this issue is fixed. In Impala 1.3.0 or
1.3.1 on CDH 5.0.x, do not use HDFS caching for Impala data files in Impala internal or external tables. If some
of these data files are cached (for example because they are used by other components that take advantage of
HDFS caching), set the query option DISABLE_CACHED_READS=true. To set that option for all Impala queries
across all sessions, start impalad with the -default_query_options option and include this setting in the
option argument, or on a cluster managed by Cloudera Manager, fill in this option setting on the Impala Daemon
options page.
Resolution: This issue is fixed in Impala 1.3.2 for CDH 5.0.4. The addition of HDFS caching support in Impala 1.4
means that this issue does not apply to any new level of Impala on CDH 5.
impala-shell only works with ASCII characters
The impala-shell interpreter could encounter errors processing SQL statements containing non-ASCII characters.
Bug: IMPALA-489
Severity: High
The extended view definition SQL text in Views created by Impala should always have fully-qualified table names
When a view was accessed while inside a different database, references to tables were not resolved unless the
names were fully qualified when the view was created.
Bug: IMPALA-962
Severity: High
Impala forgets about partitions with non-existent locations
If an ALTER TABLE specified a non-existent HDFS location for a partition, afterwards Impala would not be able
to access the partition at all.
Bug: IMPALA-741
Severity: High
CREATE TABLE LIKE fails if source is a view
The CREATE TABLE LIKE clause was enhanced to be able to create a table with the same column definitions
as a view. The resulting table is a text table unless the STORED AS clause is specified, because a view does not
have an associated file format to inherit.
Bug: IMPALA-834
Severity: High
Improve partition pruning time
Operations on tables with many partitions could be slow due to the time needed to evaluate which partitions were
affected. The partition pruning code was sped up substantially.
Bug: IMPALA-887
Severity: High
Improve compute stats performance
The performance of the COMPUTE STATS statement was improved substantially. The efficiency of its internal
operations was improved, and some statistics are no longer gathered because they are not currently used for
planning Impala queries.
Bug: IMPALA-1003
Severity: High
When I run CREATE TABLE new_table LIKE avro_table, the schema does not get mapped properly from an avro
schema to a hive schema
After a CREATE TABLE LIKE statement using an Avro table as the source, the new table could have incorrect
metadata and be inaccessible, depending on how the original Avro table was created.
Bug: IMPALA-185
Severity: High
Race condition in IoMgr. Blocked ranges enqueued after cancel.
Impala could encounter a serious error after a query was cancelled.
Bug: IMPALA-1046
Severity: High
Deadlock in scan node
A deadlock condition could make all impalad daemons hang, making the cluster unresponsive for Impala queries.
Bug: IMPALA-1083
Severity: High
Issues Fixed in the 1.3.3 Release / CDH 5.0.5
Impala 1.3.3 includes fixes to address what is known as the POODLE vulnerability in SSLv3. SSLv3 access is
disabled in the Impala debug web UI.
Note: Impala 1.3.3 is only available as part of CDH 5.0.5, not under CDH 4.
The issue was due to the use of HDFS caching with data files accessed by Impala. Support for HDFS caching in
Impala was introduced in Impala 1.4.0 for CDH 5.1.0. The fix for this issue was backported to Impala 1.3.x, and
is the only change in Impala 1.3.2 for CDH 5.0.4.
Bug: IMPALA-1019
Severity: High
Workaround: On CDH 5.0.x, upgrade to CDH 5.0.4 with Impala 1.3.2, where this issue is fixed. In Impala 1.3.0 or
1.3.1 on CDH 5.0.x, do not use HDFS caching for Impala data files in Impala internal or external tables. If some
of these data files are cached (for example because they are used by other components that take advantage of
HDFS caching), set the query option DISABLE_CACHED_READS=true. To set that option for all Impala queries
across all sessions, start impalad with the -default_query_options option and include this setting in the
option argument, or on a cluster managed by Cloudera Manager, fill in this option setting on the Impala Daemon
options page.
Resolution: This issue is fixed in Impala 1.3.2 for CDH 5.0.4. The addition of HDFS caching support in Impala 1.4
means that this issue does not apply to any new level of Impala on CDH 5.
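A minimal sketch of the workaround described above; the single-option form of the startup flag is an assumption
to verify for your release. In impala-shell, per session:
SET DISABLE_CACHED_READS=true;
At daemon startup, for all sessions:
impalad -default_query_options=DISABLE_CACHED_READS=true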
Issues Fixed in the 1.3.1 Release / CDH 5.0.3
This section lists the most significant issues fixed in Impala 1.3.1.
For the full list of fixed issues in Impala 1.3.1, see this report in the JIRA system. Because 1.3.1 is the first 1.3.x
release for CDH 4, if you are on CDH 4, also consult Issues Fixed in the 1.3.0 Release / CDH 5.0.0 on page 151.
Impalad crashes when left joining inline view that has aggregate using distinct
Impala could encounter a severe error in a query combining a left outer join with an inline view containing a
COUNT(DISTINCT) operation.
Bug: IMPALA-904
Severity: High
Incorrect result with group by query with null value in group by data
If the result of a GROUP BY operation is NULL, the resulting row might be omitted from the result set. This issue
depends on the data values and data types in the table.
Bug: IMPALA-901
Severity: High
Drop Function does not clear local library cache
When a UDF is dropped through the DROP FUNCTION statement, and then the UDF is re-created with a new .so
library or JAR file, the original version of the UDF is still used when the UDF is called from queries.
Bug: IMPALA-786
Severity: High
Workaround: Restart the impalad daemon on all nodes.
Compute stats doesn't propagate underlying error correctly
If a COMPUTE STATS statement encountered an error, the error message is “Query aborted” with no further
detail. Common reasons why a COMPUTE STATS statement might fail include network errors causing the
coordinator node to lose contact with other impalad instances, and column names that match Impala reserved
words. (Currently, if a column name is an Impala reserved word, COMPUTE STATS always returns an error.)
Bug: IMPALA-762
Severity: High
Inserts should respect changes in partition location
After an ALTER TABLE statement that changes the LOCATION property of a partition, a subsequent INSERT
statement would always use a path derived from the base data directory for the table.
Bug: IMPALA-624
Severity: High
Text data with carriage returns generates wrong results for count(*)
A COUNT(*) operation could return the wrong result for text tables using nul characters (ASCII value 0) as
delimiters.
Bug: IMPALA-13
Severity: High
Workaround: Impala adds support for ASCII 0 characters as delimiters through the clause FIELDS TERMINATED
BY '\0'.
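A minimal sketch of the clause, with a hypothetical table:
CREATE TABLE nul_delimited (c1 STRING, c2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0';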
IO Mgr should take instance memory limit into account when creating io buffers
Impala could allocate more memory than necessary during certain operations.
Bug: IMPALA-488
Severity: High
Workaround: Before issuing a COMPUTE STATS statement for a Parquet table, reduce the number of threads
used in that operation by issuing SET NUM_SCANNER_THREADS=2 in impala-shell. Then issue UNSET
NUM_SCANNER_THREADS before continuing with queries.
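In impala-shell, the workaround looks like this, with parquet_table standing in for your table:
SET NUM_SCANNER_THREADS=2;
COMPUTE STATS parquet_table;
UNSET NUM_SCANNER_THREADS;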
Impala should provide an option for new sub directories to automatically inherit the permissions of the parent
directory
When new subdirectories are created underneath a partitioned table by an INSERT statement, previously the
new subdirectories always used the default HDFS permissions for the impala user, which might not be suitable
for directories intended to be read and written by other components also.
Bug: IMPALA-827
Severity: High
Resolution: In Impala 1.3.1 and higher, you can specify the --insert_inherit_permissions startup option
when starting the impalad daemon.
Illegal state exception (or crash) in query with UNION in inline view
Impala could encounter a severe error in a query where the FROM list contains an inline view that includes a
UNION. The exact type of the error varies.
Bug: IMPALA-888
Severity: High
INSERT column reordering doesn't work with SELECT clause
The ability to specify a subset of columns in an INSERT statement, with order different than in the target table,
was not working as intended.
Bug: IMPALA-945
Severity: High
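A sketch of the intended behavior, with hypothetical tables t1 (columns c1, c2, c3) and t2:
-- Populates c3 and c1 only, in an order different from the table definition.
INSERT INTO t1 (c3, c1) SELECT x, y FROM t2;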
Issues Fixed in the 1.3.0 Release / CDH 5.0.0
This section lists the most significant issues fixed in Impala 1.3.0, primarily issues that could cause wrong results,
or cause problems running the COMPUTE STATS statement, which is very important for performance and
scalability.
For the full list of fixed issues, see this report in the JIRA system.
Inner join after right join may produce wrong results
The automatic join reordering optimization could incorrectly reorder queries with an outer join or semi join
followed by an inner join, producing incorrect results.
Bug: IMPALA-860
Severity: High
Workaround: Including the STRAIGHT_JOIN keyword in the query prevented the issue from occurring.
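A sketch of the workaround with hypothetical tables; STRAIGHT_JOIN makes Impala join the tables in the order
they are listed:
SELECT STRAIGHT_JOIN t1.c1, t3.c2
FROM t1 RIGHT OUTER JOIN t2 ON (t1.id = t2.id)
JOIN t3 ON (t2.id = t3.id);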
Incorrect results with codegen on multi-column group by with NULLs.
A query with a GROUP BY clause referencing multiple columns could introduce incorrect NULL values in some
columns of the result set. The incorrect NULL values could appear in rows where a different GROUP BY column
actually did return NULL.
Bug: IMPALA-850
Severity: High
Using distinct inside aggregate function may cause incorrect result when using having clause
A query could return incorrect results if it combined an aggregate function call, a DISTINCT operator, and a
HAVING clause, without a GROUP BY clause.
Bug: IMPALA-845
Severity: High
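The general shape of an affected query, with hypothetical names:
SELECT COUNT(DISTINCT c1) FROM t1 HAVING COUNT(DISTINCT c1) > 0;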
Aggregation on union inside (inline) view not distributed properly.
An aggregation query or a query with ORDER BY and LIMIT could be executed on a single node in some cases,
rather than distributed across the cluster. This issue affected queries whose FROM clause referenced an inline
view containing a UNION.
Bug: IMPALA-831
Severity: High
Wrong expression may be used in aggregate query if there are multiple similar expressions
If a GROUP BY query referenced the same columns multiple times using different operators, result rows could
contain multiple copies of the same expression.
Bug: IMPALA-817
Severity: High
Incorrect results when changing the order of aggregates in the select list with codegen enabled
Referencing the same columns in both a COUNT() and a SUM() call in the same query, or some other combinations
of aggregate function calls, could incorrectly return a result of 0 from one of the aggregate functions. This issue
affected references to TINYINT and SMALLINT columns, but not INT or BIGINT columns.
Bug: IMPALA-765
Severity: High
Workaround: Setting the query option DISABLE_CODEGEN=TRUE prevented the incorrect results. Switching the
order of the function calls could also prevent the issue from occurring.
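In impala-shell, the workaround was simply:
SET DISABLE_CODEGEN=TRUE;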
Union queries give wrong results in a UNION followed by SIGSEGV in another union
A UNION query could produce a wrong result, followed by a serious error for a subsequent UNION query.
Bug: IMPALA-723
Severity: High
String data in MR-produced parquet files may be read incorrectly
Impala could return incorrect string results when reading uncompressed Parquet data files containing multiple
row groups. This issue only affected Parquet data files produced by MapReduce jobs.
Bug: IMPALA-729
Severity: High
COMPUTE STATS needs to use quotes with identifiers that are Impala keywords
Using a column or table name that conflicted with Impala keywords could prevent running the COMPUTE STATS
statement for the table.
Bug: IMPALA-777
Severity: High
COMPUTE STATS child queries do not inherit parent query options.
The COMPUTE STATS statement did not use the setting of the MEM_LIMIT query option in impala-shell,
potentially causing problems gathering statistics for wide Parquet tables.
Bug: IMPALA-903
Severity: High
COMPUTE STATS should update partitions in batches
The COMPUTE STATS statement could be slow or encounter a timeout while analyzing a table with many partitions.
Bug: IMPALA-880
Severity: High
Fail early (in analysis) when COMPUTE STATS is run against Avro table with no columns
If the columns for an Avro table were all defined in the TBLPROPERTIES or SERDEPROPERTIES clauses, the
COMPUTE STATS statement would fail after completely analyzing the table, potentially causing a long delay.
Although the COMPUTE STATS statement still does not work for such tables, now the problem is detected and
reported immediately.
Bug: IMPALA-867
Severity: High
Workaround: Re-create the Avro table with columns defined in SQL style, using the output of SHOW CREATE
TABLE. (See the JIRA page for detailed steps.)
Severity: Medium
Anticipated Resolution: Fixed in Impala 1.2.2.
Workaround: In Impala 1.2.2 and higher, use the COMPUTE STATS statement to gather statistics for each table
involved in the join query, after data is loaded. Prior to Impala 1.2.2, modify the query, if possible, to join the
largest table first. For example:
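A hypothetical query of the recommended shape, with big_table standing in for the largest table:
SELECT small_table.c1, big_table.c2
FROM big_table JOIN small_table ON (big_table.id = small_table.id);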
Parquet in CDH 4.5 writes data files that are sometimes unreadable by Impala
Some Parquet files generated by other components could not be read by Impala.
Bug: IMPALA-694
Severity: High
Resolution: The underlying issue is being addressed by a fix in the CDH Parquet libraries. Impala 1.2.2 works
around the problem and reads the existing data files.
Deadlock in statestore when unregistering a subscriber and building a topic update
The statestore service could experience an internal error leading to a hang.
Bug: IMPALA-699
Severity: High
IllegalStateException when doing a union involving a group by
A UNION query where one side involved a GROUP BY operation could cause a serious error.
Bug: IMPALA-687
Severity: High
Impala Parquet Writer hit DCHECK in RleEncoder
A serious error could occur when doing an INSERT into a Parquet table.
Bug: IMPALA-689
Severity: High
Hive UDF jars cannot be loaded by the FE
If the JAR file for a Java-based Hive UDF was not in the CLASSPATH, the UDF could not be called during a query.
Bug: IMPALA-695
Severity: High
Views Sometimes Not Utilizing Partition Pruning
Certain combinations of clauses in a view definition for a partitioned table could result in inefficient performance
and incorrect results.
Bug: IMPALA-495
Severity: High
Update the serde name we write into the metastore for Parquet tables
The SerDe class string written into Parquet data files created by Impala was updated for compatibility with
Parquet support in Hive. See Incompatible Changes Introduced in Impala 1.1.1 on page 72 for the steps to update
older Parquet data files for Hive compatibility.
Bug: IMPALA-485
Severity: High
Selective queries over large tables produce unnecessary memory consumption
A query returning a small result set from a large table could tie up memory unnecessarily for the duration of
the query.
Bug: IMPALA-534
Severity: High
Impala stopped being able to query Avro tables
Queries against Avro tables could fail depending on whether the Avro schema URL was specified in the
TBLPROPERTIES or SERDEPROPERTIES field. The fix causes Impala to check both fields for the schema URL.
Bug: IMPALA-538
Severity: High
Impala continues to allocate more memory even though it has exceeded its mem-limit
Queries could allocate substantially more memory than specified in the impalad -mem_limit startup option.
The fix causes more frequent checking of the limit during query execution.
Bug: IMPALA-520
Severity: High
Issues Fixed in the 1.1.0 Release
This section lists the most significant issues fixed in Impala 1.1. For the full list of fixed issues, see this report
in the JIRA system.
10-20% perf regression for most queries across all table formats
This issue is due to a performance tradeoff between systems running many queries concurrently, and systems
running a single query. Systems running only a single query could experience lower performance than in early
beta releases. Systems running many queries simultaneously should experience higher performance than in
the beta releases.
Severity: High
planner fails with "Join requires at least one equality predicate between the two tables" when "from" table order
does not match "where" join order
A query could fail if it involved 3 or more tables and the last join table was specified as a subquery.
Bug: IMPALA-85
Severity: High
Bug: IMPALA-349
Severity: High
Resolution: Fixed
Double check release of JNI-allocated byte-strings
Operations involving calls to the Java JNI subsystem (for example, queries on HBase tables) could allocate memory
but not release it.
Bug: IMPALA-358
Severity: High
Resolution: Fixed
Impala returns 0 for bad time values in UNIX_TIMESTAMP, Hive returns NULL
Given an invalid time value, the UNIX_TIMESTAMP() function returned 0 in Impala, whereas Hive returns NULL
for the same input.
Bug: IMPALA-16
Severity: Low
Anticipated Resolution: Fixed
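A hypothetical illustration of the divergence, using an input string that does not match the expected timestamp
format:
-- Impala (before the fix) returned 0; Hive returns NULL.
SELECT UNIX_TIMESTAMP('not-a-valid-date');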
INSERT INTO TABLE SELECT <constant> does not work.
INSERT INTO TABLE SELECT <constant> would not insert any data and could return an error.
Severity: Low
Anticipated Resolution: Fixed
Severity: Critical
Resolution: Fixed
Ctrl-C sometimes interrupts shell in system call, rather than cancelling query
Pressing Ctrl-C in the impala-shell interpreter could sometimes display an error and return control to the
shell, making it impossible to cancel the query.
Bug: IMPALA-243
Severity: Critical
Resolution: Fixed
Empty string partition value causes metastore update failure
Specifying an empty string or NULL for a partition key in an INSERT statement would fail.
Bug: IMPALA-252
Severity: High
Resolution: Fixed. The behavior for empty partition keys was made more compatible with the corresponding
Hive behavior.
Round() does not output the right precision
The round() function did not always return the correct number of significant digits.
Bug: IMPALA-266
Severity: High
Resolution: Fixed
Cannot cast string literal to string
Casting from a string literal back to the same type would cause an “invalid type cast” error rather than leaving
the original value unchanged.
Bug: IMPALA-267
Severity: High
Resolution: Fixed
Excessive mem usage for certain queries which are very selective
Some queries that returned very few rows experienced unnecessary memory usage.
Bug: IMPALA-288
Severity: High
Resolution: Fixed
HdfsScanNode crashes in UpdateCounters
A serious error could occur for relatively small and inexpensive queries.
Bug: IMPALA-289
Severity: High
Resolution: Fixed
Parquet performance issues on large dataset
Certain aggregation queries against Parquet tables were inefficient due to lower than required thread utilization.
Bug: IMPALA-292
Severity: High
Resolution: Fixed
impala not populating hive metadata correctly for create table
The Impala CREATE TABLE command did not fill in the owner and tbl_type columns in the Hive metastore
database.
Bug: IMPALA-295
Severity: High
Resolution: Fixed. The metadata was made more Hive-compatible.
impala daemons die if statestore goes down
The impalad instances in a cluster could halt when the statestored process became unavailable.
Bug: IMPALA-312
Severity: High
Resolution: Fixed
Constant SELECT clauses do not work in subqueries
A subquery would fail if the SELECT statement inside it returned a constant value rather than querying a table.
Bug: IMPALA-67
Severity: High
Resolution: Fixed
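The shape of a formerly failing query, with hypothetical names:
SELECT t.c FROM (SELECT 1 AS c) t;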
Right outer Join includes NULLs as well and hence wrong result count
The result set from a right outer join query could include erroneous rows containing NULL values.
Bug: IMPALA-90
Severity: High
Resolution: Fixed
Parquet scanner hangs for some queries
The Parquet scanner could hang non-deterministically when executing some queries.
Bug: IMPALA-204
Severity: Medium
Resolution: Fixed
Issues Fixed in Version 0.7 of the Beta Release
Impala does not gracefully handle unsupported Hive table types (INDEX and VIEW tables)
When attempting to load metadata from an unsupported Hive table type (INDEX and VIEW tables), Impala fails
with an unclear error message.
Bug: IMPALA-167
Severity: Low
Resolution: Fixed in 0.7
DDL statements (CREATE/ALTER/DROP TABLE) are not supported in the Impala Beta Release
Severity: Medium
Resolution: Fixed in 0.7
Avro is not supported in the Impala Beta Release
Severity: Medium
Severity: Low
Resolution: Fixed in 0.6 - Impala reads the namenode location and port from the Hadoop configuration files,
though setting -nn and -nn_port overrides this. Users are advised not to set -nn or -nn_port.
Queries may fail on secure environment due to impalad Kerberos ticket expiration
Queries may fail on secure environments due to impalad Kerberos tickets expiring. This can happen if the Impala
-kerberos_reinit_interval flag is set to a value of ten minutes or less. This may lead to an impalad requesting
a ticket with a lifetime shorter than the time until the next ticket renewal.
Bug: IMPALA-64
Severity: Medium
Resolution: Fixed in 0.6
Concurrent queries may fail when Impala uses Thrift to communicate with the Hive Metastore
Concurrent queries may fail when Impala is using Thrift to communicate with parts of the Hive Metastore, such
as the Hive Metastore Service. In such a case, the error "get_fields failed: out of sequence response"
may occur because Impala shared a single Hive Metastore client connection across threads. With Impala 0.6, a
separate connection is used for each metadata request.
Bug: IMPALA-48
Severity: Low
Resolution: Fixed in 0.6
impalad fails to start if unable to connect to the Hive Metastore
Impala fails to start if it is unable to establish a connection with the Hive Metastore. This behavior was fixed,
allowing Impala to start, even when no Metastore is available.
Bug: IMPALA-58
Severity: Low
Resolution: Fixed in 0.6
Impala treats database names as case-sensitive in some contexts
In some queries (including "USE database" statements), database names are treated as case-sensitive. This
may lead queries to fail with an IllegalStateException.
Bug: IMPALA-44
Severity: Medium
Resolution: Fixed in 0.6
Impala does not ignore hidden HDFS files
Impala does not ignore hidden HDFS files, meaning those files prefixed with a period '.' or underscore '_'. This
diverges from Hive/MapReduce, which skips these files.
Bug: IMPALA-18
Severity: Low
Resolution: Fixed in 0.6
Issues Fixed in Version 0.5 of the Beta Release
Impala may have reduced performance on tables that contain a large number of partitions
Impala may have reduced performance on tables that contain a large number of partitions. This is due to extra
overhead reading/parsing the partition metadata.
Severity: High
Resolution: Fixed in 0.5
Backend client connections not getting cached causes an observable latency in secure clusters
Backend impalads do not cache connections to the coordinator. On a secure cluster, this introduces a latency
proportional to the number of backend clients involved in query execution, as the cost of establishing a secure
connection is much higher than in the non-secure case.
Bug: IMPALA-38
Severity: Medium
Resolution: Fixed in 0.5
Concurrent queries may fail with error: "Table object has not been been initialised : `PARTITIONS`"
Concurrent queries may fail with error: "Table object has not been been initialised : `PARTITIONS`".
This was due to a lack of locking in the Impala table/database metadata cache.
Bug: IMPALA-30
Severity: Medium
SELECT * FROM (SELECT sum(col1) FROM some_table GROUP BY col1) t1 JOIN other_table ON
(...);
Severity: Medium
Resolution: Fixed in 0.2
An insert with a limit that runs as more than one query fragment inserts more rows than the limit.
For example:
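A hypothetical statement of the affected shape, where the LIMIT is intended to cap the number of inserted rows:
INSERT INTO t1 SELECT c1 FROM t2 LIMIT 100;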
Severity: Medium
Resolution: Fixed in 0.2
Query with limit clause might fail.
For example:
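A hypothetical query of the affected shape:
SELECT c1 FROM t1 LIMIT 10;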
Severity: Medium
Resolution: Fixed in 0.2
Files in unsupported compression formats are read as plain text.
Attempting to read such files does not generate a diagnostic.
Severity: Medium
Resolution: Fixed in 0.2
Impala server raises a null pointer exception when running an HBase query.
When querying an HBase table whose row-key is string type, the Impala server may raise a null pointer exception.
Severity: Medium
Resolution: Fixed in 0.2
Note: Although there is a CDH 5.4.2 release, there is no synchronous Cloudera Manager 5.4.2 release.
– One-click differences in configuration settings for a specific service across multiple clusters.
• Support
– Include a Cloudera support ticket with YARN application support bundles.
– Reduce the size of support bundles by specifying log data of interest to include in the bundle.
• HDFS
– Support for HDFS DataNode hot swap.
– Option to include replication of extended attributes during HDFS replication. HDFS ACLs will now be
replicated along with permissions.
• Added support for Hive on Spark.
Important: Hive on Spark is included in CDH 5.4 but is not currently supported nor recommended
for production use. If you are interested in this feature, try it out in a test environment until we
address the issues and limitations needed for production-readiness.
• Security
– Secure impersonation support for the Hue HBase app.
– Redaction of sensitive data in log files and in SQL query history.
– Support for custom Kerberos principals.
– Added commands for regenerating Kerberos keytabs at service and host levels. These commands will
clear existing keytabs from affected role instances and then trigger the Generate Credentials command
to create new keytabs.
– Kerberos support for Sqoop 2.
– Kerberos and SSL/TLS support for Flume Thrift Source and Sink.
– Solr SSL/TLS support.
– Navigator Key Trustee Server can be installed and monitored by Cloudera Manager.
– HBase Indexer integration with Sentry (File-based) for authorization.
HDFS encryption implements transparent, end-to-end encryption of data read from and written to HDFS by
creating encryption zones. An encryption zone is a directory in HDFS with every file and subdirectory in it
encrypted. Use one of the following services to store, manage, and access encryption zone keys:
– KMS (File) - The Hadoop Key Management Server with a file-based Java keystore; maintains a single copy
of keys, using simple password-based protection.
– KMS (Navigator Key Trustee) - An enterprise-grade key management service that replaces the file-based
Java keystore and leverages the advanced key-management capabilities of Cloudera Navigator Key Trustee.
Navigator Key Trustee is designed for secure, authenticated administration and cryptographically strong
storage of keys on multiple redundant servers that can be located outside the cluster.
• The Cloudera Manager Server now reports the correct number of physical cores and hyper-threading cores
if hyper-threading is enabled.
• Client configurations - Client configurations are now managed so that they are redeployed when a machine
is re-imaged.
Important: The changes to client configurations affect some API calls, as follows:
• When a host ceases to have a client configuration assigned to it, Cloudera Manager will remove
it, rather than leaving it behind. If a host has a client configuration assigned and the client
configuration is missing, Cloudera Manager will recreate it.
• If you currently use the API command deployClientConfig to deploy the client configurations
for a particular service, and you pass a specific set of role names to this call to narrow the set
of hosts that receive the new client configuration, then you should be aware that:
– The API command will continue to generate and deploy the client configuration only to the
hosts that correspond to the specified role names.
– Any other hosts that previously had deployed client configurations, but do not have gateway
roles assigned to them, will have those client configurations removed from them. This is
the new behavior.
• The behavior of the cluster level deployClientConfig command, and calling the service level
command with no arguments, is unchanged. The command still deploys a new client
configuration to all hosts with roles corresponding to the specified service or cluster.
• As this change is due to internal functional changes inside CM, it is not restricted to any new
API level. The deployClientConfig command in all API levels is affected.
• Configuration
– NameNode configuration - The decommissioning parameters dfs.namenode.replication.max-streams
and dfs.namenode.replication.max-streams-hard-limit are now available.
– Hue debug options - Two service-level configuration parameters have been added to the Hue service to
enable Django debug mode and debugging of internal server error responses.
Note: Although there is a CDH 5.2.3 release, there is no synchronous Cloudera Manager 5.2.3 release.
Note: Cloudera provides the following two solutions for data at rest encryption:
• Navigator Encrypt - is production ready and available for Cloudera customers licensed for
Cloudera Navigator. Navigator Encrypt operates at the Linux volume level, so it can encrypt
cluster data inside and outside HDFS. Talk to your Cloudera account team for more information
about this capability.
• HDFS Encryption - included in CDH 5.2.0, operates at the HDFS folder level, enabling encryption
to be applied only to HDFS folders where needed. This feature has several known limitations.
Therefore, Cloudera does not currently support this feature in CDH 5.2, and it is not recommended
for production use. If you are interested in trying the feature out, upgrade to the latest version
of CDH 5.
HDFS now implements transparent, end-to-end encryption of data read from and written to
HDFS by creating encryption zones. An encryption zone is a directory in HDFS with all of its
contents, that is, every file and subdirectory in it, encrypted. You can use either the KMS or the
Key Trustee service to store, manage, and access encryption zone keys.
• HBase - Support for configuring hedged reads has been added for HBase. The default configuration is to turn
hedged reads off. Cloudera Manager will emit two properties, dfs.client.hedged.read.threadpool.size
(default: 0) and dfs.client.hedged.read.threshold.millis (default: 500ms) to hbase-site.xml.
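As a sketch, the emitted entries in hbase-site.xml would take the following form (values shown are the stated
defaults):
<property>
<name>dfs.client.hedged.read.threadpool.size</name>
<value>0</value>
</property>
<property>
<name>dfs.client.hedged.read.threshold.millis</name>
<value>500</value>
</property>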
• ZooKeeper - the RMI port can be configured. The port is configured using the JDK7 flag
-Dcom.sun.management.jmxremote.rmi.port. The default value is set to be same as the JMX Agent port.
Also, a special value of 0 or -1 disables the setting and a random port is used. The configuration has no effect
on versions lower than Oracle JDK 7u4.
• Cloudera Manager Agent configuration
– The supervisord port can now be configured in the Agent configuration supervisord_port. The change
takes effect the next time supervisord is restarted (not simply when the Agent is restarted).
– Added an Agent configuration local_filesystem_whitelist that allows configuring the list of local
filesystems that should always be monitored.
• Proxy user configuration
– All services' proxy user configuration properties have been moved to the HDFS service. Other services
running on the cluster inherit the configuration values provided in HDFS. If you have previously configured
a service to have values different from those configured in HDFS, then the proxy user configuration
properties will be moved to that service's Advanced Configuration Snippet (Safety Valve) for core-site.xml
to retain existing behavior.
Oozie and Solr are exceptions to this. Oozie proxy user configuration properties have been moved to Oozie
Server Advanced Configuration Snippet (Safety Valve) for oozie-site.xml if they differ from HDFS. Solr
proxy user configuration properties have been moved to Solr Service Environment Advanced Configuration
Snippet (Safety Valve) if they differ from HDFS.
• Resource management - YARN and Llama integrated resource management and Llama high availability
wizard.
• New and changed user roles - BDR Administrator, Cluster Administrator, Navigator Administrator, and User
Administrator. The Administrator role has been renamed Full Administrator.
• Configuration UI
– Cluster-wide configuration - you can view all modified settings and configure log directories, disk space
thresholds, and port settings.
– New configuration layout - the new layout provides an alternate way to view configuration pages. In the
classic layout, pages are organized by role group and categories within the role groups. The new layout
allows you to filter on configuration status, category, and scope. On each configuration page you can
easily switch between the classic and new layout.
Important: The classic layout is the default. All the configuration procedures described in the
Cloudera Manager documentation assume the classic layout.
Important: Cloudera Manager 5.1.0 is no longer available for download from the Cloudera website or
from archive.cloudera.com due to the JCE policy file issue described in the Fixed Issues in Cloudera Manager
5.1.1 section of the Release Notes. The download URL at archive.cloudera.com for Cloudera
Manager 5.1.0 now forwards to Cloudera Manager 5.1.1 for the RPM-based distributions for Linux
RHEL and SLES.
• SSL Encryption
– Supports several new SSL-related configuration parameters for HDFS, MapReduce, YARN and HBase,
which allow you to configure and enable encrypted shuffle and encrypted web UIs for these services.
– Cloudera Manager now also supports the monitoring of HDFS, MapReduce, YARN, and HBase when SSL
is enabled for these services. New configuration parameters allow you to specify the location and password
of the truststore used to verify certificates in HTTPS communication with CDH services and the Cloudera
Manager Server.
• Sentry Service
– A new Sentry service that stores the authorization metadata in an underlying relational database and
allows you to use Grant/Revoke statements to modify privileges.
– You can also configure the Sentry service to allow Pig, MapReduce, and WebHCat queries access to
Sentry-secured data stored in Hive.
• Kerberos Authentication
– Now supports a Kerberos cluster using an Active Directory KDC.
– New wizard to enable Kerberos on an existing cluster. The wizard works with both MIT KDC and Active
Directory KDC.
– Ability to configure and deploy Kerberos client configuration (krb5.conf) on a cluster.
• Spark Service - added the History Server role
• Impala - added support for Llama ApplicationMaster High Availability
• User Roles - there are two new roles, Operator and Configurator, that support fine-grained access to Cloudera
Manager features.
• Monitoring
– Updates to Oozie monitoring
– New Hive metastore canary
• UI - The UI has been updated to improve scalability. The Home page Status tab can be configured to display
clusters in a full or summary format. There is a new Cluster page for each cluster. The Hosts and Instances
pages have added faceted filters.
What's New in Cloudera Manager 5.0.6
A number of issues have been fixed. See Fixed Issues in Cloudera Manager 5.0.6 on page 202.
Important: Because triggers are a new and evolving feature, backward compatibility between
releases is not guaranteed at this time.
– Charting improvements
– New table chart type
– New options for displaying data and metadata from charts
– Support for exporting data from charts to CSV or JSON files
• Administrative Settings
– Added a new role type with limited administrator capabilities.
– Cloudera Manager Server and all JVMs will create a heap dump if they run out of memory.
– Configure the location of the parcel directory and specify whether and when to remove old parcels from
cluster hosts.
• JDK Version - Cloudera Manager 5 supports and installs both JDK 6 and JDK 7.
• Resource Management
– Static and dynamic partitioning of resources: provides a wizard for configuring static partitioning of
resources (cgroups) across core services (HBase, HDFS, MapReduce, Solr, YARN) and dynamic allocation
of resources for YARN and Impala.
– Pool, resource group, and queue administration for YARN and Impala.
– Usage monitoring and trending.
• Monitoring
– YARN service monitoring
– YARN (MRv2) job monitoring
– Configurable histograms of Impala query and YARN job attributes that can be used to quickly filter query
and application lists
– Scalable back-end database for monitoring metrics
– Charting improvements
– New chart types: histogram and heatmap
– New scale types: logarithmic and power
– Updates to tsquery language: new attribute values to support YARN and new functions to support
new chart types
• Extensibility
– Ability to manage both ISV applications and non-CDH services (for example, Accumulo, Spark, and so on)
– Working with select ISVs as part of Beta 1
• Single Sign-On - Support for SAML to enable single sign-on
• Parcels
– Dependency enforcement to ensure incompatible parcels are not used together
– Option to not cache downloaded parcels, to save disk space
– Improved error reporting for management operations
– If the number of waiting map tasks exceeds 50% of the total map slots available, health becomes concerning.
– If the number of waiting reduce tasks exceeds 50% of the total reduce slots available, health becomes concerning.
IF (select waiting_reduces / reduce_slots where roleType=JOBTRACKER and
serviceName=$SERVICENAME and last(waiting_reduces / reduce_slots) > 50) DO
health:concerning
– FAILOVERCONTROLLER_FILE_DESCRIPTOR
– FAILOVERCONTROLLER_HOST_HEALTH
– FAILOVERCONTROLLER_LOG_DIRECTORY_FREE_SPACE
– FAILOVERCONTROLLER_SCM_HEALTH
– FAILOVERCONTROLLER_UNEXPECTED_EXITS
and
– HDFS_FAILOVERCONTROLLER_FILE_DESCRIPTOR
– HDFS_FAILOVERCONTROLLER_HOST_HEALTH
– HDFS_FAILOVERCONTROLLER_LOG_DIRECTORY_FREE_SPACE
– HDFS_FAILOVERCONTROLLER_SCM_HEALTH
– HDFS_FAILOVERCONTROLLER_UNEXPECTED_EXITS
The reason for the change is to better distinguish between MapReduce and HDFS failover controller monitoring
in the health system.
– mapreduce.job.jvm.numtasks
– The following YARN configuration parameters were replaced. Only the YARN parameters were replaced.
Old configurations will be lost, but they never had any effect so this does not affect functionality.
– mapreduce.jobtracker.restart.recover replaced by
yarn.resourcemanager.recovery.enabled (changed from Gateway to ResourceManager)
– mapreduce.tasktracker.http.threads replaced by mapreduce.shuffle.max.connections
– mapreduce.jobtracker.staging.root.dir replaced by yarn.app.mapreduce.am.staging-dir
– Cloudera Manager 5 sets the default YARN Resource Scheduler to FairScheduler. If a cluster was
previously running YARN with the FIFO scheduler, it will be changed to FairScheduler the next time
YARN restarts. The FairScheduler is only supported with CDH 4.2.1 and later, and older clusters may
hit failures and need to manually change the scheduler to FIFO or CapacityScheduler. See the Known
Issues section of this Release Note for information on how to change the scheduler back to FIFO or
CapacityScheduler.
Note: Rolling upgrade is not supported between CDH 4 and CDH 5. Rolling upgrade will also not be
supported from CDH 5.0.0 Beta 2 to any later releases, and may not be supported between any future
beta versions of CDH 5 and the General Availability release of CDH 5.
• Impala - The Impala Daemon now supports the Impala Maximum Log Files property which specifies the total
number of log files per severity level that should be retained before they are deleted. By default, after upgrading
to CDH 5.4 this property is set to 10, which means that Impala Daemons will only retain up to 10 log files for
each severity level. Any additional files will be deleted.
• HBase - Moved three settings for HBase coprocessors from Main to Advanced category:
– 'Service Wide > HBase Coprocessor Abort on Error': moved to 'Service Wide > Advanced > HBase Coprocessor
Abort on Error'
– 'Master Default Group > HBase Coprocessor Master Classes': moved to 'Master Default Group > Advanced
> HBase Coprocessor Master Classes'
– 'RegionServer Default Group > HBase Coprocessor Region Classes': moved to 'RegionServer Default Group
> Advanced > HBase Coprocessor Region Classes'
Cloudera Manager automatically executes this command during Solr service startup. If this command fails, the
Solr service startup continues without reporting errors, despite the resulting incorrect SSL configuration.
Workaround: If Solr service startup completes without properly configuring urlScheme, set the property manually
by invoking the previously described Solr REST API call.
Backup and Disaster Recovery replication does not set MapReduce Java options
Replication used for backup and disaster recovery relies on system-wide MapReduce memory options, and you
cannot configure the options using the Advanced Configuration Snippet.
Agent fails when retrieving log files with very long messages
When searching or retrieving large log files using the Agent, the Agent may consume nearly 100% CPU until it is
restarted. This can also happen when the collect host statistics command is issued.
One way this can happen is when the Hive hive.log.explain.output property is set to its default value of
true: very large messages containing EXPLAIN output can cause the Cloudera Manager Agent to hang or become
unstable. In this case, the workaround is to set the hive.log.explain.output property to false.
New Sentry Synchronization Path Prefixes added in NameNode configuration are not enforced
correctly
Any new path prefixes added in the NameNode configuration are not correctly enforced by Sentry. The ACLs are
initially set correctly, but they are reset to the old default after some time interval.
Workaround: Set the following property in Sentry Service Advanced Configuration Snippet (Safety Valve) and
Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml:
<property>
<name>sentry.hdfs.integration.path.prefixes</name>
<value>/user/hive/warehouse, ADDITIONAL_DATA_PATHS</value>
</property>
where ADDITIONAL_DATA_PATHS is a comma-separated list of HDFS paths where Hive data will be stored. The
value should be the same value as sentry.authorization-provider.hdfs-path-prefixes set in the
hdfs-site.xml on the NameNode.
Kafka 1.2 CSD conflicts with CSD included in Cloudera Manager 5.4
If the Kafka CSD was installed in Cloudera Manager 5.3 or lower, the old version must be uninstalled; otherwise
it will conflict with the version of the Kafka CSD bundled with Cloudera Manager 5.4.
Workaround: Remove the Kafka 1.2 CSD before upgrading Cloudera Manager to 5.4:
1. Determine the location of the CSD directory:
a. Select Administration > Settings.
b. Click the Custom Service Descriptors category.
c. Retrieve the directory from the Local Descriptor Repository Path property.
2. Delete the Kafka CSD from the directory.
Cloudera Manager doesn't correctly generate client configurations for services deployed using
CSDs
CSDs that depend on Spark receive an incomplete Spark client configuration. For example, HiveServer2 requires
a Spark on YARN gateway on the same host for Hive on Spark to work, and you must redeploy Spark client
configurations whenever they change so that HiveServer2 picks up the change. Note that Cloudera Manager does
not ship with any such CSDs by default.
Workaround: Use /etc/spark/conf for Spark configuration, and ensure there is a Spark on YARN gateway on
that host.
Cloudera Manager 5.3.1 upgrade fails if Spark standalone and Kerberos are configured
CDH upgrade fails if Kerberos is enabled and Spark standalone is installed. Spark standalone doesn't work in a
kerberized cluster.
Workaround: To upgrade, remove the Spark standalone service first, and then proceed with the upgrade.
KMS and Key Trustee ACLs do not work in Cloudera Manager 5.3
ACLs configured for the KMS (File) and KMS (Navigator Key Trustee) services do not work since these services
do not receive the values for hadoop.security.group.mapping and related group mapping configuration
properties.
Workaround:
KMS (File): Add all configuration properties starting with hadoop.security.group.mapping from the NameNode
core-site.xml to the KMS (File) property, Key Management Server Advanced Configuration Snippet (Safety
Valve) for core-site.xml.
KMS (Navigator Key Trustee): Add all configuration properties starting with hadoop.security.group.mapping
from the NameNode core-site.xml to the KMS (Navigator Key Trustee) property, Key Management Server
Proxy Advanced Configuration Snippet (Safety Valve) for core-site.xml.
Exporting and importing Hue database sometimes times out after 90 seconds
Executing 'dump database' or 'load database' of Hue from Cloudera Manager returns "command aborted because
of exception: Command timed-out after 90 seconds". The Hue database can be exported to JSON from within
Cloudera Manager. Unfortunately, sometimes the Hue database is quite large and the export times out after 90
seconds.
Workaround: Ignore the timeout. The command should eventually succeed even though Cloudera Manager
reports that it timed out.
Changing hostname of key trustee server requires editing the keytrustee.conf file
If you change the hostname of your primary or backup server, you will need to edit your keytrustee.conf file.
This issue typically arises if you replace a primary or backup server with a server having a different hostname.
If the same hostname is used on the new server, there will be no issues.
Workaround: Use the same hostname on the replacement server.
Hosts with Impala Llama roles must also have at least one YARN role
When integrated resource management is enabled for Impala, host(s) where the Impala Llama role(s) are running
must have at least one YARN role. This is because Llama requires the topology.py script from the YARN
configuration. If this requirement is not met, you may see errors when the Llama role starts.
The high availability wizard does not verify that there is a running ZooKeeper service
If one of the following is true:
• ZooKeeper is present but not running, and the HDFS dependency on ZooKeeper is not set
• ZooKeeper is absent
then the enable high-availability wizard fails.
Workaround: Before enabling high availability, do the following:
1. Create and start a ZooKeeper service if one doesn't exist.
2. Go to the HDFS service.
3. Click the Configuration tab.
4. Select Scope > Service-Wide.
5. Locate the ZooKeeper Service property or search for it by typing its name in the Search box. Select the
ZooKeeper service you created.
6. Click Save Changes to commit the changes.
Cloudera Manager Installation Path A fails on RHEL 5.7 due to PostgreSQL conflict
On RHEL 5.7, cloudera-manager-installer.bin fails due to a PostgreSQL conflict if PostgreSQL 8.1 is already
installed on your host.
Workaround: Remove PostgreSQL from host and rerun cloudera-manager-installer.bin.
Cloudera Management Service roles fail to start after upgrade to Cloudera Manager
If you have enabled TLS security for the Cloudera Manager Admin Console before upgrading to Cloudera Manager,
after the upgrade, the Cloudera Management Service roles will try to communicate with Cloudera Manager using
TLS and will fail to start unless the following SSL properties have been configured.
Hence, if you have the following property enabled in Cloudera Manager, use the workaround below to allow the
Cloudera Management Service roles to communicate with Cloudera Manager.
Property: Use TLS Encryption for Admin Console
Description: Select this option to enable TLS encryption between the Server and the user's web browser.
Workaround:
1. Open the Cloudera Manager Admin Console and navigate to the Cloudera Management Service.
2. Click Configuration.
3. In the Search field, type SSL to show the SSL properties (found under the Service-Wide > Security category).
4. Edit the following SSL properties according to your cluster configuration.
Property: SSL Client Truststore File Location
Description: Path to the client truststore file used in HTTPS communication. The contents of this truststore
can be modified without restarting the Cloudera Management Service roles. By default, changes to its contents
are picked up within ten seconds.
Property: SSL Client Truststore File Password
Description: Password for the client truststore file.
Accumulo 1.6 service log aggregation and search does not work
Cloudera Manager log aggregation and search features are incompatible with the log formatting needed by the
Accumulo Monitor. Attempting to use either the "Log Search" diagnostics feature or the log file link off of an
individual service role's summary page will result in empty search results.
Severity: High
Workaround: Operators can use the Accumulo Monitor to see recent severe log messages. They can see recent
log messages below the WARNING level via a given role's process page and can inspect full logs on individual
hosts by looking in /var/log/accumulo.
Cloudera Manager incorrectly sizes Accumulo Tablet Server max heap size after 1.4.4-cdh4.5.0
to 1.6.0-cdh4.6.0 upgrade
Because the upgrade path from Accumulo 1.4.4-cdh4.5.0 to 1.6.0-cdh4.6.0 involves having both services installed
simultaneously, Cloudera Manager will be under the impression that worker hosts in the cluster are oversubscribed
on memory and attempt to downsize the max heap size allowed for 1.6.0-cdh4.6.0 Tablet Servers.
Severity: High
Workaround: Manually verify that the Accumulo 1.6.0-cdh4.6.0 Tablet Server max heap size is large enough for
your needs. Cloudera recommends you set this value to the sum of 1.4.4-cdh4.5.0 Tablet Server and Logger
heap sizes.
Accumulo installations using LZO do not indicate dependence on the GPL Extras parcel
Accumulo 1.6 installations that use LZO compression functionality do not indicate that LZO depends on the GPL
Extras parcel. When Accumulo is configured to use LZO, Cloudera Manager has no way to track that the Accumulo
service now relies on the GPL Extras parcel. This prevents Cloudera Manager from warning administrators before
they remove the parcel while Accumulo still requires it for proper operation.
Workaround: Check your Accumulo 1.6 service for the configuration changes mentioned in the Cloudera
documentation for using Accumulo with CDH prior to removing the GPL Extras parcel. If the parcel is mistakenly
removed, reinstall it and restart the Accumulo 1.6 service.
Created pools are not preserved when Dynamic Resource Pools page is used to configure
YARN or Impala
Pools created on demand are not preserved when changes are made using the Dynamic Resource Pools page.
If the Dynamic Resource Pools page is used to configure YARN and/or Impala services in a cluster, it is possible
to specify pool placement rules that create a pool if one does not already exist. If changes are made to the
configuration using this page, pools created as a result of such rules are not preserved across the configuration
change.
Workaround: Submit the YARN application or Impala query as before, and the pool will be created on demand
once again.
User should be prompted to add the AMON role when adding MapReduce to a CDH 5 cluster
When the MapReduce service is added to a CDH 5 cluster, the user is not asked to add the AMON role. Then, an
error displays when the user tries to view MapReduce activities.
Workaround: Manually add the AMON role after adding the MapReduce service.
Enterprise license expiration alert not displayed until Cloudera Manager Server is restarted
When an enterprise license expires, the expiration notification banner is not displayed until the Cloudera Manager
Server has been restarted. The enterprise features of Cloudera Manager are not affected by an expired license.
Workaround: None.
The HDFS command Roll Edits does not work in the UI when HDFS is federated
The HDFS command Roll Edits does not work in the Cloudera Manager UI when HDFS is federated because the
command doesn't know which nameservice to use.
Workaround: Use the API, not the Cloudera Manager UI, to execute the Roll Edits command.
Cloudera Manager reports a confusing version number if you have oozie-client, but not oozie
installed on a CDH 4.4 node
In CDH versions before 4.4, the metadata identifying Oozie was placed in the client package, rather than the
server package. Consequently, if the client package is not installed but the server package is, Cloudera Manager
reports that Oozie is present but as coming from CDH 3 instead of CDH 4.
Workaround: Either install the oozie-client package, or upgrade to at least CDH 4.4. Parcel-based installations
are unaffected.
On CDH 4.1 secure clusters managed by Cloudera Manager 4.8.1 and higher, the Impala Catalog
server needs advanced configuration snippet update
Impala queries fail on CDH 4.1 when Hive "Bypass Hive Metastore Server" option is selected.
Workaround: Add the following to Impala catalog server advanced configuration snippet for hive-site.xml,
replacing Hive_Metastore_Server_Host with the host name of your Hive Metastore Server:
<property>
<name>hive.metastore.local</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://Hive_Metastore_Server_Host:9083</value>
</property>
Error reading .zip file created with the Collect Diagnostic Data command.
After collecting Diagnostic Data and using the Download Diagnostic Data button to download the created zip
file to the local system, the zip file cannot be opened using the Firefox browser on a Macintosh. This is because
the zip file is created as a Zip64 file, and the unzip utility included with Macs does not support Zip64. The unzip
utility must be version 6.0 or later. You can determine the unzip version with unzip -v.
Workaround: Update the unzip utility to a version that supports Zip64.
After JobTracker failover, complete jobs from the previous active JobTracker are not visible.
When a JobTracker failover occurs and a new JobTracker becomes active, the new JobTracker UI does not show
the completed jobs from the previously active JobTracker (that is now the standby JobTracker). For these jobs
the "Job Details" link does not work.
Severity: Medium
Workaround: None.
After JobTracker failover, information about rerun jobs is not updated in Activity Monitor.
When a JobTracker failover occurs while there are running jobs, jobs are restarted by the new active JobTracker
by default. For the restarted jobs the Activity Monitor will not update the following: 1) The start time of the
restarted job will remain the start time of the original job. 2) Any Map or Reduce task that had finished before
the failure happened will not be updated with information about the corresponding task that was rerun by the
new active JobTracker.
Severity: Medium
Workaround: None.
<property>
<name>dfs.namenode.edits.dir</name>
<value>qjournal://jn1HostName:8485;jn2HostName:8485;jn3HostName:8485/journalhdfs1,file:///dfs/edits</value>
</property>
Changing the rack configuration may temporarily cause mis-replicated blocks to be reported.
A rack re-configuration will cause HDFS to report mis-replicated blocks until HDFS rebalances the system, which
may take some time. This is a normal side-effect of changing the configuration.
Severity: Low
Workaround: None
Restoring snapshot of a file to an empty directory does not overwrite the directory
Restoring the snapshot of an HDFS file to an HDFS path that is an empty HDFS directory (using the Restore As
action) results in the restored file being placed inside the HDFS directory instead of overwriting the empty
HDFS directory.
Workaround: None.
These messages do not correspond to actual validation warnings and can be ignored. However, some validations
normally performed are skipped when this spurious warning is generated, and should be done manually.
Specifically, if Hue's authentication mechanism is set to LDAP, the following configuration should be validated:
1. The Hue LDAP URL property must be set.
2. For CDH 4.4 and lower, set one (but not both) of the following two Hue properties: NT Domain or LDAP
Username Pattern.
3. For CDH 4.5 and higher, if the Hue property Use Search Bind Authentication is selected, exactly one of the
two Hue properties NT Domain and LDAP Username Pattern must be set, as described in step 2 above.
Logging of command unavailable message improved
When a command is unavailable, the error messages are now more descriptive.
Client configuration logs no longer deleted by the Agent
If the Agent fails to deploy a new client configuration, the client log file is no longer deleted by the agent. The
Agent saves the log file and appends new log entries to the saved log file.
HDFS role migration requires certain HDFS roles to be running
Before using the Migrate Roles wizard to migrate HDFS roles, you must ensure that the following HDFS roles
are running as described:
• A majority of the JournalNodes in the JournalNode quorum must be running. With a quorum size of three
JournalNodes, for example, at least two JournalNodes must be running. The JournalNode on the source host
need not be running, as long as a majority of all JournalNodes are running.
• When migrating a NameNode and co-located Failover Controller, the other Failover Controller (that is, the
one that is not on the source host) must be running. This is true whether or not a co-located JournalNode is
being migrated as well, in addition to the NameNode and Failover Controller.
• When migrating a JournalNode by itself, at least one NameNode / Failover Controller co-located pair must
be running.
HDFS role migration requires automatic failover to be enabled
Migration of HDFS NameNode, JournalNode, and Failover Controller roles through the Migrate Roles wizard is
only supported when HDFS automatic failover is enabled. Otherwise, it causes a state in which both NameNodes
are in standby mode.
HDFS/Hive replication fails when replicating to target cluster that runs CDH 4 and has Kerberos enabled
Workaround: None.
The Cloudera Manager Agent now sets the file descriptor ulimit correctly on Ubuntu
During upgrade, bootstrapping the standby NameNode step no longer fails with standby NameNode connection
refused when connecting to active NameNode
Deploy krb5.conf now also deploys it on hosts with Cloudera Management Service roles
Cloudera Manager allows upgrades to unknown CDH maintenance releases
Cloudera Manager 5.3.0 supports any CDH release less than or equal to 5.3, even if the release did not exist
when Cloudera Manager 5.3.0 was released. For packages, you cannot currently use the upgrade wizard to
upgrade to such a release. This release adds a custom CDH field for the package case, where you can type in a
version that did not exist at the time of the Cloudera Manager release.
impalad memory limit units error in EnableLlamaRMCommand
The EnableLlamaRMCommand sets the value of the impalad memory limit to equal the NM container memory
value. But the latter is in MB, and the former is in bytes. Previously, the command did not perform the conversion;
this has been fixed.
Running MapReduce v2 jobs are now visible using the Application Master view
In the Application view, selecting Application Master for a MRv2 job previously resulted in no action.
Deleting services no longer results in foreign key constraint exceptions
The Cloudera Manager Server log previously showed several foreign key constraint exceptions that were associated
with deleted services. This has been fixed.
HiveServer2 keystore and LDAP group mapping passwords are no longer exposed in client configuration files
The HiveServer2 keystore password and LDAP group mapping passwords were emitted into the client
configuration files. This exposed the passwords in plain text in a world-readable file. This has been fixed.
A cross-site scripting vulnerability in Cloudera Management Service web UIs was fixed
The high availability wizard now sets the HDFS dependency on ZooKeeper
Workaround: Before enabling high availability, do the following:
1. Create and start a ZooKeeper service if one does not exist.
2. Go to the HDFS service.
3. Click the Configuration tab.
4. Select HDFS Service-Wide.
5. Select Category > Main.
6. Locate the ZooKeeper Service property or search for it by typing its name in the Search box. Select the
ZooKeeper service you created.
If more than one role group applies to this configuration, edit the value for the appropriate role group. See
Modifying Configuration Properties.
7. Click Save Changes to commit the changes.
BDR no longer assumes superuser is common if clusters have the same realm
If source and destination clusters are in the same Kerberos realm, Cloudera Manager assumed that the superuser
of the destination cluster was also the superuser on the source cluster. However, HDFS can be configured so that
this is not the case.
Spark and Spark (standalone) services fail to start if you upgrade to CDH 5.2.x parcels from an older CDH package
Spark and Spark standalone services fail to start if you upgrade to CDH 5.2.x parcels from an older CDH package.
Workaround: After upgrading the rest of the services, uninstall the old CDH packages, and then start the Spark
service.
Fixed MapReduce Usage by User reports when using an Oracle database backend
Setting the default umask in HDFS fails in new configuration layout
Setting the default umask in the HDFS configuration section to 002 in the new configuration layout displays an
error:"Could not parse: Default Umask : Could not parse parameter 'dfs_umaskmode'. Was
expecting an octal value with a leading 0. Input: 2", preventing the change from being submitted.
Workaround:
1. Add the following to ResourceManager Advanced Configuration Snippet for yarn-site.xml, replacing
MAX_ATTEMPTS with the desired maximum number of attempts:
<property>
<name>yarn.resourcemanager.am.max-attempts</name>
<value>MAX_ATTEMPTS</value>
</property>
SPARK_HISTORY_OPTS=-Dspark.history.kerberos.enabled=true \
-Dspark.history.kerberos.principal=principal \
-Dspark.history.kerberos.keytab=keytab
where principal is the name of the Kerberos principal to use for the History Server, and keytab is the path to the
principal's keytab file on the local filesystem of the host running the History Server.
Hive replication issue with TLS enabled
Hive replication will fail when the source Cloudera Manager instance has TLS enabled, even though the required
certificates have been added to the target Cloudera Manager's trust store.
Workaround: Add the required Certificate Authority or self-signed certificates to the default Java trust store,
which is typically a copy of the cacerts file named jssecacerts in the $JAVA_HOME/jre/lib/security/ directory
of your installed JDK. Use keytool to import your private CA certificates into the jssecacerts file.
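For example, a command of the following form imports a CA certificate into the trust store; the alias and
certificate path shown here are illustrative, and changeit is the default trust store password unless you
have changed it:
keytool -importcert -alias exampleCA -file /path/to/ca_cert.pem \
-keystore $JAVA_HOME/jre/lib/security/jssecacerts -storepass changeit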
The Spark Upload Jar command fails in a secure cluster
The Spark Upload Jar command fails in a secure cluster.
Workaround: To run Spark on YARN, manually upload the Spark assembly jar to HDFS /user/spark/share/lib.
The Spark assembly jar is located on the local filesystem, typically in /usr/lib/spark/assembly/lib or
/opt/cloudera/parcels/CDH/lib/spark/assembly/lib.
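As a sketch, assuming a parcel installation and the hdfs superuser (the exact assembly jar file name varies
by release):
sudo -u hdfs hdfs dfs -mkdir -p /user/spark/share/lib
sudo -u hdfs hdfs dfs -put /opt/cloudera/parcels/CDH/lib/spark/assembly/lib/spark-assembly-*.jar /user/spark/share/lib/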
Clients of the JobHistory Server Admin Interface Require Advanced Configuration Snippet
Clients of the JobHistory server administrative interface, such as the mapred hsadmin tool, may fail to connect
to the server when run on hosts other than the one where the JobHistory server is running.
Workaround: Add the following to both the MapReduce Client Advanced Configuration Snippet for mapred-site.xml
and the Cluster-wide Advanced Configuration Snippet for core-site.xml, replacing JOBHISTORY_SERVER_HOST
with the hostname of your JobHistory server:
<property>
<name>mapreduce.history.admin.address</name>
<value>JOBHISTORY_SERVER_HOST:10033</value>
</property>
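After the configuration is redeployed, a remote client invocation such as the following should connect; the
-getGroups subcommand and the username are used here only as an illustrative connectivity check:
mapred hsadmin -getGroups alice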
Could not find a healthy host with CDH 5 on it to create HiveServer2 error during upgrade
When upgrading from CDH 4 to CDH 5, if no parcel is active, the error message "Could not find a healthy
host with CDH5 on it to create HiveServer2" is displayed. This can happen when transitioning from packages
to parcels, or if you explicitly deactivate the CDH 4 parcel (which is not necessary) before upgrading.
Workaround: Wait 30 seconds and retry the upgrade.
AWS installation wizard requires Java 7u45 to be installed on Cloudera Manager Server host
Cloudera Manager 5.1 installs Java 7u55 by default. However, the AWS installation wizard does not work with
Java 7u55 due to a bug in the jClouds version packaged with Cloudera Manager.
Workaround:
1. Stop the Cloudera Manager Server.
Workaround: Do not select the Install Java Unlimited Strength Encryption Policy Files checkbox during the
aforementioned wizards. Instead, download and install the policy files manually, following the instructions
on Oracle's website (a sample extraction command follows the links below):
• JDK 7 Instructions:
http://www.oracle.com/technetwork/java/javase/downloads/jce-7-download-432124.html
• JDK 8 Instructions:
http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html
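As a sketch for JDK 7, assuming the policy archive has been downloaded under the file name Oracle distributes
(confirm the name of the file you downloaded), the policy jars can be extracted over the existing ones:
unzip -j -o UnlimitedJCEPolicyJDK7.zip -d $JAVA_HOME/jre/lib/security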
Note: To return to the default limited strength files, reinstall the original Oracle rpm:
• yum - yum reinstall jdk
• zypper - zypper in -f jdk
• rpm - rpm -iv --replacepkgs filename, where filename is jdk-7u65-linux-x64.rpm or jdk-8u11-linux-x64.rpm
Important: Cloudera Manager 5.1.0 is no longer available for download from the Cloudera website or
from archive.cloudera.com due to the JCE policy file issue described in the Issues Fixed in Cloudera
Manager 5.1.1 on page 200 section of the Release Notes. The download URL at archive.cloudera.com
for Cloudera Manager 5.1.0 now forwards to Cloudera Manager 5.1.1 for the RPM-based distributions
for Linux RHEL and SLES.
Changes to the yarn.nodemanager.remote-app-log-dir property are not included in the JobHistory Server
yarn-site.xml and Gateway yarn-site.xml
When "Remote App Log Directory" is changed in the YARN configuration, the
yarn.nodemanager.remote-app-log-dir property is not included in the JobHistory Server yarn-site.xml and
Gateway yarn-site.xml.
Workaround: Set JobHistory Server Advanced Configuration Snippet (Safety Valve) for yarn-site.xml and YARN
Client Advanced Configuration Snippet (Safety Valve) for yarn-site.xml to:
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/path/to/logs</value>
</property>
Secure CDH 4.1 clusters can't have Hue and Impala share the same Hive
In a secure CDH 4.1 cluster, Hue and Impala cannot share the same Hive instance. If "Bypass Hive Metastore
Server" is disabled on the Hive service, Hue cannot talk to Hive. Conversely, if "Bypass Hive Metastore
Server" is enabled on the Hive service, Impala has a validation error.
Severity: High
Workaround: Upgrade to CDH 4.2.
The command history has an option to select the number of commands, but doesn't always return the number
you request
Workaround: None.
Hue doesn't support YARN ResourceManager High Availability
Workaround: Configure the Hue Server to point to the active ResourceManager:
1. Go to the Hue service.
2. Click the Configuration tab.
3. Select Scope > Hue or Hue Service-Wide.
4. Select Category > Advanced.
5. Locate the Hue Server Advanced Configuration Snippet (Safety Valve) for hue_safety_valve_server.ini property
or search for it by typing its name in the Search box.
6. In the Hue Server Advanced Configuration Snippet for hue_safety_valve_server.ini field, add the following:
[hadoop]
[[yarn_clusters]]
[[[default]]]
resourcemanager_host=<hostname of active ResourceManager>
resourcemanager_api_url=http://<hostname of active ResourceManager>:<web port of active ResourceManager>
proxy_api_url=http://<hostname of active ResourceManager>:<web port of active ResourceManager>
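For example, with a hypothetical active ResourceManager on rm1.example.com and the default ResourceManager
web port of 8088:
[hadoop]
[[yarn_clusters]]
[[[default]]]
resourcemanager_host=rm1.example.com
resourcemanager_api_url=http://rm1.example.com:8088
proxy_api_url=http://rm1.example.com:8088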
HIVE_AUX_JARS_PATH=""
AUX_CLASSPATH=/usr/share/java/mysql-connector-java.jar:/usr/share/java/oracle-connector-java.jar:$(find /usr/share/cmf/lib/postgresql-jdbc.jar 2> /dev/null | tail -n 1)
[desktop]
[[ldap]]
search_bind_authentication=false
Erroneous warning displayed on the HBase configuration page on CDH 4.1 in Cloudera Manager 5.0.0
An erroneous "Failed parameter validation" warning is displayed on the HBase configuration page for CDH 4.1
in Cloudera Manager 5.0.0.
Severity: Low
Workaround: Use CDH 4.2 or higher, or ignore the warning.
Host recommissioning and decommissioning should occur independently
In large clusters, when problems appear with a host or role, administrators may choose to decommission the
host or role to fix it and then recommission it to put it back into production. Decommissioning, especially
host decommissioning, is slow, hence the importance of parallelization, so that recommissioning can be
initiated before decommissioning completes.
Severity: High
Workarounds and caveats:
• On Red Hat and similar systems, make sure rpcbind-0.2.0-10.el6 or later is installed.
• On SLES, Debian, and Ubuntu systems, do one of the following:
– Install CDH using Cloudera Manager; or
– As of CDH 5.1, start the NFS gateway as root; or
– Start the NFS gateway without using packages; or
– You can use the gateway by running rpcbind in insecure mode, using the -i option, but keep in mind
that this allows anyone from a remote host to bind to the portmap.
4. Add the following to the Oozie SchemaService Workflow Extension Schemas property:
shell-action-0.2.xsd
hive-action-0.3.xsd
sqoop-action-0.3.xsd
[groups]
HTTP = HTTP
[roles]
HTTP = collection = admin->action=query
3. Copy this file to the location specified by the "Sentry Global Policy File" for Solr. The associated
configuration name for this location is sentry.solr.provider.resource; you can see the current value by
navigating to the Sentry subcategory in the service-wide configuration editing workflow in the Cloudera
Manager UI. The default value for this entry is /user/solr/sentry/sentry-provider.ini, which refers to a
path in HDFS.
4. Check whether the parent directories are present in HDFS (example commands for steps 4 through 7 follow
this list).
5. Create the appropriate parent directories if they are not present.
6. After ensuring the parent directory is present, copy the file created in step 2 to this location.
7. Ensure that the file is owned by and readable by the solr user (the user the Solr Server runs as).
8. Restart the Solr service. If both Kerberos and Sentry are being enabled for Solr, the Cloudera Management
Service roles must also be restarted. The Solr Server liveness health checks should clear once the Service
Monitor (SMON) has had a chance to contact the servers and retrieve metrics.
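As a sketch of steps 4 through 7, assuming the default /user/solr/sentry/sentry-provider.ini location, a
local policy file named sentry-provider.ini, and the hdfs superuser:
sudo -u hdfs hdfs dfs -ls /user/solr
sudo -u hdfs hdfs dfs -mkdir -p /user/solr/sentry
sudo -u hdfs hdfs dfs -put sentry-provider.ini /user/solr/sentry/
sudo -u hdfs hdfs dfs -chown solr /user/solr/sentry/sentry-provider.ini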
Out-of-memory errors may occur when using the Reports Manager
Out-of-memory errors may occur when using the Cloudera Manager Reports Manager.
Workaround: Set the value of the "Java Heap Size of Reports Manager" property to at least the size of the HDFS
filesystem image (fsimage) and restart the Reports Manager.
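For example, to gauge the required heap, check the size of the fsimage files in the NameNode data directory;
the /dfs/nn path shown here is the Cloudera Manager default, so confirm your configured value first:
hdfs getconf -confKey dfs.namenode.name.dir
ls -lh /dfs/nn/current/fsimage_*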
#################################################################
# Settings to configure Impala
#################################################################
[impala]
....
# Turn on/off impersonation mechanism when talking to Impala
## impersonation_enabled=False
Cloudera Manager Server may fail to start when upgrading using a PostgreSQL database.
If you're upgrading to Cloudera Manager 5.0.0 beta 1 and you're using a PostgreSQL database, the Cloudera
Manager Server may fail to start with a message similar to the following:
Workaround: Use psql to connect directly to the server's database and issue the following SQL command:
alter table REVISIONS alter column MESSAGE type varchar(1048576);
After that, your Cloudera Manager server should start up normally.
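For example, assuming the embedded PostgreSQL database with its default port, database name, and user
(adjust these for your deployment):
psql -h localhost -p 7432 -U scm -d scm -c "alter table REVISIONS alter column MESSAGE type varchar(1048576);"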
Note: Although there is a CDH 5.4.2 release, there is no synchronous Cloudera Navigator 2.3.2 release.
Important: Spark lineage is not currently enabled, supported, or recommended for production
use. If you are interested in this feature, try it out in a test environment until we address the
issues and limitations that block production readiness.
• Security
– Role-Based Access Control - support assigning groups to roles that constrain access to Navigator features
– Authentication - LDAP and Active Directory authentication of Navigator users
– SSL - enable SSL for encrypted communication
• API
– Version changed to v3
– Supports auditing and policies
After HDFS upgrade, some changes to HDFS entities may not appear in Navigator
After an HDFS upgrade, Navigator might not detect changes to HDFS entities, such as move, rename, and delete
operations, that were recorded only in the HDFS edit logs before the upgrade. This may cause an inconsistent
view of HDFS entities between Navigator and HDFS.
Workaround: None.
Lineage performance suffers when more than 10000 relations are extracted
If more than 10000 relations must be traversed for a lineage diagram, performance suffers. This can occur in
cases where there are thousands of files in a directory or hundreds of columns in a table.
Spurious errors about missing database connectors are reported in the Metadata Server log file
Workaround: Ignore the errors.
Metadata component in Cloudera Navigator 1.2 (included with Cloudera Manager 5.0) cannot be upgraded to
2.0
Workaround:
Cloudera does not provide an upgrade path from the Navigator Metadata component, which was a beta release
in Cloudera Navigator 1.2, to the Cloudera Navigator 2 release. If you are upgrading from Cloudera Navigator 1.2
(included with Cloudera Manager 5.0), you must perform a clean install of Cloudera Navigator 2. Therefore, if you
have Cloudera Navigator roles from a 1.2 release:
1. Delete the Navigator Metadata Server role.
2. Remove the contents of the Navigator Metadata Server storage directory.
3. Add the Navigator Metadata Server role according to the process described in Adding the Navigator Metadata
Server Role.
4. Clear the cache of any browser that had used the 1.2 release of the Navigator Metadata component. Otherwise,
you may observe errors in the Navigator Metadata UI.
See Upgrading Cloudera Navigator.
Facet count should show (0) instead of (-) when there are no matching entities
Facet counts show (-) when the Metadata Server does not return a value; they should show (0) instead.
Workaround: None.
Sentry auditing does not work if the Python version is lower than 2.5
Navigator Audit Server reports invalid null characters in HBase audit events when using the PostgreSQL database
Navigator Audit Server reports invalid null characters in HBase audit events when using the PostgreSQL database.
HBase allows null characters in qualifiers, so now Navigator escapes them.
The audit reports UI now returns results when there are a large number of audit records
The audit reports UI was not returning results when there were a large number of audit records matching a
particular time period, especially when the period included multiple days. The UI is now also much more responsive.
Navigator Audit Server no longer throws OOM for very long Impala queries
If Hue is added after Navigator, search results do not have links to Hue
In the Metadata UI, search results contain links to an appropriate application in Hue. However, if you add a Hue
service after Navigator roles, there will be no links to Hue.
Workaround:
1. Set the cluster's display name and name properties to be the same:
a. Get the cluster's name and display name using the following API endpoint:
http://hostname:7180/api/v6/clusters (a sample request follows step 2).
b. In the Cloudera Manager Admin Console, at the right of the cluster name, click the down arrow and select
Rename Cluster. Set the cluster display name to match its name.
2. Restart the Navigator Metadata server.
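For example, the cluster names can be retrieved with a request such as the following; the credentials shown
are illustrative:
curl -u admin:admin 'http://hostname:7180/api/v6/clusters'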
LDAP lookups in Active Directory to resolve group membership are now working
Dropping a Hive table and creating a view with same name or vice versa no longer raises an error
HDFS extraction now works after upgrading CDH from 5.1 to 5.2
Setting a property in the Hue advanced configuration snippet no longer throws a "too many Boolean clauses"
error in Navigator Metadata
Memory leak in Navigator Audit Server due to error during batch operations
The "allowed" query selector is missing from the audit REST API
Queries such as http://hostname:7180/api/v7/audits?maxResults=10&query=allowed==false are
now supported.
Workaround: None.
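For example, such a query can now be issued against the API; the credentials shown are illustrative:
curl -u admin:admin 'http://hostname:7180/api/v7/audits?maxResults=10&query=allowed==false'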
When you specify an end date for the created property, no results are returned.
Workaround: Clear the end date control or specify an end date of TO+*%5D%22%7D.
Navigating back to the parent entity in a Pig lineage diagram sometimes displays the error: Cannot read property
'x' of undefined.
Workaround: None.
A Pig job that has relations with itself is unreadable in the lineage view.
The Metadata UI currently does not handle the situation where there is a data-flow relation between elements
that are also related via a parent-child relation.
Workaround: None.
If auditing is enabled, during an upgrade from Cloudera Manager 4.6.3 to 4.7, the Impala service won't start.
Impala auditing requires the Audit Log Directory property to be set, but the upgrade process fails to set the
property.
Workaround: Do one of the following:
• Stop the Cloudera Navigator Audit server.
• Ensure that you have a Cloudera Navigator license and manually set the property to
/var/log/impalad/audit.
Empty box appears over the list of results after adding a tag to a file
When tags are added to an entity, in some cases a white box remains after pressing Enter.
Workaround: Refresh the page to remove the artifact.
Certain complex multi-level lineages, such as directory/file and database/table, may not be fully represented
visually.
Workaround: None.
JDK Compatibility
This topic contains compatibility information across versions of JDK and CDH/Cloudera Manager.
Cloudera Manager Version    Oracle JDK 1.6    Oracle JDK 1.7    Oracle JDK 1.8
Cloudera Manager 5.4.x      1.6.0_31          1.7.0_75          1.8.0_40
Cloudera Manager 5.3.x      1.6.0_31          1.7.0_67          1.8.0_11
Cloudera Manager 5.2.x      1.6.0_31          1.7.0_67          Not Supported
Cloudera Manager 5.1.x      1.6.0_31          1.7.0_55          Not Supported
Cloudera Manager 5.0.x      1.6.0_31          1.7.0_45          Not Supported
Cloudera Manager 4.8.x      1.6.0_31          1.7.0_45          Not Supported
Cloudera Manager 4.7.x      1.6.0_31          1.7.0_45          Not Supported
Note: 1.7.0_55 has been certified against Cloudera Manager 4.8.3.
Note: Starting with CDH 4.4, all cluster nodes and services must run the same JDK version (that is, all
deployed on JDK 6 or all deployed on a supported JDK 7 version).
Important:
JDK 1.6 is not supported on any CDH 5 release, but before CDH 5.4.0, CDH libraries were compatible
with JDK 1.6. As of CDH 5.4.0, CDH libraries are no longer compatible with JDK 1.6, and applications
using CDH libraries must use JDK 1.7.
CDH Version    Oracle JDK 1.6    Oracle JDK 1.7    Oracle JDK 1.8
CDH 5.4.x      Not Supported     1.7.0_75          1.8.0_40
CDH 5.3.x      Not Supported     1.7.0_67          1.8.0_11
CDH 5.2.x      Not Supported     1.7.0_67          Not Supported
CDH 5.1.x      Not Supported     1.7.0_55          Not Supported
CDH 5.0.x      Not Supported     1.7.0_55          Not Supported
CDH 4.7.x      1.6.0_31          1.7.0_55          Not Supported
CDH 4.6.x      1.6.0_31          1.7.0_55          Not Supported
CDH 4.5.x      1.6.0_31          1.7.0_55          Not Supported
CDH 4.4.x      1.6.0_31          1.7.0_15          Not Supported
CDH 4.3.x      1.6.0_31          1.7.0_15          Not Supported
CDH 4.2.x      1.6.0_31          1.7.0_15          Not Supported
Note: Cloudera Manager 4.5.1 supports CDH 4.2 running with JDK 7, with restrictions.
Note:
The Cloudera Manager minor version must always be equal to or greater than the CDH minor version
because older versions of Cloudera Manager may not support features in newer versions of CDH. For
example, if you want to upgrade to CDH 5.1.2 you must first upgrade to Cloudera Manager 5.1 or
higher.
Apache Accumulo
This matrix contains compatibility information across versions of Apache Accumulo, CDH, and Cloudera
Manager. For detailed information on each release, see the Apache Accumulo documentation.
Cloudera Impala
This matrix contains compatibility information across versions of Cloudera Impala and CDH/Cloudera Manager.
For detailed information on each release, see Cloudera Impala documentation.
Note: The Impala 2.2.x maintenance releases now use the CDH 5.4.x numbering system rather than
increasing the Impala version numbers. Impala 2.2 and higher are not available under CDH 4.
Impala Version | Compatible Cloudera Manager Versions | Compatible CDH Versions | Included in CDH Version
Cloudera Impala 2.2.0 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.4.0 | CDH 5.4.0
Cloudera Impala 2.1.3 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.3.3 | CDH 5.3.3
Cloudera Impala 2.1.2 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.3.2 | CDH 5.3.2
Cloudera Impala 2.1.0 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.3.0 | CDH 5.3.0
Cloudera Impala 2.1.x for CDH 4 | Cloudera Manager 4.8.0 - 4.x.x, Cloudera Manager 5.0.0 - 5.x.x | CDH 4.1.0 and later 4.x.x | No
Cloudera Impala 2.0.4 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.2.5 | CDH 5.2.5
Cloudera Impala 2.0.3 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.2.4 | CDH 5.2.4
Cloudera Impala 2.0.2 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.2.3 | CDH 5.2.3
Cloudera Impala 2.0.1 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.2.1 | CDH 5.2.1
Cloudera Impala 2.0.0 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.2.0 | CDH 5.2.0
Cloudera Impala 2.0.x for CDH 4 | Cloudera Manager 4.8.0 - 4.x.x, Cloudera Manager 5.0.0 - 5.x.x | CDH 4.1.0 and later 4.x.x | No
Cloudera Impala 1.4.4 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.1.5 | CDH 5.1.5
Cloudera Impala 1.4.3 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.1.4 | CDH 5.1.4
Cloudera Impala 1.4.2 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.1.3 | CDH 5.1.3
Cloudera Impala 1.4.1 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.1.2 | CDH 5.1.2
Cloudera Impala 1.4.0 | Cloudera Manager 5.0.0 - 5.x.x; Recommended: Cloudera Manager 5.1.0 | CDH 5.1.0 | CDH 5.1.0
Cloudera Impala 1.4.x for CDH 4 | Cloudera Manager 4.8.0 - 4.x.x, Cloudera Manager 5.0.0 - 5.x.x | CDH 4.1.0 - 4.x.x | No
Cloudera Impala 1.3.3 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.0.5 | CDH 5.0.5
Cloudera Impala 1.3.2 | Cloudera Manager 5.0.0 - 5.x.x | CDH 5.0.4 | CDH 5.0.4
Apache Kafka
Apache Kafka is a distributed commit log service that functions much like a publish/subscribe messaging system,
but with better throughput, built-in partitioning, replication, and fault tolerance. Kafka is currently distributed
in a parcel that is independent of the CDH parcel and integrates with Cloudera Manager using a Custom Service
Descriptor (CSD).
Note: Kafka is only supported on parcel-deployed clusters. Do not use it on a cluster deployed using
packages or a tarball.
Cloudera Navigator
This matrix contains compatibility information across versions of Cloudera Navigator, Cloudera Manager, and
CDH. For detailed information on each release, see Cloudera Navigator documentation.
Cloudera Navigator 2.2.x
• Features: Auditing, Metadata, and Security
• Cloudera Manager Version: 5.3.0
• Audit Component: HDFS, HBase - 4.0.0; Hue - 4.2.0; Hive - 4.2.0 (4.4.0 for operations denied due to lack of privileges); Sentry - 5.1.0
• Impala: Impala 1.2.1 with CDH 4.4.0
• Spark: Not Supported
Cloudera Navigator 2.1.x
• Features: Auditing, Metadata, and Security
• Cloudera Manager Version: 5.2.0
• Audit Component: HDFS, HBase - 4.0.0; Hue - 4.2.0; Hive - 4.2.0 (4.4.0 for operations denied due to lack of privileges); Sentry - 5.1.0
• Metadata Component: HDFS, Hive, Oozie, MapReduce, Sqoop 1 - 4.4.0; Pig - 4.6.0; YARN - 5.0.0
• Impala: Impala 1.2.1 with CDH 4.4.0
• Spark: Not Supported
Cloudera Navigator 2.0.1
• Features: Auditing, Metadata, and Security
• Cloudera Manager Version: 5.1.2
• Audit Component: HDFS, HBase - 4.0.0; Hue - 4.2.0; Hive - 4.2.0 (4.4.0 for operations denied due to lack of privileges); Sentry - 5.1.0
• Metadata Component: HDFS, Hive, Oozie, MapReduce, Sqoop 1 - 4.4.0; Pig - 4.6.0; YARN - 5.0.0
• Impala: Impala 1.2.1 with CDH 4.4.0
• Spark: Not Supported
Cloudera Navigator 2.0.0
• Features: Auditing, Metadata, and Security
• Cloudera Manager Version: 5.1.0
• Audit Component: HDFS, HBase - 4.0.0; Hue - 4.2.0; Hive - 4.2.0 (4.4.0 for operations denied due to lack of privileges); Sentry - 5.1.0
• Metadata Component: HDFS, Hive, Oozie, MapReduce, Sqoop 1 - 4.4.0; Pig - 4.6.0; YARN - 5.0.0
• Impala: Impala 1.2.1 with CDH 4.4.0
• Spark: Not Supported
Cloudera Navigator 1.2.x
• Features: Auditing
• Cloudera Manager Version: 5.0.0
• Audit Component: HDFS, HBase - 4.0.0; Hue - 4.2.0; Hive - 4.2.0 (4.4.0 for operations denied due to lack of privileges)
• Impala: Impala 1.2.1 with CDH 4.4.0
• Spark: Not Supported
Cloudera Navigator 1.1.x
• Features: Auditing
• Cloudera Manager Version: 4.8.0 and 4.7.0
• Audit Component: HDFS, HBase - 4.0.0; Hive, Hue - 4.2.0
• Impala: Impala 1.1.1 with CDH 4.4.0
• Spark: Not Supported
Cloudera Search
This topic contains compatibility information across versions of Cloudera Search and CDH/Cloudera Manager.
For detailed documentation, see Cloudera Search with CDH 4 and Cloudera Search with CDH 5. Compatibility
information for Cloudera Search depends on the version of CDH you are using.
Note: It is possible for a single cluster to use both the Sentry service (for Hive and Impala) and Sentry
policy files (for Solr).
Apache Spark
Spark is a fast, general engine for large-scale data processing. For installation and configuration instructions,
see Spark Installation. To see new features introduced with each release, refer to the CDH 5 Release Notes on
page 5.