PowerScale OneFS HDFS Reference Guide OneFS 8.1.2.0 - 9.3.0.0
October 2021
Notes, cautions, and warnings
NOTE: A NOTE indicates important information that helps you make better use of your product.
CAUTION: A CAUTION indicates either potential damage to hardware or loss of data and tells you how to avoid
the problem.
WARNING: A WARNING indicates a potential for property damage, personal injury, or death.
© 2016 - 2021 Dell Inc. or its subsidiaries. All rights reserved. Dell, EMC, and other trademarks are trademarks of Dell Inc. or its subsidiaries.
Other trademarks may be trademarks of their respective owners.
Introduction to this guide
This guide provides information for PowerScale OneFS and Hadoop Distributed File System (HDFS) administrators when
implementing a PowerScale OneFS and Hadoop system integration. This guide describes how you can use the PowerScale
OneFS Web administration interface (Web UI) and command-line interface (CLI) to configure and manage your PowerScale and
Hadoop clusters.
Topics:
• Where to get help
• Copyright
Copyright
Copyright © 2021 Dell Inc. or its subsidiaries. All rights reserved.
Published October 2021
Dell believes the information in this publication is accurate as of its publication date. The information is subject to change
without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS-IS.” DELL MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND
WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. USE, COPYING, AND DISTRIBUTION OF ANY DELL SOFTWARE
DESCRIBED IN THIS PUBLICATION REQUIRES AN APPLICABLE SOFTWARE LICENSE.
Topics:
• How Hadoop is implemented on OneFS
• Hadoop distributions supported by OneFS
• HDFS files and directories
• Hadoop user and group accounts
• HDFS and SmartConnect
Each node boosts performance and expands the cluster's capacity. For Hadoop analytics, the PowerScale scale-out distributed
architecture minimizes bottlenecks, rapidly serves Big Data, and optimizes performance.
How a PowerScale OneFS Hadoop implementation differs from a traditional Hadoop deployment
Topics:
• Activate the HDFS license
• Configuring the HDFS service
• Configuring HDFS authentication methods
• Creating a local Hadoop user
• Enabling the WebHDFS REST API
• Configuring secure impersonation
• Configuring virtual HDFS racks
• Configuring HDFS wire encryption
• Configuring HDFS transparent data encryption
2. If your modules are not licensed, obtain a license key from your PowerScale sales representative. To activate the license,
type the following command, where <license file path> is the location of your license file:
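A minimal sketch of the activation command, assuming the OneFS isi license syntax (the path is a placeholder):

isi license add --path <license file path>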
Setting    Description
Block size    The HDFS block size setting on the PowerScale cluster determines how the HDFS service returns data on read requests from Hadoop compute clients.
You can modify the HDFS block size on the cluster to increase the block size from 4 KB up to 1 GB. The default block size is 128 MB. Increasing the block size enables the PowerScale cluster nodes to read and write HDFS data in larger blocks and optimizes performance for most use cases.
The Hadoop cluster maintains a different block size that determines how a Hadoop compute client writes a block of file data to the PowerScale cluster. The optimal block size depends on your data, how you process your data, and other factors. You can configure the block size on the Hadoop cluster through the dfs.block.size property in the hdfs-site.xml configuration file.
You must specify the block size in bytes. Suffixes K, M, and G are allowed.
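For example, a client-side hdfs-site.xml entry that sets a 128 MB block size might look like the following (the value is in bytes; this snippet is illustrative):

<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>

On the OneFS side, a hypothetical command to raise the block size for an access zone, assuming the --default-block-size option of isi hdfs settings modify:

isi hdfs settings modify --default-block-size=256M --zone=zone3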
The following command sets the checksum type to crc32 in the zone3 access zone:
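A plausible form, assuming a --checksum-type option on isi hdfs settings modify:

isi hdfs settings modify --checksum-type=crc32 --zone=zone3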
Authentication method    Description
Simple only    Requires only a username to establish client connections.
Kerberos only    Requires Kerberos credentials to establish client connections.
NOTE: You must configure Kerberos as an authentication provider on the PowerScale cluster, and
you must modify the core-site.xml file on clients running Hadoop 2.2 and later.
The following command specifies that Hadoop compute clients connecting to zone3 must be identified through the Kerberos
authentication method:
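A sketch of the command, using the --authentication-mode option documented in the command reference:

isi hdfs settings modify --authentication-mode=kerberos_only --zone=zone3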
To ensure that users can authenticate through Kerberos, you must modify the core-site.xml file on clients running Hadoop
2.2 and later.
Note that if you edit the core-site.xml and hdfs-site.xml files directly with a text editor per the instructions below, the changes take effect but are likely to be overwritten later, because Ambari and Cloudera Manager frequently rewrite these two configuration files. If you manage the cluster with Ambari or Cloudera Manager, we recommend that you make any configuration changes through their respective user interfaces instead.
1. Go to the $HADOOP_CONF directory on your Hadoop client.
2. Open the core-site.xml file in a text editor and add the following properties:
<property>
<name>hadoop.security.token.service.use_ip</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.kerberos.principal.pattern</name>
<value>hdfs/*@storage.company.com</value>
</property>
"/\[]:;|=,+*?<>
The following command designates hadoop-user23 in zone1 as a new proxy user and adds the group hadoop-users to the list of
members that the proxy user can impersonate:
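A sketch, using the options documented for isi hdfs proxyusers create:

isi hdfs proxyusers create hadoop-user23 --add-group=hadoop-users --zone=zone1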
The following command designates hadoop-user23 in zone1 as a new proxy user and adds UID 2155 to the list of members that
the proxy user can impersonate:
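A sketch, again assuming the documented isi hdfs proxyusers create options:

isi hdfs proxyusers create hadoop-user23 --add-uid=2155 --zone=zone1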
3. To view the configuration details for a specific proxy user, run the isi hdfs proxyusers view command.
The following command displays the configuration details for the hadoop-user23 proxy user in zone1:
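A sketch of the view command:

isi hdfs proxyusers view hadoop-user23 --zone=zone1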
The following command creates a rack named /hdfs-rack2 in the zone5 access zone, specifies 120.135.26.10-120.135.26.20
as the IP address range of Hadoop compute clients associated with the rack, and specifies subnet0:pool0 as the IP address pool
of OneFS nodes assigned to the rack:
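A sketch, using the options documented for isi hdfs racks create:

isi hdfs racks create /hdfs-rack2 --client-ip-ranges=120.135.26.10-120.135.26.20 --ip-pools=subnet0:pool0 --zone=zone5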
The following command adds 120.135.26.30-120.135.26.40 to the list of existing Hadoop compute client IP addresses assigned
to /hdfs-rack2 in the zone3 access zone:
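A sketch, using the --add-client-ip-ranges option documented for isi hdfs racks modify:

isi hdfs racks modify /hdfs-rack2 --add-client-ip-ranges=120.135.26.30-120.135.26.40 --zone=zone3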
In addition to adding a range to the list of existing ranges, you can modify the client IP address ranges by replacing the current
ranges, deleting a specific range, or deleting all ranges.
The following command replaces the existing IP pools with subnet1:pool1 and subnet2:pool2 assigned to /hdfs-rack2 in the
zone3 access zone:
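A sketch, using the --ip-pools option, which overwrites the existing pool list:

isi hdfs racks modify /hdfs-rack2 --ip-pools=subnet1:pool1,subnet2:pool2 --zone=zone3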
In addition to replacing the list of existing pools with new pools, you can modify the IP pools by adding pools to the list of
current pools, deleting a specific pool, or deleting all pools.
The following command displays setting details for all virtual HDFS racks configured in the zone1 access zone:
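A likely form, assuming the isi hdfs racks list subcommand:

isi hdfs racks list --zone=zone1 --verbose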
3. To view the setting details for a specific virtual HDFS rack, run the isi hdfs racks view command:
Each rack name begins with a forward slash—for example /hdfs-rack2.
HDFS wire encryption supported by OneFS is different from the Apache HDFS Transparent Data Encryption technology.
NOTE: When HDFS wire encryption is enabled, there is a significant impact on the HDFS protocol throughput and I/O
performance.
Option    Description
To enable HDFS wire encryption    Select one of the Advanced Encryption Standard (AES) ciphers: AES/CTR/NoPadding with 128-bit key, AES/CTR/NoPadding with 192-bit key, or AES/CTR/NoPadding with 256-bit key.
To disable HDFS wire encryption    Select Do not encrypt data.
3. Click Save Settings.
Option Description
To enable HDFS wire encryption    Set the encryption_argument to one of the Advanced Encryption Standard (AES) ciphers: aes_128_ctr, aes_192_ctr, or aes_256_ctr.
For example:
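A sketch using the --data-transfer-cipher option from the command reference (the zone name is illustrative):

isi hdfs settings modify --data-transfer-cipher=aes_256_ctr --zone=zone3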
If you do not want to add the -provider option to the necessary hadoop key <operation> command, find your
environment below and set the hadoop.security.key.provider.path property.
Other KMS
● HDP version earlier than 2.6.x: Set the property in HDFS > Configs > Advanced > Custom core-site. If the property is not configured automatically, add it to the custom core-site.xml file.
● HDP version 3.0.1 or later: Set the property in Ambari > Services > OneFS > Configs > Advanced > Custom core-site.
Steps:
HDP version < 2.6.x -- not using Ranger KMS
a. Navigate to HDFS > Configs > Advanced > Custom core-site.
b. Click Add Property.
c. Enter the property as: hadoop.security.key.provider.path=kms://<kms-url>/kms
For example,
hadoop.security.key.provider.path=kms://http@<kms-host>:1688/kms
d. Click Add.
e. Save settings.
HDP version 3.0.1 or later -- any KMS
a. Navigate to Ambari > Services > OneFS > Configs > Advanced > Custom core-site.
b. Click Add Property.
c. Enter the property as: hadoop.security.key.provider.path=kms://<kms-url>/kms
For example,
hadoop.security.key.provider.path=kms://http@<kms-host>:9292/kms
d. Click Add.
e. Save settings.
Authorization Exception Errors
The OneFS Key Management Server is configured per access zone. If you receive an Authorization Exception error similar to the following:
For example:
When you run the isi hdfs crypto settings command, note in the output that port 9292 is specific to the Ranger KMS server. If you are using a different KMS server, a different port may be required. Do not use port 6080; that port is for the Ranger UI only.
The HDFS root path in the example above is /ifs/hdfs. Change this to your HDFS root path, followed by the empty directory corresponding to your encryption zone (for example, /ifs/hdfs/A).
Important: Do not create the encryption zone from a DFS client in the Hadoop cluster. The encryption zone must be created using the OneFS CLI as shown above; otherwise, an error similar to the following appears on the console and in the OneFS hdfs.log file:
With the encryption zone defined on the OneFS cluster, you will be able to list the encryption zone immediately from any
DFS client in the Hadoop cluster.
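A sketch of creating an encryption zone from the OneFS CLI, assuming the positional <path> <keyname> syntax shown in the command reference (the path and key name are illustrative):

isi hdfs crypto encryption-zones create /ifs/hdfs/A mykey1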
4. The next step is to test the reading/writing of a file to the created encryption zone by an authorized user.
The ambari-qa user is the default smoke-test user that comes with HDP. For this test, the KMS server is updated to allow
the ambari-qa user to obtain the keys and metadata as well as to generate and decrypt encryption keys. With the policy
updated on the KMS server for the ambari-qa user, you can proceed to test writing and reading a test.txt file from the
created encryption zone on OneFS (for example, to the /A encryption zone as in the previous example) from a DFS client in
the Hadoop cluster as the ambari-qa user.
cat test.txt
hdfs dfs -put test.txt /A
hdfs dfs -ls /A
hdfs dfs -cat /A/test.txt
5. Verify that the test file is actually encrypted on the OneFS cluster by logging in to OneFS as the root administrator and
displaying the contents of the test file in the test directory (/ifs/hdfs/A in our example).
cd /ifs/hdfs/A
ls
cat test.txt
Result: You should see that the contents of the test file are encrypted and the original text is not displayed even by the
privileged root user on OneFS. The test file created by the ambari-qa user has read permissions for both the Hadoop
group and everyone, since the "hive" user is defined in the KMS with decrypt privileges. The "hive" user can decrypt the
file created by the ambari-qa user, but cannot place any files into the encryption zone, in this case /A, since the write
permission is missing for the Hadoop group that the "hive" user is a member of.
7. If you need to delete the encryption zone, deleting the encryption zone directory on OneFS is sufficient to delete the
encryption zone.
Topics:
• HDFS commands
HDFS commands
The following OneFS commands help you manage your OneFS and Hadoop system integration.
Syntax
Options
--kms-url <string>
Specifies the URL of the Key Management Server.
Syntax
Options
Syntax
Options
<path> <keyname>
Specifies a directory and key name for the encryption zone.
The encryption zone must be somewhere within the HDFS root directory for that zone.
Syntax
Options
Syntax
Options
--generation-interval <string>
The interval between successive FSImages.
--help
Display help for this command.
Syntax
Options
--help
Display help for this command.
--zone <string>
The access zone to which the HDFS settings apply.
Syntax
Options
--help
Display help for this command.
--zone <string>
The access zone to which the HDFS settings apply.
Syntax
Options
--force | -f
Do not prompt for confirmation.
--verbose | -v
Display more detailed information.
--zone <string>
The access zone to which the HDFS settings apply.
Syntax
Options
--help
Display help for this command.
--zone <string>
The access zone to which the HDFS settings apply.
Syntax
Options
--enabled {yes | no}
Enables or disables the HDFS FSImage service, which allows access to the FSImage and starts FSImage generation. The HDFS FSImage service is disabled by default. Enable this service only on a Hadoop-enabled access zone that will use Cloudera Navigator.
--help
Display help for this command.
--verbose | -v
Display more detailed information.
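A hypothetical invocation, assuming this section documents the isi hdfs fsimage settings modify command:

isi hdfs fsimage settings modify --enabled=yes --zone=zone1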
Syntax
Options
--help
Display help for this command.
--zone <string>
The access zone to which the HDFS settings apply.
Syntax
Options
--help
Display help for this command.
--zone <string>
The access zone to which the HDFS settings apply.
Syntax
Options
--enabled {yes | no}
Allows access to FSImage and starts FSImage generation. The HDFS FSImage service is disabled by
default. This service should only be enabled on a Hadoop-enabled access zone that will use Cloudera
Navigator.
--help
Display help for this command.
--maximum-delay <string>
The maximum duration until an edit event is reported in INotify.
--retention <string>
The minimum duration for which edits are retained.
--verbose | -v
Display more detailed information.
--zone <string>
The access zone to which the HDFS settings apply.
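A hypothetical invocation, assuming this section documents the isi hdfs inotify settings modify command:

isi hdfs inotify settings modify --enabled=yes --zone=zone1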
Syntax
Options
--force | -f
Do not prompt for confirmation.
--verbose | -v
Display more detailed information.
--zone <string>
The access zone to which the HDFS settings apply.
Syntax
Syntax
Options
--set {always | error | warning | info | verbose | debug | trace | default}
Sets the default logging level for the HDFS service on the cluster. The default value is default.
--verbose | -v
Displays more detailed information.
Syntax
Options
There are no options for this command.
Syntax
Options
<proxyuser-name>
Specifies the user name of a user currently configured on the cluster to be designated as a proxy user.
--add-gid <group-identifier>...
Adds the group specified by UNIX GID to the list of proxy user members. The proxy user can
impersonate any user in the group. The users in the group must authenticate to the same access zone as
the proxy user. You can specify multiple UNIX GIDs in a comma-separated list.
--add-group <group-name>...
Adds the group specified by name to the list of proxy user members. The proxy user can impersonate
any user in the group. The users in the group must authenticate to the same access zone as the proxy
user. You can specify multiple group names in a comma-separated list.
--add-sid <security-identifier>...
Adds the user, group of users, machine or account specified by Windows SID to the list of proxy user
members. The object must authenticate to the same access zone as the proxy user. You can specify
multiple Windows SIDs in a comma-separated list.
--add-uid <user-identifier>...
Adds the user specified by UNIX UID to the list of members the proxy user can impersonate. The user
must authenticate to the same access zone as the proxy user. You can specify multiple UNIX UIDs in a
comma-separated list.
--add-user <user-name>...
Adds the user specified by name to the list of members the proxy user can impersonate. The user must
authenticate to the same access zone as the proxy user. You can specify multiple user names in a
comma-separated list.
--add-wellknown <well-known-name>...
Adds the well-known user specified by name to the list of members the proxy user can impersonate. The
well-known user must authenticate to the same access zone as the proxy user. You can specify multiple
well-known user names in a comma-separated list.
--verbose | -v
Displays more detailed information.
--zone <zone-name>
Specifies the access zone the user authenticates through.
Examples
The following command designates hadoop-user23 in zone1 as a new proxy user:
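A sketch of the command:

isi hdfs proxyusers create hadoop-user23 --zone=zone1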
The following command designates hadoop-user23 in zone1 as a new proxy user and adds the group of users named hadoop-
users to the list of members that the proxy user can impersonate:
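A sketch:

isi hdfs proxyusers create hadoop-user23 --add-group=hadoop-users --zone=zone1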
The following command designates hadoop-user23 in zone1 as a new proxy user and adds UID 2155 to the list of members that
the proxy user can impersonate:
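A sketch:

isi hdfs proxyusers create hadoop-user23 --add-uid=2155 --zone=zone1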
Syntax
isi hdfs proxyusers modify <proxyuser-name>
[--add-group <group-name>...]
[--add-gid <group-identifier>...]
[--add-user <user-name>...]
[--add-uid <user-identifier>...]
[--add-sid <security-identifier>...]
[--add-wellknown <well-known-name>...]
[--remove-group <group-name>...]
[--remove-gid <group-identifier>...]
[--remove-user <user-name>...]
[--remove-uid <user-identifier>...]
[--remove-sid <security-identifier>...]
[--remove-wellknown <well-known-name>...]
[--verbose | -v]
[--zone <zone-name>]
Options
<proxyuser-name>
Specifies the user name of the proxy user to be modified.
--add-group <group-name>...
Adds the group specified by name to the list of proxy user members. The proxy user can impersonate
any user in the group. The users in the group must authenticate to the same access zone as the proxy
user. You can specify multiple group names in a comma-separated list.
--add-gid <group-identifier>...
Adds the group specified by UNIX GID to the list of proxy user members. The proxy user can
impersonate any user in the group. The users in the group must authenticate to the same access zone as
the proxy user. You can specify multiple UNIX GIDs in a comma-separated list.
--add-user <user-name>...
Adds the user specified by name to the list of members the proxy user can impersonate. The user must
authenticate to the same access zone as the proxy user. You can specify multiple user names in a
comma-separated list.
--add-uid <user-identifier>...
Adds the user specified by UNIX UID to the list of members the proxy user can impersonate. The user
must authenticate to the same access zone as the proxy user. You can specify multiple UNIX UIDs in a
comma-separated list.
--add-sid <security-identifier>...
Adds the user, group of users, machine or account specified by Windows SID to the list of proxy user
members. The object must authenticate to the same access zone as the proxy user. You can specify
multiple Windows SIDs in a comma-separated list.
--add-wellknown <well-known-name>...
Adds the well-known user specified by name to the list of members the proxy user can impersonate. The
well-known user must authenticate to the same access zone as the proxy user. You can specify multiple
well-known user names in a comma-separated list.
--remove-group <group-name>...
Removes the group specified by name from the list of proxy user members so that the proxy user can no
longer impersonate any user in the group. You can specify multiple group names in a comma-separated
list.
--remove-gid <group-identifier>...
Removes the group specified by UNIX GID from the list of proxy user members. You can specify multiple UNIX GIDs in a comma-separated list.
--remove-user <user-name>...
Removes the user specified by name from the list of members the proxy user can impersonate. You can specify multiple user names in a comma-separated list.
--remove-uid <user-identifier>...
Removes the user specified by UNIX UID from the list of members the proxy user can impersonate. You can specify multiple UNIX UIDs in a comma-separated list.
--remove-sid <security-identifier>...
Removes the user, group of users, machine or account specified by Windows SID from the list of proxy user members. You can specify multiple Windows SIDs in a comma-separated list.
--remove-wellknown <well-known-name>...
Removes the well-known user specified by name from the list of members the proxy user can impersonate. You can specify multiple well-known user names in a comma-separated list.
--verbose | -v
Displays more detailed information.
--zone <zone-name>
Specifies the access zone the proxy user authenticates through.
Examples
The following command adds the well-known local user to, and removes the user whose UID is 2155 from, the list of members
for proxy user hadoop-user23 in zone1:
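A sketch, combining the add and remove options documented above:

isi hdfs proxyusers modify hadoop-user23 --add-wellknown=LOCAL --remove-uid=2155 --zone=zone1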
Syntax
Options
<proxyuser-name>
Specifies the user name of the proxy user to be deleted.
--force | -f
Deletes the specified proxy user without requesting confirmation.
--verbose | -v
Displays more detailed information.
--zone <zone-name>
Specifies the access zone that the proxy user authenticates through.
Syntax
Options
<proxyuser-name>
Specifies the name of the proxy user.
--format {table | json | csv | list}
Displays output in table (default), JavaScript Object Notation (JSON), comma-separated value (CSV),
or list format.
--no-footer
Displays table output without footers.
--no-header
Displays table and CSV output without headers.
--verbose | -v
Displays more detailed information.
--zone <zone-name>
Specifies the access zone the proxy user authenticates through.
Examples
The following command displays a detailed list of the users and groups that are members of proxy user hadoop-user23 in zone1:
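A likely form, assuming an isi hdfs proxyusers members list subcommand whose verbose output is shown below:

isi hdfs proxyusers members list hadoop-user23 --zone=zone1 --verbose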
Type: user
Name: krb_user_005
ID: UID:1004
--------------------------------------------------------------------------------
Type: group
Name: krb_users
ID: SID:S-1-22-2-1003
--------------------------------------------------------------------------------
Type: wellknown
Name: LOCAL
ID: SID:S-1-2-0
Syntax
Options
--format {table | json | csv | list}
Displays output in table (default), JavaScript Object Notation (JSON), comma-separated value (CSV),
or list format.
--no-footer
Displays table output without footers.
--no-header
Displays table and CSV output without headers.
--verbose | -v
Displays more detailed information.
--zone <zone-name>
Specifies the name of the access zone.
Examples
The following command displays a list of all proxy users that are configured in zone1:
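A likely form of the command, whose output is shown below:

isi hdfs proxyusers list --zone=zone1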
Name
-------------
hadoop-user23
hadoop-user25
hadoop-user28
-------------
Total: 3
Syntax
Examples
The following command displays the configuration details for the hadoop-user23 proxy user in zone1:
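A likely form of the command, whose output is shown below:

isi hdfs proxyusers view hadoop-user23 --zone=zone1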
Name: hadoop-user23
Members: krb_users
LOCAL
krb_user_004
Syntax
Options
<rack-name>
Specifies the name of the virtual HDFS rack. The rack name must begin with a forward slash—for
example, /example-name.
--client-ip-ranges <low-ip-address>-<high-ip-address>...
Specifies IP address ranges of external Hadoop compute clients assigned to the virtual rack.
--ip-pools <subnet>:<pool>...
Assigns a pool of OneFS cluster IP addresses to the virtual rack.
--verbose | -v
Displays more detailed information.
--zone <string>
Specifies the access zone that will contain the virtual rack.
Syntax
Options
<rack-name>
Specifies the virtual HDFS rack to be modified. Each rack name begins with a forward slash—for
example /example-name.
--add-client-ip-ranges <low-ip-address>-<high-ip-address>...
Adds a specified IP address range of external Hadoop compute clients to the virtual rack.
--add-ip-pools <subnet>:<pool>...
Adds a specified pool of OneFS cluster IP addresses to the virtual rack.
--clear-client-ip-ranges
Removes all IP address ranges of external Hadoop compute clients from the virtual rack.
--clear-ip-pools
Removes all pools of OneFS cluster IP addresses from the virtual rack.
--client-ip-ranges <low-ip-address>-<high-ip-address>...
Specifies IP address ranges of external Hadoop compute clients assigned to the virtual rack. The value
assigned through this option overwrites any existing IP address ranges. You can add a new range
through the --add-client-ip-ranges option.
--ip-pools <subnet>:<pool>...
Assigns pools of OneFS node IP addresses to the virtual rack. The value assigned through this option
overwrites any existing IP address pools. You can add a new pool through the --add-ip-pools
option.
--name <rack-name>
Assigns a new name to the specified virtual rack. The rack name must begin with a forward slash—for
example /example-name.
--remove-client-ip-ranges <low-ip-address>-<high-ip-address>...
Removes a specified IP address range of external Hadoop compute clients from the virtual rack. You can
only remove an entire range; you cannot delete a subset of a range.
--remove-ip-pools <subnet>:<pool>...
Removes a specified pool of OneFS cluster IP addresses from the virtual rack.
--verbose | -v
Displays more detailed information.
--zone <string>
Specifies the access zone that contains the virtual rack you want to modify.
Syntax
Options
<rack-name>
Deletes the specified virtual HDFS rack. Each rack name begins with a forward slash—for example, /example-name.
--force | -f
Suppresses command-line prompts and messages.
--verbose | -v
Displays more detailed information.
--zone <string>
Specifies the access zone that contains the virtual rack you want to delete.
Syntax
Options
--format {table | json | csv | list}
Display HDFS racks in table, JSON, CSV, or list format.
--no-footer | -z
Do not display table summary footer information.
--no-header | -a
Do not display headers in CSV or table output format.
--verbose | -v
Displays more detailed information.
--zone <string>
Specifies the access zone. The system displays all virtual racks in the specified zone.
Syntax
Options
<rack-name>
Specifies the name of the virtual HDFS rack to view. Each rack name begins with a forward slash—for
example, /example-name.
--zone <string>
Specifies the access zone that contains the virtual rack you want to view.
Syntax
Options
--enabled <boolean>
Enable the HDFS Ranger plug-in.
--policy-manager-url <string>
The scheme, host name, and port of the Apache Ranger server. For example:
http://ranger.com:6080 or https://ranger.com:6182
--repository-name <string>
The HDFS repository name hosted on the Apache Ranger server.
--verbose | -v
Display more detailed information.
--zone <string>
The access zone containing the HDFS repository.
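A hypothetical invocation, assuming this section documents the isi hdfs ranger-plugin settings modify command (the URL and repository name are illustrative):

isi hdfs ranger-plugin settings modify --enabled=yes --policy-manager-url=http://ranger.com:6080 --repository-name=hadoop --zone=zone1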
Syntax
Options
--zone <string>
The access zone containing the HDFS repository.
Syntax
Options
--ambari-metrics-collector <string>
The host name for the metrics collector. The value must be a resolvable hostname, FQDN, IPv4 or IPv6
address.
--ambari-namenode <string>
A point of contact in the access zone that Hadoop services managed through the Ambari interface
should connect through. The value must be a resolvable IPv4 address or a SmartConnect zone name.
--ambari-server <string>
The Ambari server that receives communication from an Ambari agent. The value must be a resolvable
hostname, FQDN, IPv4 or IPv6 address.
--authentication-mode {simple_only | kerberos_only}
The authentication method used for HDFS connections through the specified access zone. The default
value for authentication-mode is simple_only for OneFS 8.2.1 and later versions.
--data-transfer-cipher {none | aes_128_ctr | aes_192_ctr | aes_256_ctr}
The Advanced Encryption Standard (AES) cipher to use for wire encryption.
Syntax
Options
--zone <string>
Specifies the access zone. The system will display the HDFS settings for the specified zone.
Topics:
• HDFS components
• Using Hadoop with PowerScale
HDFS components
HDFS components include the Ambari Management Pack for OneFS and third-party components such as Ambari and Cloudera
Manager.
Ambari
The Ambari components you use depend on your version of Ambari and the Hortonworks Data Platform (HDP).
The Ambari agent client and server framework applies through Ambari 2.6. The Ambari Management Pack for OneFS applies to Ambari 2.7.1.0 with HDP 3.0.1.0 and later on OneFS 8.1.2 and later.
Ambari agent
The Apache Ambari client and server framework, as part of the Hortonworks Data Platform (HDP), is an optional third-party tool
that enables you to configure, manage, and monitor a Hadoop cluster through a browser-based interface. This section applies
only to the OneFS Ambari agent through Ambari 2.6.
The OneFS Ambari agent is configured per access zone. You can configure the Ambari agent in any access zone that contains
HDFS data. To start the Ambari agent in an access zone, you must specify the IPv4 address of the external Ambari server and
the address of a NameNode. The NameNode acts as the point of contact for the access zone.
The Apache Ambari server receives communications from the Ambari agent. Once the Ambari agent is assigned to the access
zone, it registers with the Ambari server. The agent then provides heartbeat status to the server. The Ambari server must be a
resolvable hostname, FQDN, or IPv4 address and must be assigned to an access zone.
The NameNode is the designated point of contact in an access zone that Hadoop services manage through the Ambari
interface. For example, if you manage services such as YARN or Oozie through the Ambari agent, the services connect to the
access zone through the specified NameNode. The Ambari agent communicates the location of the designated NameNode to
the Ambari server and to the Ambari agent. If you change the designated NameNode address, the Ambari agent updates the
Ambari server. The NameNode must be a valid SmartConnect zone name or an IP address from the IP address pool that is
associated with the access zone.
NOTE: The specified NameNode value maps to the NameNode, secondary NameNode, and DataNode components on the
OneFS Ambari agent.
The OneFS Ambari agent is based on the Apache Ambari framework and is compatible with multiple Ambari server versions. For
a complete list of supported versions, see the Hadoop Distributions and Products Supported by OneFS page.
Configure Ambari agent settings (Web UI)
1. Click Protocols > Hadoop (HDFS) > Settings.
2. From the Current Access Zone list, select the access zone in which you want to enable Ambari server settings.
3. From the Ambari Server Settings area, in the Ambari Server field, type the name of the external Ambari server that
communicates with the Ambari agent.
The value must be a resolvable hostname, FQDN, IPv4, or IPv6 address.
4. In the Ambari NameNode field, designate the SmartConnect FQDN or IP address of the access zone where the HDFS data
resides on the cluster.
The IP address must belong to an IP address pool that is associated with the access zone. IPv6 addresses are not supported.
5. In the ODP Version field, specify the version of the Open Data Platform (ODP) stack repository, including build number if
one exists, installed by the Ambari server.
The ODP version is required to support ODP upgrades on other systems that are part of the Hadoop cluster.
6. Click Save Changes.
To specify the name of the external Ambari host where the Ambari Metrics Collector component is installed using the Web UI:
1. Click Protocols > Hadoop (HDFS) > Settings.
2. In the Ambari Metrics Collector field, specify the name of the external Ambari host where the Ambari Metrics Collector
component is installed.
The value must be a resolvable hostname, FQDN, IPv4, or IPv6 address.
3. Click Save Changes.
To view the Ambari metrics, follow the steps that are outlined in the PowerScale OneFS with Hadoop and Hortonworks
Installation Guide.
Cloudera
Cloudera Navigator
OneFS provides support for Cloudera's Navigator application with the release of OneFS 8.1.2 and later versions.
OneFS supports the following data management tasks in Cloudera Navigator:
● Browse and search data: Find the owner and the creation and modification dates, and understand data origins and history.
● Lineage and provenance: Track data from its source and monitor downstream dependencies.
● Discovery and exploration: Add, review, and update metadata on the objects contained in the Hadoop data store.
● Custom metadata and tagging: Add custom tags and information to data and objects in HDFS.
The Cloudera Navigator Data Management component is a comprehensive data governance and stewardship tool available to
supplement Cloudera's distribution including Apache Hadoop (CDH). Navigator recognizes HDFS, Yarn, Impala, and Hive as
sources of data that it can manage. It extracts information from these services to provide additional insight into how data was
created and managed—and when and by whom it was changed—using metadata and job history along with HDFS data feeds.
The primary use of Navigator is data governance to monitor and track data in an HDFS workflow. One of the unique challenges
with very large data sets is being able to track and monitor how data moves through the data analytics workflow. A key
Navigator feature is the ability to link input and output data through analytics jobs like MapReduce, or through data transformations on table-based data in Hive or Impala databases. Navigator then analyzes the metadata and job history and links them together to generate lineage.
Traditional HDFS metadata management
In a traditional Direct Attached Storage (DAS) Hadoop with NameNode (NN) deployment of HDFS, the NameNode's main role
is to store all the metadata of the underlying data blocks: the HDFS namespace, directory structures, file permissions, and block
IDs to files. While this data is held in memory for operational use, it is critical that this data is persisted to disk for recovery and
fault tolerance.
In traditional HDFS, this metadata is stored in two ways:
● FSImage (a binary image file accessed through an HTTP end point)
● INotify stream (an ordered JSON edit log retrieved through HDFS RPCs)
The FSImage image file is a complete point-in-time representation of the HDFS file system metadata. The FSImage file is used
on NameNode startup to load the metadata into memory. Because it is inefficient at handling incremental updates, all of the
modifications to the HDFS file system are recorded in a transaction log (INotify stream) rather than frequently rewriting the
FSImage file. This provides the NameNode with a number of capabilities, and modifications can be tracked without having to
constantly regenerate the FSImage file. In the event of a NameNode restart, the combination of the latest FSImage and INotify
log can be used to provide an accurate view of the file system at any point in time.
Eventually, the HDFS cluster must consolidate all INotify log entries with the old FSImage into a new, updated FSImage file that provides a current point-in-time representation of the file system. This is known as checkpointing and is a resource-intensive operation. During checkpointing, the NameNode has to
restrict user access to the system, so instead of restricting access to the active NameNode, HDFS offloads this operation to
the Secondary NameNode (SN)—or to a standby NameNode—when operating in high availability (HA) mode. The secondary
NameNode handles the merge of existing FSImage and INotify transaction logs and generates a new complete FSImage for the
NameNode. At this time, the latest FSImage can be used in conjunction with the new INotify log files to provide the current
file system. It is important that checkpoints occur; otherwise, on a NameNode restart, the NameNode must reconstruct the entire HDFS metadata from the available FSImage and all INotify logs. This can take a significant amount of time, and the HDFS file system
will be unavailable while this occurs.
Cloudera Navigator metadata management
The Navigator metadata service accesses data in a number of ways, such as Yarn application logs, Hive and Impala applications,
and HDFS metadata through polling of the FSImage file and INotify transaction logs. It collects all of this information and stores
it within Apache Solr databases on the Hadoop cluster. Navigator then runs additional extractions and analytics to create the
data that you can view in Navigator. The ability to collect the underlying HDFS metadata from FSImage and INotify is critical to
Navigator's ability to view the file system and is why, up until the release of OneFS 8.1.1, OneFS Hadoop clusters were unable to
provide HDFS file system data to Navigator.
Navigator's primary function is to read an initial FSImage and then use the INotify logs to gain access to all file system updates
that have occurred. It is possible under specific situations that Navigator is required to refresh its data from a full FSImage
rather than leveraging the INotify log, but this does not occur normally.
It is important to recognize that Navigator data is not real-time; it periodically updates the data through polling and extraction
to create the data reviews. This behavior is consistent with both DAS and OneFS deployments and is how Cloudera Navigator is
designed to operate.
OneFS support for Cloudera Navigator
The OneFS approach to handling file system allocation, block location, and metadata management is fundamentally different from how a traditional Apache-based HDFS file system manages its data and metadata. When OneFS is integrated into a Hadoop cluster, the storage file system it provides to the Hadoop cluster is OneFS, not an HDFS-based file system. Its layout and protection scheme are fundamentally different from HDFS, and so is its management of metadata and blocks.
Since OneFS is not a NameNode-based HDFS file system—and no NameNode is present in the Hadoop cluster—the OneFS file
system presents NameNode- and DataNode-like functionality to the remote Hadoop cluster through the HDFS service. OneFS does not rely on FSImage- and INotify-based metadata management for HDFS data. To support native OneFS capabilities and enterprise features for Hadoop, and to provide multiprotocol access, OneFS presents its underlying file system to the HDFS protocol for Hadoop access. Therefore, prior to OneFS 8.1.1, OneFS could not provide an FSImage and INotify log for consumption.
With the release of OneFS 8.1.1 and later versions, OneFS integrates with Cloudera Navigator by enabling an FSImage and
INotify log file on OneFS in an HDFS access zone. By enabling an HDFS Hadoop access zone root for FSImage and INotify
integration, you are, in effect, telling OneFS to create an FSImage file and start tracking HDFS file system events in an INotify
log file, thereby making that data available for consumption by Navigator. Once enabled, OneFS effectively begins to mimic the
behavior of a traditional NameNode deployment, and an FSImage file is generated by OneFS. All HDFS file system operations are
logged into an INotify stream.
Periodically OneFS will regenerate a new FSImage, but this operation is not true checkpointing or merging of the INotify log
as performed on an HDFS NameNode, because the actual file system and operations are still handled by the core OneFS file
system. The FSImage and INotify logs are generated by OneFS to provide the required data to Cloudera Navigator in the
required format.
The FSImage regeneration job runs daily to recreate a current FSImage which—combined with the current INotify logs—will
represent the current state of data and metadata in the HDFS root from an HDFS perspective.
OneFS is a multi-protocol file system, which provides unified access to its data through many protocols, including HDFS, NFS,
SMB, and others. Since only HDFS file system operations are captured by the INotify log, Navigator will only initially see this
metadata; any metadata created in the HDFS data directories through NFS or SMB is not included in the INotify stream.
However, on regeneration of an FSImage, these files will be included in the current FSImage, and Navigator will see them the
next time it uses a later refreshed FSImage. Since Navigator's primary method of obtaining updated metadata is based on
INotify logs, it may take some time before non-HDFS-originating data is included. This is expected behavior and should be taken
into account if multiprotocol workflows are in use.
Using Navigator with OneFS
In order to enable Navigator integration, both FSImage and INotify need to be enabled on the HDFS access zone within OneFS.
Once enabled, they should not be disabled unless the use of Navigator is to be permanently discontinued.
Do not enable FSImage and INotify on zones that do not use Navigator, as these features add unnecessary overhead. Within OneFS, the FSImage and INotify features are access zone-aware and should be enabled only on Hadoop-enabled access zones that Navigator monitors; enabling them on an unmonitored zone adds overhead for a feature that is not being consumed.
No additional configuration changes are required within Cloudera Manager or Navigator to enable integration. When integration
is initially enabled, it will take some time for the initial HDFS data to become visible within Navigator and additional time is
needed to generate linkage. As new data is added, it will show up in Navigator and will be linked based on the polling and
extraction period within Navigator.
Additionally, note the following:
● You can enable FSImage and INotify either through the command line interface or through the web administration interface.
● Once FSImage and INotify are enabled, you must deploy CDH 5.12 or later with Cloudera Navigator. Cloudera deployments prior to CDH 5.12 do not allow Navigator installation.
● Wait approximately an hour until Navigator has gathered information from applications.
● Clusters will need to be sized to accommodate the performance impact of INotify.
● Events are logged to the /var/log/hdfs.log and /var/log/messages files.
● You should avoid disabling INotify—or toggling INotify and FSImage off and on—as these are destructive actions in Cloudera
Navigator and can cause metadata data loss.
● Do not set the FSImage generation interval (the interval between successive FSImages) beyond the INotify retention period (the minimum duration for which edit logs are retained). The INotify minimum retention period must be longer than the FSImage generation interval.
● With INotify enabled there is an expected performance impact for all edit actions over HDFS.
● FSImage generation takes approximately one hour for every three million files.
● To view the data in Navigator, use Yarn, Hive, or another application.
● OneFS 8.1.1 and later releases do not support Cloudera Navigator data audit capabilities.
See HDFS commands for a list of the HDFS commands available from the OneFS command line interface.
For more information about Cloudera Navigator, see:
● Cloudera Navigator documentation hub
● Cloudera Navigator data management
NOTE: A poorly formed policy can have an unintended impact, for example, blocking access.
The repository name is a setting within Apache Ranger. The minimum supported version of Apache Ranger is 0.6.0 because the
Ranger DENY policy is supported only in 0.6.0 and later versions. In version 0.6.0, Apache Ranger changed the name of this
feature to service instance. The service instance is the name of the HDFS service instance within the Apache Ranger Admin UI
used as the repository name.
If you have a Kerberos-enabled cluster, follow the instructions in the Hortonworks Security Guide to enable the Ranger HDFS
plugin on the cluster.
1. Click Protocols > Hadoop (HDFS) > Ranger Plugin Settings.
2. In the Ranger Plugin settings area, select Enable Ranger Plugin.
3. In the Policy manager URL field, type the URL that points to the location of the Policy Manager.
4. In the Repository name field, type the name of the HDFS repository.
5. Click Save Changes.
Compatibility information
● Hadoop Distributions and Products Supported by OneFS
Cloudera
● PowerScale OneFS with Hadoop and Cloudera Installation Guide (PDF)
Cloudera with Kerberos
● PowerScale OneFS with Hadoop and Cloudera Kerberos Installation Guide (PDF)