Acropolis Advanced
Administration Guide
March 24, 2023
Contents
Overview.......................................................................................................................... 4
Introduction to AOS................................................................................................................................................. 4
Nutanix AOS Architecture......................................................................................................................... 8
Key Components........................................................................................................................................................8
Built-in AHV Virtualization.........................................................................................................................8
Platform Services...........................................................................................................................................9
Enterprise Storage Capabilities............................................................................................................... 9
Networking Services.................................................................................................................................... 9
AOS Lifecycle Management.................................................................................................................................. 9
Documentation References..................................................................................................................................10
Cluster Management.................................................................................................23
Performing Initial Configuration........................................................................................................................ 23
Controller VM Access............................................................................................................................................ 25
Admin User Access to Controller VM.................................................................................................26
Nutanix User Access to Controller VM.............................................................................................. 28
Controller VM Password Complexity Requirements..................................................................... 29
Cluster Operations..................................................................................................................................................30
Stopping a Cluster..................................................................................................................................... 30
Node Removal............................................................................................................................................... 31
Starting a Nutanix Cluster.......................................................................................................................32
Destroying a Cluster.................................................................................................................................. 33
Fingerprinting Existing vDisks........................................................................................................................... 35
IPv6 Enablement in a Cluster............................................................................................................................ 35
IPv6 Architecture........................................................................................................................................ 36
IPv6 Configuration......................................................................................................................................36
Logs................................................................................................................................. 47
Send Logs to Remote Syslog Server..............................................................................................................47
Configuring the Remote Syslog Server Settings........................................................................... 48
Common Log Files.................................................................................................................................................. 51
Nutanix Logs Root.......................................................................................................................................51
Self-Monitoring (sysstats) Logs..............................................................................................................51
/home/nutanix/data/logs/cassandra..................................................................................................52
Consolidated Audit Logs......................................................................................................................... 52
Controller VM Log Files........................................................................................................................... 53
Correlating the FATAL log to the INFO file................................................................................................. 56
Stargate Logs............................................................................................................................................................57
Cassandra Logs........................................................................................................................................................ 58
Prism Gateway Log................................................................................................................................................ 59
Zookeeper Logs.......................................................................................................................................................60
Genesis.out.................................................................................................................................................................60
Diagnosing a Genesis Failure..................................................................................................................61
ESXi Log Files...........................................................................................................................................................62
Nutanix Calm Log Files........................................................................................................................................ 63
Copyright.......................................................................................................................87
OVERVIEW
AOS is the operating system of the Nutanix Controller VM–the VM that runs in the hypervisor
to provide Nutanix-specific functionality. It provides core functionality used by workloads
and services running on the platform. It contains several data services and features for data
protection, space efficiency, scalability, automated data tiering, and security.
AOS is a back-end service that allows for workload and resource management, provisioning,
and operations. Its goal is to abstract the underlying resources (for example, the hypervisor,
on-premises infrastructure, and cloud-based infrastructure) and give workloads the ability to move
seamlessly between hypervisors, cloud providers, and platforms.
The AOS Advanced Administration Guide provides an introduction to AOS and its key features.
It covers advanced topics and tasks that you can perform within the system, and references that
describe those tasks.
This information is for experienced Windows or Linux system administrators who are familiar
with virtualization.
Introduction to AOS
AOS is the base operating system, the so-called data plane, that packages (encapsulates)
the run time of storage, compute, security, and networking. It is a hardened operating system
with an artificial-intelligence core. AOS runs on top of the Nutanix native hypervisor (AHV), or other
hypervisors such as VMware vSphere ESXi and Microsoft Hyper-V. AOS creates Nutanix
Acropolis clusters on the hypervisor to control and virtualize the storage of each node as storage
pools, containers, and Volumes, using a direct-path I/O (PCI passthrough) mechanism.
AOS is installed as a Controller Virtual Machine (CVM) atop a hypervisor to manage everything
in a Nutanix cluster.
AOS provides data services and consists of three foundational components–the Distributed
Storage Fabric (DSF), the App Mobility Fabric (AMF), and AHV.
• Distributed Storage Fabric (DSF)
DSF pools the flash and hard disk drive storage across the cluster and presents it as a
datastore to the hypervisor. DSF exposes various storage protocols (SMB, NFS, and iSCSI) with
no single point of failure. The datastore appears to the hypervisor as centralized storage, but the
I/O is managed locally to provide high performance.
DSF consists of the following:
• A Storage Pool, which is a logical pool of physical devices, including SSD and HDD
devices, for the cluster. A storage pool can span multiple Nutanix nodes and expands with the
cluster. In most configurations, only one storage pool is necessary.
• A Container, which is a logical segmentation of a storage pool and consists of a group of
VMs or vDisks. Containers are important because features such as redundancy factor are configured
at the container level and applied at the VM or file level.
• A vDisk, which is any file larger than 512 KB on DSF, including VMDKs and VM hard disks.
There are no artificial limits on vDisk size, and the theoretical maximum size of a vDisk
is 9 exabytes (1 exabyte = 1 billion gigabytes). vDisks are logically composed of vBlocks.
• A vBlock, which is a 1 MB chunk of virtual address space on a vDisk. Each vBlock is
mapped to an extent in the vDisk.
• An Extent, which is a 1 MB piece of logically contiguous data that consists of a number of
contiguous blocks.
• An Extent Group, which is a 1 MB or 4 MB piece of physically contiguous stored data. This
data is stored as a file on the storage device owned by the CVM. Extents are dynamically
distributed among extent groups to provide data striping across nodes and disks to
improve performance.
Note: The 1 MB and 4 MB extent groups are used for deduplicated and non-deduplicated data,
respectively.
DSF divides user data at a fundamental level into extents. It can compress, erasure-code,
snapshot, or deduplicate this user data. Compression can reduce the size of an extent
from 1 MB to a few kilobytes.
Data Extents can move around. Data that is accessed frequently (Hot Data) is moved to the
SSD tier in the DSF and data that is not accessed as frequently (Cold Data) is moved to the
HDD. This DSF capability is called Intelligent Tiering.
Extents are also stored on the nodes on which the guest VM is running.
DSF has enterprise-grade features such as Performance Acceleration, Capacity Optimization,
Data Protection, and Disaster Recovery.
• Performance Acceleration. DSF uses the following key capabilities for performance
acceleration:
• Intelligent Tiering.
DSF continuously and automatically monitors data access patterns and then optimizes
data placement. DSF moves data intelligently between the SSD and HDD tiers to
provide optimal performance without requiring administrator intervention.
• Data Locality.
This capability refers to the storage of VM data on the node where the VM is running.
It ensures that read I/O does not have to go through the network, which optimizes
performance and reduces network congestion. When the VM is moved from one node to
another, whether by vMotion, live migration, or an HA event, the migrated VM's data is
also moved to ensure data locality.
• Automatic Disk Balancing.
This capability ensures data is distributed uniformly across the cluster. Any node in
the Nutanix cluster can use storage resources across the cluster, so manual
rebalancing is unnecessary. Automatic disk balancing reacts to changing workloads,
and once utilization reaches a set threshold, disk balancing keeps utilization uniform among
the nodes.
• Capacity Optimization. DSF provides deduplication, compression, and erasure coding for
storage optimization.
• Deduplication.
There are two types of deduplication: performance-tier (inline) and post-process (MapReduce).
Performance-tier deduplication removes duplicate data inline in the content cache to
reduce the footprint of the application's working set. Post-process deduplication reduces
repetitive data in the capacity tier to increase the effective storage capacity of a cluster.
• Compression.
This capability consists of inline and post-process compression, which are
intelligently applied based on sequential or random access patterns to ensure
optimal performance.
• Erasure Coding.
Provides resilience and increases usable capacity by up to 75%. Erasure coding encodes
a strip of data blocks across different nodes and calculates parity. If a node or disk fails,
the parity is used to calculate the missing data blocks.
• Data Protection. It is integrated at the VM level and ensures continuous availability of data.
Depending on the recovery time objective (RTO) and recovery point objective (RPO),
data protection features include Time Stream and Cloud Connect for minor incidents, and
Async, NearSync, and Sync replication for major incidents.
Data protection can create local, metadata-based snapshots with VM- and application-level
consistency. Because these snapshots are metadata based, data protection requires minimal
disk overhead and ensures high-performance recovery.
• Disaster Recovery. The Nutanix data recovery and replication capabilities are built on
snapshot technology. VM snapshots can be asynchronously replicated or backed up to
another data center based on a user-defined schedule. Replication topologies are flexible
and bidirectional; disaster recovery can use one-to-one, one-to-many, and many-to-many
deployments. During replication, data is compressed and replicated at the sub-block level
for maximum efficiency and lower WAN bandwidth consumption.
Nutanix offers Metro availability for critical workloads that require zero Recovery Point
Objective (RPO) and near zero Recovery Time Objective (RTO) to ensure continuous data
availability across separate sites in a metro.
Administrators can set up metro availability bi-directionally between two sites connected
over a metro area network. This requires a round trip latency of less than 5 milliseconds.
Data is written synchronously to both sides and is always available to the applications
in the event of a site failure or when a site undergoes maintenance. VMs can be
nondestructively migrated between sites for planned maintenance events and other needs.
• App Mobility Fabric (AMF)
This is a collection of technologies that allows applications and data to move freely between
runtime environments. AMF is an open environment capable of delivering powerful virtual
machine (VM) placement, VM migration, VM conversion, cross hypervisor High Availability,
and integrated disaster recovery.
AMF supports most virtualized applications and provides a more seamless path to containers
and hybrid cloud computing.
For more information, see Prism Web Console Guide.
The Nutanix solution does not require SAN constructs, such as LUNs, RAID groups, or expensive
storage switches. All storage management is VM-centric, and I/O is optimized at the VM virtual
disk level. The software solution runs on nodes from a variety of manufacturers that are either
all-flash for optimal performance, or a hybrid combination of SSD and HDD that provides a
combination of performance and additional capacity.
The Distributed Storage Fabric (DSF) automatically tiers data across the cluster to different
classes of storage devices using intelligent data placement algorithms. These algorithms ensure
that the most frequently used data is available in memory or in flash on the node local to the VM.
Key Components
Nutanix Acropolis has five key components that make it a complete solution for delivering any
infrastructure service:
• Broad Ecosystem Support (Certified Citrix Ready, Microsoft Validated via SVVP).
Platform Services
Nutanix AOS delivers a comprehensive set of software-defined platform services so that IT
organizations can consolidate all their workloads on the Nutanix platform and manage them
centrally.
These include the following:
• Performance acceleration capabilities such as caching, data tiers, and data locality.
• Storage optimization technologies, such as Deduplication, Compression and Erasure Coding.
• Data-at-Rest Encryption, supporting both Self-encrypting drives (SED) with KMS and
Software-only Encryption.
• Data protection technologies to support snapshots to local, remote and cloud based sites.
• Disaster Recovery features including synchronous and asynchronous replication.
Networking Services
Nutanix AOS provides a comprehensive set of services to visualize the network, automate
common network operations and, in the near future, secure the network through native services
and partner integration. These services include, but are not limited to:
You can access the LCM framework using the Prism interface. All communication between the
cluster and LCM modules goes through the LCM framework. To view the LCM dashboard, select
LCM from the pull-down list on the left of the main menu. For more information, see the Life Cycle
Manager Guide.
To use LCM at a location that has internet access, see Life Cycle Manager Guide.
To use LCM at a location without internet access, see Life Cycle Manager Dark Site Guide.
Documentation References
Refer to the guides in this table for additional information required for AOS administration.
• Getting Started Guide: This guide provides the summary information required to get the
Nutanix Acropolis system up and running.
• Nutanix Rack Mounting Guide: This guide provides information about mounting a block,
based on various models.
• Field Installation Guide: This guide provides information about using Foundation for
deploying a node and creating a cluster automatically. It allows you to configure a
pre-imaged node, or image a node with AOS and a hypervisor of your choice. It also
allows you to form a cluster out of nodes whose hypervisor and AOS versions are the
same, with or without reimaging.
• vSphere Administration Guide for Acropolis (using vSphere HTML Client): This guide
describes how to configure and manage the Nutanix cluster in vSphere.
• Hyper-V Administration for Acropolis: This guide describes how to configure and manage
the Nutanix cluster in Hyper-V.
• Data Protection and Recovery with Prism Element: This guide describes the concepts and
procedures for configuring data protection using protection domains. You can configure
data protection using protection domains in Prism Element (in other words, by signing in to
an individual cluster through its web console).
• Getting Started with Nutanix Community Edition: This guide provides information about the
Nutanix Community Edition, a free version of Nutanix AOS, which powers the Nutanix
enterprise cloud platform. The Community Edition of AOS is designed for people interested
in test-driving its main features on their own test hardware and infrastructure. As stated in
the end user license agreement, Community Edition is intended for internal business
operations and non-production use only.
• Nutanix Cluster Check Guide: This guide provides information about Nutanix Cluster Check
(NCC), a cluster-resident software that diagnoses cluster health and identifies configurations
qualified and recommended by Nutanix. NCC continuously and proactively runs hundreds of
checks and takes the needed action towards issue resolution. Depending on the issue
discovered, NCC raises an alert or automatically creates Nutanix Support cases. NCC can be
run provided that the individual nodes are up, regardless of cluster state.
REMOTE CONSOLE IP ADDRESS
CONFIGURATION
The Intelligent Platform Management Interface (IPMI) is a standardized interface used to
manage a host and monitor its operation. To enable remote access to the console of each host,
you must configure the IPMI settings within BIOS.
The Nutanix cluster provides a Java application to remotely view the console of each node, or
host server. You can use this console to configure additional IP addresses in the cluster.
The procedure for configuring the remote console IP address is slightly different for each
hardware platform.
Refer to the Third-Party Platform section for more information about configuring non-OEM
platforms.
Procedure
2. Restart the node and press Delete to enter the BIOS setup utility.
There is a limited amount of time to enter BIOS before the host completes the restart
process.
4. Press the down arrow key until BMC Network Configuration is highlighted and then press
Enter.
5. Press the down arrow key until Update IPMI LAN Configuration is highlighted and press
Enter to select Yes.
9. Review the BIOS settings and press F4 to save the configuration changes and exit the BIOS
setup utility.
The node restarts.
Note: If you are reconfiguring IPMI address on a node because you have replaced the
motherboard, restart Genesis on the Controller VM only for that node.
Procedure
1. Log on to the hypervisor host with SSH (AHV or vSphere) or remote desktop connection
(Hyper-V).
• For AHV
root@ahv# ipmitool -U ADMIN -P ADMIN lan set 1 ipsrc static
root@ahv# ipmitool -U ADMIN -P ADMIN lan set 1 ipaddr mgmt_interface_ip_addr
root@ahv# ipmitool -U ADMIN -P ADMIN lan set 1 netmask mgmt_interface_subnet_addr
root@ahv# ipmitool -U ADMIN -P ADMIN lan set 1 defgw ipaddr mgmt_interface_gateway
• For vSphere
root@esx# /ipmitool -U ADMIN -P ADMIN lan set 1 ipsrc static
root@esx# /ipmitool -U ADMIN -P ADMIN lan set 1 ipaddr mgmt_interface_ip_addr
root@esx# /ipmitool -U ADMIN -P ADMIN lan set 1 netmask mgmt_interface_subnet_addr
root@esx# /ipmitool -U ADMIN -P ADMIN lan set 1 defgw ipaddr mgmt_interface_gateway
• For Hyper-V
> ipmiutil lan -e -I mgmt_interface_ip_addr -G mgmt_interface_gateway
-S mgmt_interface_subnet_addr -U ADMIN -P ADMIN
Replace mgmt_interface_ip_addr with the IPMI IP address, mgmt_interface_subnet_addr with
the subnet mask, and mgmt_interface_gateway with the gateway IP address. To verify the
settings, run the following command for your hypervisor:
• For AHV
root@ahv# ipmitool -v -U ADMIN -P ADMIN lan print 1
• For vSphere
root@esx# /ipmitool -v -U ADMIN -P ADMIN lan print 1
• For Hyper-V
> ipmiutil lan -r -U ADMIN -P ADMIN
Note: If you are reconfiguring IPMI address on a node because you have replaced the
motherboard, restart Genesis on the Controller VM only for that node.
Note: This is applicable only if you have already created a cluster and want to change the IP
addresses of the CVMs, hypervisor hosts, or IPMI interfaces to a new address scheme or infrastructure.
For more information and cautions about the impact of changing the cluster IP addresses, see
Modifying Cluster Details in the Prism Web Console Guide.
• Before you decide to change the CVM, hypervisor host, and IPMI IP addresses, consider the
possibility of incorporating the existing IP address schema into the new infrastructure by
reconfiguring your routers and switches instead of Nutanix nodes and CVMs. If that is not
possible and you must change the IP addresses of CVMs and hypervisor hosts, proceed with
the procedure described in this document.
• Guest VM downtime is necessary for this change, because the Nutanix cluster must be in a
stopped state. Therefore, plan the guest VM downtime accordingly.
• Verify if your cluster is using the network segmentation feature.
nutanix@cvm$ network_segment_status
Note the following if you are using the network segmentation feature.
• The network segmentation feature enables the backplane network for CVMs in your
cluster (eth2 interface). The backplane network is always a non-routable subnet and/or
VLAN that is distinct from the one which is used by the external interfaces (eth0) of your
CVMs and the management network on your hypervisor. Typically, you do not need to
change the IP addresses of the backplane interface (eth2) if you are updating the CVM or
host IP addresses.
• If you have enabled network segmentation on your cluster, check to make sure that the
VLAN and subnet in-use by the backplane network is still going to be valid once you move
to the new IP scheme. If not, change the subnet or VLAN.
For information about the AOS version and instructions to disable the network
segmentation feature before you change the CVM and host IP addresses, see Disabling
Network Segmentation in the Security Guide. After you have updated the CVM and host
IP addresses by following the steps outlined later in this document, you can proceed to re-
enable network segmentation. Follow the instructions in the Security Guide, to designate
the new VLAN or subnet for the backplane network.
• Use the following CLI commands if network segmentation is enabled on the cluster and
you wish to change both the external CVM eth0 (Management) and internal CVM eth2
(backplane) IP addresses:
Caution: All the features that use the cluster virtual IP address are impacted if you change
that address. For more information, see Virtual IP Address Impact in the Prism Web Console
Guide.
Use the following steps to change the virtual IP address of the cluster.
1. Clear the existing virtual IP address of the cluster.
nutanix@cvm$ ncli cluster clear-external-ip-address
• Replace insert_new_external_ip_address with the new virtual IP address for the cluster.
• You can change the iSCSI data services IP address of the cluster while you are changing the
virtual IP address of the cluster. This IP address is used by Nutanix Volumes and other data
services applications.
Caution: For certain features, changing the external data services IP address can result in
unavailable storage or other issues. The features in question include Volumes, Calm, Leap,
Karbon, Objects, and Files. For more information, see KB 8216 and iSCSI Data Services IP
Address Impact in the Prism Web Console Guide.
• Ensure that the cluster NTP and DNS servers are reachable from the new Controller VM IP
addresses. If you are using different NTP and DNS servers, remove the existing NTP and DNS
servers from the cluster configuration and add the new ones. If you do not know the new
Web Console
• In the gear icon pull-down list, click Name Servers.
nCLI
• ncli> cluster remove-from-name-servers servers="name_servers"
Replace name_servers with a comma-separated list of the name servers to remove.
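The corresponding add operations follow the same pattern; the following is a sketch, assuming the standard nCLI counterparts for name servers and NTP servers are available in your AOS version:
ncli> cluster add-to-name-servers servers="name_servers"
ncli> cluster remove-from-ntp-servers servers="ntp_servers"
ncli> cluster add-to-ntp-servers servers="ntp_servers"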
• Log on to a Controller VM in the cluster and check that all hosts are part of the metadata
store.
nutanix@cvm$ ncli host ls | grep "Metadata store status"
For every host in the cluster, Metadata store enabled on the node is displayed.
Warning: If Node marked to be removed from metadata store is displayed, do not proceed
with the IP address reconfiguration, and contact Nutanix Support to resolve the issue.
Warning: If you are using distributed switches in your ESXi clusters, migrate the distributed
switches to standard switches before you perform any IP address reconfiguration procedures that
involve changing the management VMkernel port of the ESXi host to a different distributed port
group that has a different VLAN.
• If you are changing the IP address of the CVM to a subnet that is not routable from the
current subnet, contact Nutanix Support for assistance.
• If you are using the network segmentation feature on your cluster and you want to
change the IP addresses of the backplane (eth2) interface. For instructions about
how to change the IP addresses of the backplane (eth2) interface, see Reconfiguring
the Backplane Network in the Security Guide.
Note: Check the connectivity between CVMs and hosts; that is, all the hosts must be
reachable from all the CVMs and vice versa before you perform step 4. If any CVM or host is
not reachable, contact Nutanix Support for assistance.
Warning: If you are changing the Controller VM IP addresses to another subnet, network, IP
address range, or VLAN, you must also change the hypervisor management IP addresses to the
same subnet, network, IP address range, or VLAN.
If you have configured a data services IP address for guest VMs that use iSCSI volumes
and you are changing the IP addresses of the CVMs to a different subnet, you must
change the data services IP address to that subnet. After you change the data services
IP address, update the guest VM iSCSI client configuration with the new data services IP
address.
For instructions about how to change the IP address of an AHV host, see Changing the IP
Address of an AHV Host in the AHV Administration Guide.
For instructions about how to change the IP address of an ESXi host, see Changing a Host IP
Address in the vSphere Administration Guide for Acropolis.
For instructions about how to change the IP address of a Hyper-V host, see Changing a Host IP
Address in the Hyper-V Administration for Acropolis guide.
1. Log on to the hypervisor using SSH (AHV or vSphere), remote desktop connection (Hyper-
V), or the IPMI remote console.
If you are unable to reach the IPMI IP addresses, reconfigure by using the BIOS or hypervisor
command line.
For using BIOS, see Configuring the Remote Console IP Address (BIOS).
For using the hypervisor command line, see Configuring the Remote Console IP Address
(Command Line).
Warning: This step affects the operation of a Nutanix cluster. Schedule downtime before
performing this step.
If you are using VLAN tags on your CVMs and on the management network for your
hypervisors and you want to change the VLAN tags, then stop the cluster and make the
changes mentioned in the following guides.
For information about assigning VLANs to hosts and the Controller VM, see the indicated
documentation:
• For AHV—See Assigning an AHV Host to a VLAN and Assigning the Controller VM to a
VLAN in the AHV Administration Guide.
• For ESXi—See Configuring Host Networking (Management Network) in the vSphere
Administration Guide for Acropolis for instructions about tagging a VLAN on an ESXi host
by using DCUI.
Note: If you are relocating the cluster to a new site, the external_ip_reconfig script works
only if all the CVMs are up and accessible with their old IP addresses. Otherwise, contact
Nutanix Support to manually change the IP addresses.
After you have stopped the cluster, shut down the CVMs and hosts and move the cluster.
Proceed with step 4 only after you start the cluster at the desired site and you have
confirmed that all CVMs and hosts can SSH to one another. As a best practice, ensure that
the out-of-band management Remote Console (IPMI, iDRAC, and ILO) is accessible on each
node before you proceed further.
Verify that upstream networking is configured to support the changes to the IP address
schema.
For example, check the network load balancing or LACP configuration to verify that it
supports the seamless transition from one IP address schema to another.
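The address change itself is driven by the external_ip_reconfig script mentioned earlier. The following is a minimal sketch of invoking it from any Controller VM, assuming the script is available on the CVM; it then prompts for the values described in the next step:
nutanix@cvm$ external_ip_reconfig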
5. Follow the prompts to type the new netmask, gateway, and external IP addresses.
On successful completion of the script, the system displays output similar to the following:
External IP reconfig finished successfully. Restart all the CVMs and start the cluster.
Note: If you have changed the CVMs to a new subnet, you must now update the IP addresses
of hypervisor hosts to the new subnet. Change the hypervisor management IP address or IPMI
IP address before you restart the Controller VMs.
7. After you turn on every CVM, log on to each CVM and verify if the IP address has been
successfully changed.
Note: It can take up to 10 minutes for the CVMs to show the new IP addresses after they are
turned on.
Note: If you see any of the old IP addresses in the following commands or the commands fail
to run, stop and contact Nutanix Support for assistance.
c. From any one CVM in the cluster, verify that the following outputs show the new IP
address scheme and that the Zookeeper IDs are mapped correctly.
Note: Never edit the following files manually. Contact Nutanix Support for assistance.
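As a quick complementary check of the new addressing, you can also list the Controller VM and host IP addresses with the standard CVM helper commands (a sketch; svmips and hostips are assumed to be available on the CVM):
nutanix@cvm$ svmips
nutanix@cvm$ hostips
Both outputs should contain only addresses from the new scheme.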
If the cluster starts properly, output similar to the following is displayed for each node in the
cluster:
CVM:host IP-Address Up
Zeus UP [9935, 9980, 9981, 9994, 10015, 10037]
Scavenger UP [25880, 26061, 26062]
Xmount UP [21170, 21208]
SysStatCollector UP [22272, 22330, 22331]
IkatProxy UP [23213, 23262]
IkatControlPlane UP [23487, 23565]
SSLTerminator UP [23490, 23620]
SecureFileSync UP [23496, 23645, 23646]
Medusa UP [23912, 23944, 23945, 23946, 24176]
DynamicRingChanger UP [24314, 24404, 24405, 24558]
Pithos UP [24317, 24555, 24556, 24593]
InsightsDB UP [24322, 24472, 24473, 24583]
Athena UP [24329, 24504, 24505]
Mercury UP [24338, 24515, 24516, 24614]
Mantle UP [24344, 24572, 24573, 24634]
VipMonitor UP [18387, 18464, 18465, 18466, 18474]
Stargate UP [24993, 25032]
InsightsDataTransfer UP [25258, 25348, 25349, 25388, 25391, 25393,
25396]
Ergon UP [25263, 25414, 25415]
Cerebro UP [25272, 25462, 25464, 25581]
Chronos UP [25281, 25488, 25489, 25547]
Curator UP [25294, 25528, 25529, 25585]
Prism UP [25718, 25801, 25802, 25899, 25901, 25906,
25941, 25942]
CIM UP [25721, 25829, 25830, 25856]
AlertManager UP [25727, 25862, 25863, 25990]
Arithmos UP [25737, 25896, 25897, 26040]
Catalog UP [25749, 25989, 25991]
Acropolis UP [26011, 26118, 26119]
Uhura UP [26037, 26165, 26166]
Snmp UP [26057, 26214, 26215]
NutanixGuestTools UP [26105, 26282, 26283, 26299]
MinervaCVM UP [27343, 27465, 27466, 27730]
ClusterConfig UP [27358, 27509, 27510]
Aequitas UP [27368, 27567, 27568, 27600]
APLOSEngine UP [27399, 27580, 27581]
APLOS UP [27853, 27946, 27947]
Lazan UP [27865, 27997, 27999]
Delphi UP [27880, 28058, 28060]
Flow UP [27896, 28121, 28124]
Anduril UP [27913, 28143, 28145]
XTrim UP [27956, 28171, 28172]
ClusterHealth UP [7102, 7103, 27995, 28209,28495, 28496,
28503, 28510,
28573, 28574, 28577, 28594, 28595, 28597, 28598, 28602, 28603, 28604, 28607, 28645, 28646,
28648, 28792,
28793, 28837, 28838, 28840, 28841, 28858, 28859, 29123, 29124, 29127, 29133, 29135, 29142,
29146, 29150,
29161, 29162, 29163, 29179, 29187, 29219, 29268, 29273]
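You can pull the same per-node service summary again at any later point to confirm that all services remain UP; the following is a sketch using the standard cluster CLI from any Controller VM:
nutanix@cvm$ cluster status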
Note: You can manually configure a CVM IP address only when a cluster is not yet created.
Perform the following steps to manually configure a static IP address for the Controller VM:
Procedure
2. Update the NETMASK, IPADDR, BOOTPROTO, and GATEWAY parameters in the script.
NETMASK=xxx.xxx.xxx.xxx
IPADDR=xxx.xxx.xxx.xxx
BOOTPROTO=none
GATEWAY=xxx.xxx.xxx.xxx
• IPADDR—Enter the appropriate static IP address (assigned by your IT department) for the
Controller VM.
• BOOTPROTO—Enter none.
If you employ DHCP, change the value from dhcp to none. Only a static address is allowed;
DHCP is not supported.
• GATEWAY—Enter the IP address for your gateway.
Note: Carefully check the file to ensure that there are no syntax errors, whitespace at the end
of lines, or blank lines in the file.
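For illustration only, the edited interface stanza typically ends up looking like the following example; the file path is an assumption based on the standard CentOS layout of the Controller VM, and the address values are placeholders, so verify both in your environment before editing:
nutanix@cvm$ sudo vi /etc/sysconfig/network-scripts/ifcfg-eth0
NETMASK=255.255.255.0
IPADDR=10.1.1.50
BOOTPROTO=none
GATEWAY=10.1.1.1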
Note: Prism Element is a built-in service on every Nutanix cluster. Each Nutanix cluster
deployment has a unique Prism Element instance for local management. Multiple clusters are
managed via Prism Central.
Managing a Nutanix cluster involves configuring and monitoring the entities within the cluster,
including virtual machines, storage containers, and hardware components. You can manage a
Nutanix cluster through a web-based management console or a command line interface (nCLI).
This topic introduces you to the Prism Web Console, which is a graphical user interface (GUI)
that allows you to monitor cluster operations and perform a variety of configuration tasks. For
more information, see Web Console Overview in the Prism Web Console Guide.
Use the following steps to perform the initial configuration using Prism Web Console:
Procedure
3. Enable the Remote Support Tunnel if the site security policy allows Nutanix customer
support to access the cluster.
For more information, see Controlling Remote Connections in the Prism Web Console
Guide.
Caution: Failing to enable remote support prevents Nutanix Support from directly
addressing cluster issues. Nutanix recommends that all customers send the Pulse data at
minimum because it allows proactive support of customer issues.
Note: Nutanix Pulse does not gather or communicate any guest VM-specific data, user
data, metadata, or any personally identifiable information such as administrator credentials.
No system-level data from any customer is ever shared with third parties.
5. Add a list of Alert Email recipients, or if the security policy does not allow it, disable alert
emails.
For more information, see Configuring Alert Emails in the Prism Web Console Guide.
8. Run the Life Cycle Manager (LCM) inventory to ensure the LCM framework has the updated
software and firmware version of the entities in the cluster.
For more information, see the Life Cycle Manager Guide.
9. Enable the automatic downloads of upgrade software packages for cluster components if
the site security policy permits.
For more information, see Software and Firmware Upgrades in the Prism Web Console
Guide.
Note: To ensure that automatic download of updates can function, allow access to the
following URLs through your firewall:
• *.compute-*.amazonaws.com:80
• release-api.nutanix.com:80
12. If you are using Microsoft Hyper-V hypervisor on HPE DX platform models, ensure that the
software and drivers on Hyper-V are compatible with the firmware version installed on the
nodes.
For more information, see Deploying Drivers and Software on Hyper-V for HPE DX in the
Field Installation Guide. This procedure to deploy software and drivers is to be carried out
after cluster creation and before moving the nodes to production.
13. Verify that the cluster has passed the latest Nutanix Cluster Check (NCC) tests.
Nutanix Cluster Check (NCC) is a framework of scripts that can help diagnose cluster
health. You can run NCC as long as individual nodes are up, regardless of cluster state. The
scripts run standard commands against the cluster or the nodes, depending on the type of
information being retrieved.
• Check the installed NCC version and update it if a recent version is available. For more
information, see Software and Firmware Upgrades in the Prism Web Console Guide.
• Install the new version of NCC (if you have detected a new version of NCC and have not
installed it yet).
If the check reports a status other than PASS, resolve the reported issues before proceeding.
If you are unable to resolve the issues, contact Nutanix Support for assistance.
• Configure email frequency to allow the cluster to check, run, and email reports at regular
intervals as configured.
For more information, see Configuring Alert Policies in the Prism Web Console Guide.
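A minimal sketch of running the full NCC check suite from any Controller VM, which is a common way to produce the health check results referenced above:
nutanix@cvm$ ncc health_checks run_all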
Controller VM Access
Although each host in a Nutanix cluster runs a hypervisor independent of other hosts in the
cluster, some operations affect the entire cluster.
Most administrative functions of a Nutanix cluster can be performed through the web console
(Prism); however, some management tasks require access to the Controller
VM (CVM) over SSH. Nutanix recommends restricting CVM SSH access with password or key
authentication.
This topic provides information about how to access the Controller VM as an admin user and
nutanix user.
Warning: When you connect to a Controller VM with SSH, ensure that the SSH client does not
import or change any locale settings. The Nutanix software is not localized, and running the
commands with any locale other than en_US.UTF-8 can cause severe cluster issues.
To check the locale used in an SSH session, run /usr/bin/locale. If any environment
variables are set to anything other than en_US.UTF-8, reconnect with an SSH
configuration that does not import or change any locale settings.
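For example, a quick check from within the SSH session (a sketch; on a correctly configured session, every variable reports en_US.UTF-8):
nutanix@cvm$ /usr/bin/locale
LANG=en_US.UTF-8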
Note:
• As an admin user, you cannot access nCLI by using the default credentials. If you are
logging in as the admin user for the first time, you must log on through the Prism web
console or SSH to the Controller VM. Also, you cannot change the default password
of the admin user through nCLI. To change the default password of the admin user, you
must log on through the Prism web console or SSH to the Controller VM.
• When you make an attempt to log in to the Prism web console for the first time after
you upgrade to AOS 5.1 from an earlier AOS version, you can use your existing admin
user password to log in and then change the existing password (you are prompted)
to adhere to the password complexity requirements. However, if you are logging in to
the Controller VM with SSH for the first time after the upgrade as the admin user, you
must use the default admin user password (Nutanix/4u) and then change the default
password (you are prompted) to adhere to the Controller VM Password Complexity
Requirements.
• You cannot delete the admin user account.
When you change the admin user password, you must update any applications and scripts
using the admin user credentials for authentication. Nutanix recommends that you create a user
assigned with the admin role instead of using the admin user for authentication. The Prism Web
Console Guide describes authentication and roles.
Following are the default credentials to access a Controller VM.
Procedure
1. Log on to the Controller VM with SSH by using the management IP address of the Controller
VM and the following credentials.
• User name: admin
• Password: Nutanix/4u
You are now prompted to change the default password.
2. Respond to the prompts, providing the current and new admin user password.
Changing password for admin.
Old Password:
New password:
Retype new password:
Password changed.
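For reference, the login in step 1 is a standard SSH connection to the CVM management address; a minimal sketch with a placeholder IP address:
ssh admin@cvm_management_ip_addr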
Note:
• As a nutanix user, you cannot access nCLI by using the default credentials. If you
are logging in as the nutanix user for the first time, you must log on through the
Prism web console or SSH to the Controller VM. Also, you cannot change the default
password of the nutanix user through nCLI. To change the default password of
the nutanix user, you must log on through the Prism web console or SSH to the
Controller VM.
• When you make an attempt to log in to the Prism web console for the first time after
you upgrade the AOS from an earlier AOS version, you can use your existing nutanix
user password to log in and then change the existing password (you are prompted)
to adhere to the password complexity requirements. However, if you are logging
in to the Controller VM with SSH for the first time after the upgrade as the nutanix
user, you must use the default nutanix user password (nutanix/4u) and then change
the default password (you are prompted) to adhere to the Controller VM Password
Complexity Requirements on page 29.
• You cannot delete the nutanix user account.
• You can configure the minimum and maximum password expiration days based on
your security requirement.
When you change the nutanix user password, you must update any applications and scripts
using the nutanix user credentials for authentication. Nutanix recommends that you create a
user assigned with the nutanix role instead of using the nutanix user for authentication. The
Prism Web Console Guide describes authentication and roles.
Following are the default credentials to access a Controller VM.
Procedure
1. Log on to the Controller VM with SSH by using the management IP address of the Controller
VM and the following credentials.
• User name: nutanix
• Password: nutanix/4u
You are now prompted to change the default password.
2. Respond to the prompts, providing the current and new nutanix user password.
Changing password for nutanix.
Old Password:
New password:
Retype new password:
Password changed.
Note: Ensure that the following conditions are met for the special characters usage in the
CVM password:
• Use special characters appropriately when setting the CVM password. In some
cases, for example when you use ! followed by a number in the CVM password,
the shell interprets it specially and may replace it with a command from the bash
history. In this case, you may end up setting a password string different from the
one you intended.
• Use only ASCII printable characters as special characters in the CVM password.
For information about ASCII printable characters, refer to the ASCII printable
characters (character code 32-127) article on the ASCII code website.
Cluster Operations
This section describes how to manage cluster operations such as starting, stopping, destroying,
and expanding a cluster.
Stopping a Cluster
Note:
• If you are running Files, stop Files before stopping your AOS cluster. This task stops
all services provided by guest virtual machines and the Nutanix cluster.
• If you are planning to stop your cluster that has metro availability configured, do not
stop the cluster before performing some remedial actions. For more information, see
Conditions for Implementing Data Protection (Metro Availability) in the Prism Web
Console Guide.
(Hyper-V only) Stop the Hyper-V failover cluster by logging on to a Hyper-V host and running
the Stop-Cluster PowerShell command.
Note: This procedure stops all services provided by guest virtual machines, the Nutanix cluster,
and the hypervisor host.
Procedure
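To stop cluster services, run the cluster stop command from any Controller VM after the guest VMs (and Files, if deployed) are shut down; the following is a minimal sketch assuming the standard cluster CLI:
nutanix@cvm$ cluster stop
The command reports progress as services stop on each Controller VM.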
Node Removal
You may need to remove a node for various reasons, such as to replace a failed node or
to deprecate an old node during cluster expansion. You can remove a node using the Prism web
console or nCLI.
To remove a node (host) from the cluster, see Removing a Node in the Prism Web Console
Guide.
Procedure
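To start the cluster, run the following from any Controller VM (a minimal sketch, assuming the standard cluster CLI):
nutanix@cvm$ cluster start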
If the cluster is running properly, the output displays the status of all the applications on all
the CVMs. All of them must display the status as UP. An output similar to the following is
displayed for each node in the cluster:
CVM:host IP-Address Up
Zeus UP [9935, 9980, 9981, 9994, 10015, 10037]
Scavenger UP [25880, 26061, 26062]
Xmount UP [21170, 21208]
SysStatCollector UP [22272, 22330, 22331]
IkatProxy UP [23213, 23262]
IkatControlPlane UP [23487, 23565]
SSLTerminator UP [23490, 23620]
SecureFileSync UP [23496, 23645, 23646]
Medusa UP [23912, 23944, 23945, 23946, 24176]
DynamicRingChanger UP [24314, 24404, 24405, 24558]
Pithos UP [24317, 24555, 24556, 24593]
InsightsDB UP [24322, 24472, 24473, 24583]
Athena UP [24329, 24504, 24505]
Mercury UP [24338, 24515, 24516, 24614]
Mantle UP [24344, 24572, 24573, 24634]
VipMonitor UP [18387, 18464, 18465, 18466, 18474]
Stargate UP [24993, 25032]
InsightsDataTransfer UP [25258, 25348, 25349, 25388, 25391, 25393,
25396]
Ergon UP [25263, 25414, 25415]
Cerebro UP [25272, 25462, 25464, 25581]
Chronos UP [25281, 25488, 25489, 25547]
Curator UP [25294, 25528, 25529, 25585]
Prism UP [25718, 25801, 25802, 25899, 25901, 25906,
25941, 25942]
CIM UP [25721, 25829, 25830, 25856]
AlertManager UP [25727, 25862, 25863, 25990]
Arithmos UP [25737, 25896, 25897, 26040]
Catalog UP [25749, 25989, 25991]
Acropolis UP [26011, 26118, 26119]
Uhura UP [26037, 26165, 26166]
Snmp UP [26057, 26214, 26215]
NutanixGuestTools UP [26105, 26282, 26283, 26299]
MinervaCVM UP [27343, 27465, 27466, 27730]
ClusterConfig UP [27358, 27509, 27510]
Aequitas UP [27368, 27567, 27568, 27600]
APLOSEngine UP [27399, 27580, 27581]
APLOS UP [27853, 27946, 27947]
Lazan UP [27865, 27997, 27999]
Delphi UP [27880, 28058, 28060]
Flow UP [27896, 28121, 28124]
What to do next
After you have verified that the cluster is running, you can start guest VMs.
(Hyper-V only) If the Hyper-V failover cluster was stopped, start it by logging on to a Hyper-V
host and running the Start-Cluster PowerShell command.
Warning: By default, Nutanix clusters have redundancy factor 2, which means they can tolerate
the failure of a single node or drive. Nutanix clusters with a configured option of redundancy
factor 3 allow the Nutanix cluster to withstand the failure of two nodes or drives in different
blocks.
Destroying a Cluster
Note:
• If you have destroyed the cluster and did not reclaim the existing licenses, contact
Nutanix Support to reclaim the licenses.
• Reclaiming licenses is required to remove the cluster from the insights portal.
Warning: Destroying a cluster resets all nodes in the cluster to the factory configuration. All
cluster configuration and guest VM data are unrecoverable after destroying the cluster. This
action is not reversible and you must use this procedure discerningly.
Note:
• If the cluster is registered with Prism Central (the multiple cluster manager VM),
unregister the cluster before destroying it. For more information, see Register
(Unregister) Cluster with Prism Central in the Prism Web Console Guide.
• You need admin user access to destroy a cluster.
2. Power off all the VMs that are running on the hosts in the cluster.
nutanix@cvm$ acli vm.off *
Alternatively, you can log into the Web Console and power off all the VMs.
Caution: Performing this operation deletes all cluster and guest VM data in the cluster.
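Once all guest VMs are powered off and the cluster has been stopped, the destroy operation itself is typically a single command run from any Controller VM; the following is a minimal sketch, assuming the standard cluster CLI:
nutanix@cvm$ cluster destroy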
Procedure
Run the vDisk manipulator utility from any Controller VM in the cluster.
• Replace ctr_name with the name of the storage container where the vDisk to fingerprint
resides.
• Replace vdisk_path with the path of the vDisk to fingerprint relative to the storage
container path (for example, Win7-desktop11/Win7-desktop11-flat.vmdk). You cannot
specify multiple vDisks in this parameter.
» To fingerprint all vDisks in the cluster:
nutanix@cvm$ ncli vdisk list | grep "Name.*NFS" | awk -F: \
'{print $4 ":" $5 ":" $6 ":" $7}' >> fingerprint.txt
nutanix@cvm$ for i in `cat fingerprint.txt`; do vdisk_manipulator --vdisk_name=$i \
--operation="add_fingerprints" --stats_only=false; done
Note: You can run vdisk_manipulator in a loop to fingerprint multiple vDisks, but run only one
instance of vdisk_manipulator on each Controller VM at a time. Executing multiple instances
on a Controller VM concurrently would generate significant load on the cluster.
Note: The current IPv6 implementation is allowed only on a Prism Element (PE) using AHV.
Ensure the PE is not registered to a Prism Central.
• Cluster VIP—Supports dual stack and accessible by either IPv6 or IPv4 simultaneously
without the need to reconfigure.
• FSVM CIFS and NFS client connections—Supports dual stack and accessible by either IPv6 or
IPv4 simultaneously without the need to reconfigure.
• DR between clusters for Files data—The remote site connection supports either IPv4 or IPv6.
Both are not supported at the same time for a single remote site.
• CVM and AHV access over SSH—CVM and AHV are accessible by both IPv4 address and
IPv6 address simultaneously over SSH without the need to reconfigure.
IPv6 Architecture
The image below illustrates the IPv6 implementation on a cluster.
IPv6 Configuration
Refer to the following sections to configure IPv6 on a cluster:
Procedure
Note: By default, IPv6 is enabled on AHV. The manage_ipv6 CLI command enables IPv6 on the
AHV.
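The ips.json file referenced in the next step maps each node's existing IPv4 address to its desired IPv6 address. The following is a sketch of its likely shape, inferred from the configuration echo shown below; all values are placeholders and the exact schema should be confirmed for your AOS version:
{
  "svmips": {
    "x.x.x.84": "2001:db8::/32",
    "x.x.x.87": "2001:db8::/32"
  },
  "hostips": {
    "x.x.x.154": "2001:db8::/32",
    "x.x.x.83": "2001:db8::/32"
  },
  "prefixlen": 32
}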
3. Pass the ips.json file to configure the nodes with IPv6 addresses.
nutanix@cvm$ manage_ipv6 -i ips.json configure
This command takes two IPv6 addresses per node and assigns them to the Controller VM eth0
interface and the AHV br0 interface.
An output similar to the following is displayed:
[INFO] Reading IPv6 config from JSON file: ips.json
[INFO] IPv6 config to configure: {
  "svmips": {
    "x.x.x.84": "2001:db8::/32",
    "x.x.x.87": "2001:db8::/32",
    "x.x.x.92": "2001:db8::/32",
  },
  "hostips": {
    "x.x.x.154": "2001:db8::/32",
    "x.x.x.83": "2001:db8::/32",
    "x.x.x.90": "2001:db8::/32",
  },
  "prefixlen": 32,
Refer to the following CLI commands for all the other possible IPv6 actions:
• Remove the IPs assigned to the nodes in the cluster using the ips.json script.
nutanix@cvm$ manage_ipv6 unconfigure
Note: Remove the IPv6 VIP from the cluster if Prism Element is configured with VIP v6.
• Completely disable IPv6 addresses in the AOS. The user cannot configure IPv6 manually
either. The script restarts all the services as required.
nutanix@cvm$ manage_ipv6 disable
• View all the nodes' current IPv6 configuration in YAML format and any other irregularities,
if present, in the cluster.
nutanix@cvm$ manage_ipv6 show
5. To un-configure an IPv6 address, manually remove the cluster Virtual IP address in Prism and
run this command.
nutanix@cvm$ manage_ipv6 unconfigure
In the Virtual IPv6 field, enter an IPv6 address that will be used as a virtual IPv6 for the cluster.
A Controller VM runs on each node and has its own IPv6 address, but this field sets a logical
IPv6 address that always connects to one of the active Controller VMs in the cluster (assuming
at least one is active), which removes the need to enter the address of a specific Controller VM.
The virtual IPv6 address is normally set during the initial cluster configuration (see the Field
Installation Guide), but you can update the address at any time through this field.
For information on adding or modifying cluster parameters, see Modifying Cluster Details in the
Prism Web Console Guide.
Name Server also supports IPv6. For more information about name servers, see Configuring
Name Servers in the Prism Web Console Guide.
• Before attempting to add a node to the cluster, review the Prerequisites and Requirements
in the Prism Web Console Guide. The process for adding a node varies depending on several
factors, and this section covers specific considerations based on your AOS, hypervisor,
encryption, and hardware configuration.
• Check the Health Dashboard. If any health checks are failing, resolve them to ensure that the
cluster is healthy before adding any nodes. For more information, see Health Dashboard in
the Prism Web Console Guide.
• Allow any current add node operations to complete.
Procedure
1. Either click the gear icon in the main menu and then select Expand Cluster in the Settings
page or go to the Hardware Dashboard and click the Expand Cluster button.
The network searches for Nutanix nodes and then the Expand Cluster dialog box appears
(on the Select Host screen) with a graphical list of the discovered blocks and nodes.
Discovered blocks are blocks with one or more unassigned factory-prepared nodes
(hypervisor and Controller VM installed) residing on the same subnet as the cluster.
Discovery requires that IPv6 multicast packets are allowed through the physical switch.
A lack of IPv6 multicast support prevents node discovery and successful cluster expansion.
2. Select the check box for each block that you want to add to the cluster. All nodes within a
checked block are selected automatically; uncheck any nodes you do not want added to the
cluster.
When you select a block, more fields appear below the block diagram. A separate line for
each node (host) in the block appears under each field name.
Note: This is an optional configuration and is required only if IPv6 is configured on the
cluster.
Caution: This feature is for future use. Do not use tech preview features in production
environments.
h. When all the node values are correct, click the Next button (lower right).
The network addresses are validated before continuing. If an issue is discovered, the
problem addresses are highlighted in red. If there are no issues, the process moves to the
Assign Rack screen; a message appears at the top if the hypervisor, AOS, or another
relevant feature is incompatible with the cluster version.
For the remaining procedure to expand clusters, see Expanding a Cluster in the Prism Web
Console Guide.
• For using network mapping, create the network connections and VLANs on both source
and destination cluster. For more information about configuring network connections, see
Network Configuration for VM Interfaces in the Prism Web Console Guide.
• Set up the network mapping on both the source and the destination clusters. For more
information, see Network Mapping in the Data Protection and Recovery with Prism Element
Guide.
Note: Do not create multiple remote sites pointing to the single destination cluster. Otherwise, an
alert will be generated.
Procedure
1. In the Data Protection dashboard (see Data Protection Dashboard in the Data Protection and
Recovery with Prism Element Guide), click the Remote Site button and then select Physical
Cluster from the pull-down list.
The Remote Site dialog box appears.
Note: It is recommended that you use a Virtual IP address for the proxy. Configuring a
remote CVM for the proxy will make the CVM the single point of failure and break the entire
communication if the CVM goes down.
Note: Network Address Translation (NAT) performed by any device in between the two
Nutanix clusters is not currently supported.
Caution: Do not enable a proxy on remote sites that will be used with a metro availability
Protection Domain.
Note: During IPv6 configuration, the IPv6 addresses are stored in Zeus, and the same
addresses are reflected in the ips.json output. If disaster recovery fails, check the
IPv6 configuration stored in Zeus against the JSON script output.
e. Cluster Virtual IP
Note: Ensure that the virtual IP address and the nodes in the remote cluster are in the
same subnet.
• (Only on clusters with network segmentation for disaster recovery) Do the following:
• In Segmented Subnet (Gateway IP/Prefix Length), enter the address of the subnet
that is configured for isolating disaster recovery traffic.
• In Port, enter the port number.
• In Segmented Virtual IP, enter the virtual IP (IPv4 or IPv6) address of the disaster
recovery service.
Note: Ensure that the virtual IP address and the nodes in the remote cluster are in the
same subnet.
See Configuring a Remote Site (Physical Cluster) in the Data Protection and Recovery with
Prism Element Guide for the remainder of the remote site configuration procedure.
Logs are forwarded from a Controller VM and they display the IP address of the Controller VM.
The logs are forwarded from a Controller VM using either TCP or UDP. You can also forward
logs to a remote syslog server by using Reliable Event Logging Protocol (RELP). To use RELP
logging, ensure that you have installed rsyslog-relp on the remote syslog server.
Note: You can use RELP logging only if the transport protocol is TCP.
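If the remote syslog server runs an RPM-based Linux distribution (an assumption; use the equivalent package manager otherwise), the RELP module can typically be installed and activated as follows:
$ sudo yum install rsyslog-relp
$ sudo systemctl restart rsyslog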
After a remote syslog server is configured, it is enabled by default. (The Controller VM begins
sending log messages once the syslog server is configured). rsyslog-config supports and can
report messages from the following Nutanix modules.
Note:
For some modules, there is no change in the list of logs forwarded irrespective of the
monitor logs setting.
For each module, a fixed set of logs is forwarded when monitor logs are disabled and when they are
enabled. For example:
• AHV host logs stored in /var/log/messages are forwarded to the remote syslog server.
• All logs related to the ACROPOLIS service are forwarded to the remote syslog server.
Ensure you enable module logs at the ERROR level, unless you require more information. If you
enable more levels, the rsyslogd daemon sends more messages. For example, if you set the
SYSLOG_MODULE level to INFO, your remote syslog server might receive a large quantity of
operating system messages.
• You can only configure one rsyslog server; you cannot specify multiple servers.
• CPU usage might reach 10 percent when the rsyslogd daemon is initially enabled and
starts processing existing logs. This is an expected condition on first use of an rsyslog
implementation.
Note: As the logs are forwarded from a Controller VM, the logs display the IP address of the
Controller VM.
Procedure
1. As the remote syslog server is enabled by default, disable it while you configure settings.
ncli> rsyslog-config set-status enable=false
2. Create a syslog server (which adds it to the cluster) and confirm it has been created.
ncli> rsyslog-config add-server name=remote_server_name \
relp-enabled={true | false} \
ip-address=remote_ip_address port=port_num \
network-protocol={tcp | udp}
ncli> rsyslog-config ls-servers
Name : remote_server_name
IP Address : remote_ip_address
Port : port_num
Protocol : TCP or UDP
Relp Enabled : true or false
Replace:
• remote_server_name with a descriptive name for the remote server receiving the specified
messages.
• remote_ip_address with the remote server's IP address.
• {true | false} with true to enable RELP or false to disable it.
3. Choose a module to forward log information from and specify the level of information to
collect.
ncli> rsyslog-config add-module server-name=remote_server_name \
module-name=module level=loglevel include-monitor-logs={ false | true }
• Replace module with one of the following:
• ACROPOLIS
• AUDIT
• CASSANDRA
• CEREBRO
• CURATOR
• GENESIS
• PRISM
• STARGATE
• SYSLOG_MODULE
• ZOOKEEPER
• Replace loglevel with one of the following:
• DEBUG
• INFO
• NOTICE
• WARNING
• ERROR
• CRITICAL
• ALERT
• EMERGENCY
Enable module logs at the ERROR level unless you require more information.
• (Optional) Set include-monitor-logs to specify whether the monitor logs are sent. It is
enabled (true) by default. If disabled (false), only certain logs are sent.
Note: If enabled, the include-monitor-logs option sends all monitor logs, regardless of the
level set by the level= parameter.
Note: The rsyslog configuration is sent to Prism Central, Prism Element, and AHV only if the
module selected for export is applicable to them.
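For example, the following sequence registers a hypothetical remote server named central-syslog (the name, address, and port are placeholders), forwards STARGATE logs at the ERROR level, and then re-enables forwarding; the final set-status enable=true step is the assumed inverse of the disable command shown in step 1:
ncli> rsyslog-config set-status enable=false
ncli> rsyslog-config add-server name=central-syslog \
relp-enabled=false ip-address=10.1.1.50 port=514 \
network-protocol=tcp
ncli> rsyslog-config add-module server-name=central-syslog \
module-name=STARGATE level=ERROR include-monitor-logs=false
ncli> rsyslog-config set-status enable=true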
Common Log Files
Nutanix stores log files that contain cluster service events and errors in a common log root
directory on the local filesystem of each CVM in a cluster. The logs contain details of all the
relevant services required for cluster operation and monitoring.
The files in the common log area are further classified into different directories, depending on
the type of information they contain.
Note:
• Starting with Prism version 5.18, the timestamps for all Nutanix service logs are in UTC
(ISO 8601 format, for example 2020-01-01T00:00:00Z).
• Operating system logs are not moved to UTC, so Nutanix recommends that you set the
server local time to UTC.
.FATAL Logs
If a component fails, it creates a log file named according to the following convention:
component-name.cvm-name.log.FATAL.date-timestamp
• component-name identifies the component that created the file, such as Curator or Stargate.
• date-timestamp identifies the date and time when the first failure within that file occurred.
Each failure creates a new .FATAL log file.
Log entries use the following format:
[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
The first character indicates whether the log entry is an Info, Warning, Error, or Fatal. The next
four characters indicate the day on which the entry was made. For example, if an entry starts
with F0820, it means that at some time on August 20th, the component had a failure.
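For example, a hypothetical Fatal entry written by Stargate on August 20 would look similar to the following (the thread ID, source file, and message text are illustrative only):
F0820 11:02:17.426021 12345 stargate.cc:912] <fatal message text>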
Tip: The cluster also creates .INFO and .WARNING log files for each component. Sometimes, the
information you need is stored in one of these files.
Each node monitors itself by running several Linux tools every few minutes, including ping,
iostat, sar, and df.
This directory contains the output for each of these commands, along with the corresponding
timestamp.
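For example, to view the most recent iostat samples collected by this self-monitoring, you could tail the corresponding output file (assuming it is stored in the sysstats directory under the common log root, as is typical on a CVM):
nutanix@cvm$ tail /home/nutanix/data/logs/sysstats/iostat.INFO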
/home/nutanix/data/logs/cassandra
The /home/nutanix/data/logs/cassandra directory stores the Cassandra metadata database
logs. The Nutanix process that starts the Cassandra database (cassandra_monitor) logs to
the /home/nutanix/data/logs directory. However, the most useful information relating to the
Cassandra is found in the system.log* files located in the /home/nutanix/data/logs/cassandra
directory.
Note: Audit logs with default values are generated when updates to VMs are initiated, either by
Prism Central Self Service users or by using Nutanix v3 API calls for the first time.
Scenario 2: Powering on a VM from Prism Web Console with Self Service User (testuser)
{"affectedEntityList":[{"entityType":"vm","name":"Test","uuid":"341a36a4-bc1e-4f45-
b829-41de340aa6f4"},{"entityType":"host","name":"BayouBilly-3","uuid":"ed47be97-
ea80-4c79-bc32-749d5b9a5dac"}],"alertUid":"VmPowerOnAudit","classificationList":
["UserAction"],"clientIp":"x.x.x.193","creationTimestampUsecs":"1625651262934906","defaultMsg":"Powered
on VM
Test","opEndTimestampUsecs":"1625651262930860","opStartTimestampUsecs":"1625651258486149","operationType":"P
cb05-3882-ac1f6b161aaa","params":
{"vm_name":"Test"},"recordType":"Audit","sessionId":"1adcde83-2dfe-437f-
af5b-3f532d7529c4","severity":"Audit","userName":"testuser","uuid":"b6634fdb-169c-453a-8e99-0da2623b6c50"}
created","opEndTimestampUsecs":"1625651133825892","opStartTimestampUsecs":"1625651133825892","operationType"
b75e-688c5ce6cfe9","params":
{"category_key":"TEST"},"recordType":"Audit","severity":"Audit","userName":"admin","userUuid":"00000000-0000
ad7b-fe63aa68fda7"}
Table 6: Location: /home/nutanix/data/logs/cassandra
Table 9: Location: /var/log
Log Contents Frequency
iostat.INFO I/O activity for each physical disk every 5 sec (sudo iostat)
num.processed Alerts that have been processed
Procedure
1. Search for the timestamp of the FATAL event in the corresponding INFO files.
nutanix@cvm$ grep "^F0907 01:22:23" stargate*INFO* | cut -f1 -d:
stargate.NTNX-12AM3K490006-2-CVM.nutanix.log.INFO.20130904-220129.7363
c. Open the INFO file with vi and go to the bottom of the file (Shift+G).
d. Analyze the log entries immediately before the FATAL event, especially any errors or
warnings.
2. If a process fails repeatedly, it might be faster to do a long listing of the INFO files and
select the one immediately preceding the current one. The current one would be the one
referenced by the symbolic link.
For example, in the output below, the last failure would be recorded in the file
stargate.NTNX-12AM3K490006-2-CVM.nutanix.log.INFO.20130904-220129.7363.
nutanix@cvm$ ls -ltr stargate*INFO*
-rw-------. 1 nutanix nutanix 104857622 Sep 3 11:22 stargate.NTNX-12AM3K490006-2-
CVM.nutanix.log.INFO.20130902-004519.7363
-rw-------. 1 nutanix nutanix 104857624 Sep 4 22:01 stargate.NTNX-12AM3K490006-2-
CVM.nutanix.log.INFO.20130903-112250.7363
-rw-------. 1 nutanix nutanix 56791366 Sep 5 15:12 stargate.NTNX-12AM3K490006-2-
CVM.nutanix.log.INFO.20130904-220129.7363
lrwxrwxrwx. 1 nutanix nutanix 71 Sep 7 01:22 stargate.INFO ->
stargate.NTNX-12AM3K490006-2-CVM.nutanix.log.INFO.20130907-012223.11357
-rw-------. 1 nutanix nutanix 68761 Sep 7 01:33 stargate.NTNX-12AM3K490006-2-
CVM.nutanix.log.INFO.20130907-012223.11357
Tip: You can use the procedure above for the other types of files as well (WARNING and
ERROR) in order to narrow the window of information. The INFO file provides all messages,
WARNING provides only warning, error, and fatal-level messages, ERROR provides only error
and fatal-level messages, and so on.
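To see at a glance which components have failed recently, you can list the .FATAL files themselves; the path below assumes the common log root described earlier:
nutanix@cvm$ ls -ltr /home/nutanix/data/logs/*.FATAL*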
Stargate Logs
This section discusses common entries found in Stargate logs and what they mean. The
Stargate logs are located at /home/nutanix/data/logs/stargate.[INFO|WARNING|ERROR|FATAL].
This message is generic and can happen for a variety of reasons. While Stargate is initializing,
a watchdog process monitors it to ensure a successful startup. If Stargate has trouble
connecting to other components (such as Zeus or Pithos), the watchdog process stops
Stargate.
If Stargate is running, this indicates that the alarm handler thread is stuck for longer than 30
seconds. The stoppage could be due to a variety of reasons, such as problems connecting to
Zeus or accessing the Cassandra database.
To analyze why the watchdog fired, first locate the relevant INFO file and then review the entries
leading up to the failure.
This message indicates that Stargate is unable to communicate with Medusa. This may be due
to a network issue.
Analyze the ping logs and the Cassandra logs.
Log Entry: CAS failure seen while updating metadata for egroup egroupid or Backend
returns error 'CAS Error' for extent group id: egroupid
W1001 16:22:34.496806 6938 vdisk_micro_egroup_fixer_op.cc:352]
CAS failure seen while updating metadata for egroup 1917333
This is a benign message and usually does not indicate a problem. This warning message means
that another Cassandra node has already updated the database for the same key.
Log Entry: Fail-fast after detecting hung stargate ops: Operation with id opid hung for
60secs
F0712 14:19:13.088392 30295 stargate.cc:912] Fail-fast after detecting hung stargate ops:
Operation with
id 3859757 hung for 60secs
This message indicates that Stargate restarted because an I/O operation took more than 60
seconds to complete.
To analyze why the I/O operation took more than 60 seconds, locate the relevant INFO file and
review the entries leading up to the failure.
Log Entry: Timed out waiting for Zookeeper session establishment
F0907 01:22:23.124495 10559 zeus.cc:1779] Timed out waiting for Zookeeper session establishment
This message indicates that Stargate had 5 failed attempts to connect to Medusa/Cassandra.
Review the Cassandra log (cassandra/system.log) to see why Cassandra was unavailable.
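For example, to review recent Cassandra activity around the time of the failure:
nutanix@cvm$ tail -n 100 /home/nutanix/data/logs/cassandra/system.log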
Log Entry: multiget_slice() failed with error: error_code while reading n rows from
cassandra_keyspace
E1002 18:51:43.223825 24634 basic_medusa_op.cc:1461] multiget_slice() failed with error: 4
while reading 1 rows
from 'medusa_nfsmap'. Retrying...
Log Entry: Forwarding of request to NFS master ip:2009 failed with error kTimeout.
W1002 18:50:59.248074 26086 base_op.cc:752] Forwarding of request to NFS master
172.17.141.32:2009 failed with
error kTimeout
This message indicates that Stargate cannot connect to the NFS leader on the node specified.
Review the Stargate logs on the node specified in the error.
Cassandra Logs
After analyzing Stargate logs, if you suspect an issue with Cassandra/Medusa, analyze the
Cassandra logs. This topic discusses common entries found in system.log and what they mean.
Log Entry: batch_mutate 0 writes succeeded and 1 column writes failed for
keyspace:medusa_extentgroupidmap
INFO [RequestResponseStage:3] 2013-09-10 11:51:15,780 CassandraServer.java (line 1290)
batch_mutate 0
writes succeeded and 1 column writes failed for keyspace:medusa_extentgroupidmap
cf:extentgroupidmap
row:lr280000:1917645 Failure Details: Failure reason:AcceptSucceededForAReplicaReturnedValue :
1
This is a common log entry and can be ignored. It is equivalent to the CAS errors in the
stargate.ERROR log. It simply means that another Cassandra node updated the keyspace first.
This message indicates that the node could not communicate with the Cassandra instance at
the specified IP address.
Either the Cassandra process is down (or failing) on that node or there are network connectivity
issues. Check the node for connectivity issues and Cassandra process restarts.
Log Entry: Caught Timeout exception while waiting for paxos read response from leader:
x.x.x.x
ERROR [EXPIRING-MAP-TIMER-1] 2013-08-08 07:33:25,407 PaxosReadDoneHandler.java (line 64) Caught
Timeout
exception while waiting for paxos read reponse from leader: 172.16.73.85. Request Id: 116.
Proto Rpc Id : 2119656292896210944. Row no:1. Request start time: Thu Aug 08 07:33:18 PDT
2013.
Message sent to leader at: Thu Aug 08 07:33:18 PDT 2013 # commands:1 requestsSent: 1
This message indicates that the node encountered a timeout while waiting for the Paxos leader.
Either the Cassandra process is down (or failing) on that node or there are network connectivity
issues. Check the node for connectivity issues or for Cassandra process restarts.
The ss -s command shows a socket usage summary, including the number of open ports.
nutanix@cvm$ ss -s
Total: 277 (kernel 360)
TCP: 218 (estab 89, closed 82, orphaned 0, synrecv 0, timewait 78/0), ports 207
If there are issues with connecting to the Nutanix UI, escalate the case and provide the
output of the ss -s command as well as the contents of prism_gateway.log.
Zookeeper Logs
The Zookeeper log is located at /home/nutanix/data/logs/zookeeper.out and contains the
status of the Zookeeper service.
More often than not, there is no need to look at this log. However, if one of the other logs
specifies that it is unable to contact Zookeeper and it is affecting cluster operations, you may
want to look at this log to find the error Zookeeper is reporting.
Genesis.out
When checking the status of the cluster services, if any of the services are down, or the
Controller VM is reporting Down with no process listing, review the log at /home/nutanix/data/
logs/genesis.out to determine why the service did not start, or why Genesis is not properly
running.
Check the contents of genesis.out if a Controller VM reports multiple services as DOWN, or if
the entire Controller VM status is DOWN.
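For example, you might first confirm which services are reported as down before opening genesis.out; the grep filter below is just one way to hide healthy entries:
nutanix@cvm$ cluster status | grep -v UP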
Like other component logs, genesis.out is a symbolic link to the latest genesis.out instance,
which has the format genesis.out.date-timestamp.
An example of steady state output:
nutanix@cvm$ tail -F ~/data/logs/genesis.out
2017-03-23 19:24:00 INFO node_manager.py:2070 Certificate cache in sync
2017-03-23 19:24:00 INFO node_manager.py:4732 Checking if we need to sync the local SVM and
Hypervisor DNS configuration
with Zookeeper
2017-03-23 19:26:00 INFO node_manager.py:1960 Certificate signing request data is not available
in Zeus configuration
2017-03-23 19:26:00 INFO node_manager.py:1880 No Svm certificate maps found in the Zeus
configuration
2017-03-23 19:26:00 INFO node_manager.py:4732 Checking if we need to sync the local SVM and
Hypervisor DNS configuration
with Zookeeper
2017-03-23 19:28:00 INFO node_manager.py:1960 Certificate signing request data is not available
in Zeus configuration
2017-03-23 19:28:00 INFO node_manager.py:1880 No Svm certificate maps found in the Zeus
configuration
Under normal conditions, the genesis.out file logs the following messages periodically:
Unpublishing service Nutanix Controller
Publishing service Nutanix Controller
Zookeeper is running as [leader|follower]
Prior to these occasional messages, you should see Starting [n]th service. This is an indicator
that all services were successfully started.
Tip: You can ignore any INFO messages logged by Genesis by running the command:
grep -v -w INFO /home/nutanix/data/logs/genesis.out
Possible Errors
2017-03-23 19:28:00 WARNING command.py:264 Timeout executing scp -q -o CheckHostIp=no -o
ConnectTimeout=15 -o
StrictHostKeyChecking=no -o TCPKeepAlive=yes -o UserKnownHostsFile=/dev/null -o
PreferredAuthentications=keyboard-interactive,password -o BindAddress=x.x.x.254
'root@[x.x.x.1]:/etc/resolv.conf' /tmp/resolv.conf.esx: 30 secs elapsed
2017-03-23 19:28:00 ERROR node_dns_ntp_config.py:287 Unable to download ESX DNS configuration
file, ret -1,
stdout , stderr
2017-03-23 19:28:00 WARNING node_manager.py:2038 Could not load the local ESX configuration
2017-03-23 19:28:00 ERROR node_dns_ntp_config.py:492 Unable to download the ESX NTP
configuration file,
ret -1, stdout , stderr
Any of the above messages means that Genesis was unable to log on to the ESXi host using the
configured password.
Procedure
1. Examine the contents of the genesis.out file and locate the stack trace (indicated by the
CRITICAL message type).
In the example above, the certificates in AuthorizedCerts.txt were not updated, which
means that the connection to the NutanixHostAgent service on the host failed.
Log Contents
datastore/vm_name/vmware.log Virtual machine activity and health
Nutanix Calm Log Files
The following table provides Nutanix Calm logs related information.
Log Description
/home/docker/nucalm/logs Logs of microservices from Nutanix Calm
container.
/home/docker/epsilon/logs Logs of microservices from Epsilon Container.
/home/nutanix/data/logs/genesis.out Logs containing information about enabling
container service and starting Nutanix Calm
and epsilon containers.
/home/nutanix/data/logs/epsilon.out Logs containing information about starting
epsilon service.
TRAFFIC MARKING FOR QUALITY OF
SERVICE
To prioritize outgoing (or egress) traffic as required, you can configure quality of service on the
traffic for a cluster.
There are two distinct types of outgoing or egress traffic: management traffic and data services traffic.
Note: Set any QoS value in the range 0–0x3f. The default QoS values are 0x10 for management traffic and 0xa for data services traffic.
Configuring QoS
Configure Quality of Service (QoS) for management and data services traffic using nCLI.
Procedure
• Activate QoS on a cluster by running the following commands on all the CVMs in the cluster:
nutanix@cvm$ echo --qos_enabled=true >> ~/config/genesis.gflags
nutanix@cvm$ allssh genesis restart
If you run the command as net enable-qos without the options, AOS enables QoS with the
default values (mgmt=0x10 and data-svc=0xa).
Note: After you run the net enable-qos command, if you run it again, the command fails and
AOS displays the following output:
QoSUpdateStatusDTO(status=false, message=QoS is already enabled.)
Where:
• message=QoS is already enabled. indicates why the command failed. This sample error
message indicates that the net enable-qos command failed because it was run again when
QoS was already enabled.
Note: If you need to change the QoS values after you enable it, run the net edit-qos
command with the appropriate option (data-svc or mgmt or both, as necessary).
Note: When you retrieve the QoS configuration on a cluster where QoS is enabled, nCLI displays the QoS
values in hexadecimal format (0xXX, where XX is a hexadecimal value in the range 00–3f).
Where:
• dataSvc=0xa indicates that the QoS value for data services traffic (the data-svc option) is
set to 0xa (hexadecimal). If you disabled QoS, this parameter is displayed as dataSvc=null.
• To set the QoS values for the traffic types on a cluster after you enabled QoS on the cluster,
run the following command:
ncli> net edit-qos [data-svc="data-svc value"][mgmt="mgmt value"]
You can provide QoS values between 0x0 and 0x3f for one or both options. The value is the
hexadecimal representation of a decimal value between 0 and 63, inclusive.
• To disable QoS on a cluster, run the following command:
ncli> net disable-qos
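Putting these commands together, a hypothetical sequence that enables QoS with explicit values, adjusts one of them, and then disables QoS might look like the following. This assumes that net enable-qos accepts the same data-svc and mgmt options as net edit-qos; the values are examples only:
ncli> net enable-qos mgmt="0x10" data-svc="0xa"
ncli> net edit-qos data-svc="0x12"
ncli> net disable-qos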
Requirements
Requirements for Blockstore and SPDK include the following:
• Ensure that the cluster is created on a compatible version of AOS. For information about the
AOS versions compatible with Blockstore, see the AOS Family Release Notes.
• Ensure that the all-flash (AF) cluster, created with AOS 6.1 or later, has at least one
non-boot flash device; Blockstore (without SPDK) is enabled on that device.
Note: Blockstore is enabled only on non-boot flash devices, that is, flash devices
that are not used as part of the RAID partition that boots the CVM.
If flash devices need to be configured as part of CVM boot (RAID) partition, then
ensure that the all-flash cluster has more than two flash devices on at least one node.
For example, in an all-flash cluster in which each node has only two flash devices, the
blockstore is not enabled in that cluster. When you add one more flash device on any
node in that cluster as a non-boot device, Blockstore is enabled on the newly added
non-boot flash device.
Nutanix supports additional storage device configurations in each AOS release. Contact
Nutanix Support for information about the storage device configurations for the AOS release
you deploy.
Procedure
2. Check the /dev/spdk/ directory to see if any NVMe devices are listed in it, using one of the
following methods:
» On the local node:
nutanix@cvm$ ls /dev/spdk/*
Or
» On all nodes collectively:
nutanix@cvm$ allssh ls /dev/spdk/*
If you see NVMe devices in the /dev/spdk/ directory, then Blockstore with SPDK is enabled.
3. Alternatively, check the iostat command output to see if any NVMe devices are displayed
therein.
nutanix@cvm$ iostat
If you see SPDK NVMe devices in device list, then Blockstore with SPDK is enabled on the
NVMe devices.
What to do next
If all the preceding checks pass and SPDK is still not enabled, contact Nutanix Support.
I/O Tiers
Before describing the Intel Optane performance tier, this section describes storage tiers and the
structure of read-write operations.
Distributed Storage Fabric (DSF) I/O Path
Within the CVM, the Stargate process is responsible for handling storage I/O for user VMs
(UVMs) and persistence (RF, etc.). The Autonomous Extent Store (AES) is used to handle
sequential and sustained random workloads (subject to certain conditions). The Oplog, which
extends across both tiers in a hybrid drive configuration, handles purely random read-write
data and drains into the Extent Store.
Information Lifecycle Management (ILM) Tiers
Nutanix Information Lifecycle Management (ILM) uses storage tiers to place data
based on how different types of storage devices like SATA SSD or NVMe SSDs perform.
ILM monitors how data is being accessed and places the most accessed data in high
performance, low latency drives to achieve the best possible access speeds. ILM
distinguishes read intensive data from write intensive data.
Nutanix ILM defines storage tiers based on categorization of storage drives. Nutanix
categorizes drives into three types: DAS-SATA, SSD-SATA, and SSD-PCIe. DAS-SATA
includes HDDs. SSD-SATA includes SATA SSDs. The SSD-PCIe category includes NVMe
drives like Intel P4510. ILM uses a preference list for data placement using Extent groups.
This list, in the descending order of preference, is: SSD-PCIe, SSD-SATA and DAS-SATA.
ILM deploys a multi tier architecture in hybrid drive configurations (such as SATA SSD +
HDD) to enable faster processing of high frequency data. In a hybrid drive configuration,
tier 0 is always the fast tier with low latency drives such as SSDs. Tier 1 is the slower tier
with higher latency devices such as HDDs.
In general, in all-flash drive configurations including all-NVMe configurations, ILM uses
single tier architecture.
In a hybrid drive configuration, the Nutanix ILM based I/O path consists of the following
components and pathways where Tier 1 is HDD and Tier 0 is low latency SATA SSD. Bursty Rand
refers to purely random writes and Sustained Rand refers to sustained random writes.
Typically, in an all-flash drive configuration, the I/O path using single tier architecture is
described by the following figure.
Tip: A typical all-NVMe configuration with Intel Optane consists of 2 x Optane SSDs + 6 x
NVMe SSDs, with sizes of 750 GB and 1.5 TB for the Optane and NVMe SSDs, respectively.
From a storage tier perspective, ILM categorizes Optane as SSD-MEM-NVMe tier and gives
it the highest weight to enable migration of frequently accessed data. With this additional
Optane tier, the multi tier preference list for random data migration using Extent groups, in the
descending order of preference, is: SSD-MEM-NVMe, SSD-PCIe, SSD-SATA and DAS-SATA. Thus,
Optane tier is the most preferred tier used primarily for migrating random data that is read with
high frequency.
Note: AOS prefers non-Optane NVMe SSDs as first choice for initial placement to avoid filling up
the scarce Optane resources.
Reads in all-NVMe configurations with Intel Optane (Intel Optane + NVMe SSD) are managed
with two tiers - the highest performance media being Intel Optane is tier 0 (SSD-MEM-NVMe)
and the lower performance media being the other (non-Optane) NVMe SSDs are treated as tier
1 (SSD-PCIe). AOS migrates frequently accessed (frequently read) random data to the Optane
tier to leverage the low latency high read speed. Less frequently read random data is retained
in Tier 1 that is the normal NVMe SSD tier. Random data when it has higher read frequency
versus write frequency is defined as read-hot. AOS (via ILM) migrates Extent groups for read-
hot random data (read intensive workloads) to the Optane tier based on the pre-configured
migration weights assigned to the tiers.
• AOS migrates the random data read with high frequency (read-hot random data) to the
Optane performance tier (SSD-MEM-NVMe) since Optane drives provide very low read
latency. AOS uses the Read count versus write count and migration weights for a random
data segment to decide whether it should be migrated to the Optane tier.
Note: AOS (via ILM) uses pre-configured read and write weights for tier-based migration
decisions. When a data segment is no longer read-hot or random, it is migrated to the
non-Optane NVMe tier to conserve Optane resources.
• AOS (Cassandra service) uses only the non-Optane NVMe drives for metadata.
• AOS prefers non-Optane NVMe drives as first choice for initial placement to avoid filling up
the scarce Optane resources.
• AOS uses both Optane and non-Optane NVMe tiers for Oplog and Extent store.
• The AOS installation is a fresh installation of an AOS version that supports the Optane tier for
NVMe (minimum AOS version of 6.1).
• Ensure that the clusters have only Intel Optane and other NVMe SSD drives.
• Ensure that all the containers in the cluster have AES enabled.
The following limitations apply to the Optane performance tier.
• The Optane performance tier is not supported on clusters upgraded to supporting AOS
versions.
• Performance tier is auto-disabled in a cluster if any node in the cluster has an HDD or a SATA
SSD. If any node in the cluster has a drive configuration that includes a SATA SSD or an HDD,
AOS disables the performance tier for the cluster as a whole including on all the other nodes
that have only NVMe and Intel Optane drives.
When the Optane performance tier is disabled, an alert is raised. To clear the alert and auto-
enable the performance tier, remove the SATA SSD or HDD that is present on any node in the
cluster.
Introducing (or adding) a SATA SSD or HDD in any node in the cluster also disables the
performance tier and raises the alert.
• The Optane performance tier is auto-disabled if any node in the cluster does not have
any Optane drives. In other words, all the nodes in the cluster must have only NVMe and
Optane drives.
Note:
• To interpret the acronyms and symbols used in the following table, see the
information available in Table 14: Acronyms and Symbols - Description on page 75
table.
• For the platforms described in this section, if two SSD devices (SAS/SATA or NVMe)
are used for Cassandra metadata, the SSD devices must be ext4 format (non-
blockstore) except for NX-1065 platform that uses one SSD.
• CVM vCPU == Number of physical host CPUs minus 2, limited to a maximum of 22 vCPUs.
• Controller VM memory == Physical host memory minus 16GB
Note: Minimum Foundation version of 5.3 supports these limits with NUMA pinnings or
alignments. Earlier Foundation versions with a minimum version of 5.0 support these limits but
not NUMA pinnings or alignments.
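Applying the vCPU formula above: a host with 32 physical CPU cores yields min(32 - 2, 22) = 22 CVM vCPUs, while a host with 16 physical cores yields 14 CVM vCPUs.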
Cluster Feature / CVM Field Requirements - vCPU and vRAM (in GiB): Minimum, Recommended
High Performance**: 12 to 16 vCPUs
(NUMA nodes >= 2, NVMe drives >= 2, cores >= 8)
Generic: 8 to 12 vCPUs; 20 GiB vRAM
Number of physical cores in each NUMA node:
• 8 or more
• 6 or more and Hyperthreading is enabled
• HDD = 32 TB or more
• SSD = 48 TB or more (Any combination: SAS/SATA only, NVMe only, or a mix of SAS/
SATA and NVMe)
**High performance specifications include:
• RDMA (2 or 4 socket) and Hyperthreading is enabled, or
• 12 or more physical cores present in each NUMA node.
Note: Ensure that the maximum capacity of the node is in line with qualified HCI and NUS limits.
• The All-Flash HCI node must have the same capacity SSDs (for example, 1.92 TB), whether
the SSDs are all of the same type (SATA/SAS/NVMe) or mixed (SATA/SAS + NVMe).
• The Hybrid HCI node must have the same capacity SSDs (for example, 3.86 TB SATA SSD)
and the same capacity HDDs (for example, 8 TB HDD).
Important: These rules are strictly applicable only when you add similar nodes to the existing
cluster and not when you deploy new clusters.
• If total NVMe+SSD node capacity is less than 64 TB, the drive sizes can have a skew of up to
100%. For example:
• Supported combination – Configurations with equal size NVMe SSD and SATA/SAS SSD
drives.
• Supported combination – Configurations with skew <= 20%; for example, 1.6 TB NVMe +
1.92 TB SATA/SAS SSD.
• Supported combination(Only for existing cluster deployments) – Configurations with
skew <= 100% ; for example, a combination of 1.92 TB and 3.84 TB regardless of the drive
interface (SATA/SAS/NVMe).
• Unsupported combination – Configurations with skew > 100%; for example, a combination
of 1.92 TB & 7.68 TB regardless of the drive interface (SATA/SAS/NVMe).
• If the total NVMe+SSD node capacity is more than 64 TB, the maximum supported skew
in drive sizes within the same tier cannot exceed 20%. For example:
• Supported combination – Configurations with equal size NVMe SSD & SATA/SAS SSD
drives.
• Supported combination – Configurations with skew <= 20 %. ex: 1.6 TB NVMe + 1.92 TB
SATA/SAS SSD.
• Unsupported combination – Configurations with skew > 20%; for example, a combination
of 3.84 TB and 7.68 TB regardless of the drive interface (SATA/SAS/NVMe).
• Supported combination – A mix of 16 TB and 18 TB HDDs (less than 20% skew in drive sizes
within the same tier).
• Unsupported combination – A mix of 14 TB and 18 TB HDDs (more than 20% skew in drive
sizes within the same tier).
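The skew in these examples appears to be the size difference expressed as a percentage of the smaller drive, for example:
• 18 TB versus 16 TB: (18 - 16) / 16 = 12.5%, which is within the 20% limit.
• 18 TB versus 14 TB: (18 - 14) / 14 ≈ 28.6%, which exceeds the 20% limit.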
Greater than 20% skew in drive sizes may lead to performance inconsistency because the larger
drives are targeted for initial writes and, in the back end, the system constantly tries to balance
data across the drives.
Note: Starting with AOS 6.0, during drive replacements in flash tier, in case of non-availability of
drive sizes that conform to the above rules, bigger drives can be configured. However, the bigger
drives are downsized by AOS to match the capacity of the rest of the drives in the flash tier on
the node to avoid performance issues.
• Ensure that the capacity allocated to the largest HCI node is also available across the remaining HCI
nodes of the cluster. This management technique enables you to handle any failure that
occurs on the largest HCI node. For example, in a 4-Node cluster, if the largest HCI node
capacity is 75 TB, ensure that the total capacity of the remaining three HCI nodes is equal to
or greater than 75 TB.
• When you add new nodes to the cluster, you must follow the Redundancy Factor (RF) of the
cluster. If the cluster is an RF-2 cluster, add at least two HCI nodes of the same type (either
Hybrid or All-Flash). If you add fewer nodes than the RF of the cluster, you can perform any
capacity increase action only based on the available capacity of the remaining nodes in the
cluster.
For example, if you have set a cluster with three nodes of 10 TB capacity each and the fourth
node of 80 TB capacity, the total capacity of the cluster becomes 110 TB. In this case, the
usable raw capacity is only 60 TB (with 30 TB logical capacity for RF2). The total capacity of
80 TB of the fourth node cannot be used due to unavailability of required space on the rest
of the nodes for replica placement.
Note: If you add one physical drive at a time, either to one cluster node or to multiple cluster
nodes at the same time, the following issues might occur in the system:
• Number of Stargate restarts increases, and the local Stargate becomes unavailable
for a short term.
• Stargate across multiple cluster nodes can restart at the same time.
The new physical drives that are added to a deployed HCI node of a cluster must be equally
distributed among the nodes in the cluster in a round-robin fashion, and the RF of the cluster
must be maintained.
• Hybrid HCI Node - Involves both SSDs and HDDs. In the case of Hybrid HCI nodes, the SSDs
can be either SAS/SATA or NVMe. A combination of different SSDs (SAS/SATA + NVMe )
and HDD is not allowed in the same HCI node.
• All-Flash HCI Node - Involves only SSDs. In case of All-Flash HCI nodes, the SSDs can be
either SAS/SATA, Optane, or NVMe or a combination of any of these SSDs.
Note: The Optane SSD cannot be used as a standalone SSD in an All-Flash HCI Node. It can
be used only in combination with SAS/SATA or NVMe.
Note:
• To interpret the acronyms and symbols used in the following table, see the
information available in Table 14: Acronyms and Symbols - Description on page 75
table.
• Ensure that you adhere to the capacity guidelines described in HCI Node Capacity
Guidelines on page 79.
• Ensure that you follow the Nutanix-recommended new drive addition and
replacement instructions described in HCI Node - Recommended Drive Addition /
Replacement Instructions on page 81.
• Minimum: 10% of overall capacity
• Recommended: 10% of overall capacity + active Working Set Size (WSS)*.
*The active Working Set Size (WSS) is the amount of data that the application reads/
writes frequently. While factoring the flash capacity for the cluster, in addition to the CVM
requirements, you should also provision additional flash capacity to account for the working
set size of the workloads. The Nutanix sizing tool considers both when sizing the cluster.
Important:
• The HDD:SSD ratio should be 2:1 to provide sufficient bandwidth on the slower
tier to absorb ILM down migrations from the faster tier. If it is difficult to
maintain a 2:1 HDD:SSD ratio, an All-Flash node is recommended.
• In a Hybrid HCI node, partial population of Hybrid (HDD+SSD) nodes is allowed
only for the following platforms:
• All DX platforms.
• NX 8155
• The following server platforms are exceptions where the 2:1 HDD:SSD ratio is not
followed:
• 4 drive slot Dell XC, Lenovo HX, Fujitsu XF platforms (2 SSD + 2 HDD)
• 10 drive slot Dell XC, Lenovo HX, Fujitsu XF platforms (4 SSD + 6 HDD)
• NX-1065 (1 SSD + 2 HDD), NX-1175S (2 SSD + 2 HDD)
• HPE DX360 4 LFF Gen10 and Gen10 Plus, DX320 4 LFF Gen11
Note:
• To interpret the acronyms and symbols used in the following table, see the
information available in Table 14: Acronyms and Symbols - Description on page 75
table.
• Ensure that you adhere to the capacity guidelines described in HCI Node Capacity
Guidelines on page 79.
SSD Type; Number of Drives; Number of SSDs (Minimum, Recommended); SSD Capacity in TB (Minimum, Recommended)
Important:
• The server platform examples for Optane + SAS/SATA are:
• NX-8170
• HPE DX-360
• Intel DCS LCY2224NX3
• Data Protection and Recovery with Prism Element Guide - See Resource Requirements
Supporting Snapshot Frequency (Asynchronous, NearSync and Metro) information.
• Nutanix Disaster Recovery Guide - See On-Prem Hardware Resource Requirements
information.