Nutanix
The cluster Redundancy Factor is a cluster-wide setting that determines the number of nodes that participate in metadata operations. Some versions of the Prism UI call it out as the desired Redundancy Factor. This corresponds to the fault tolerance level (FT1 or FT2) inside the AOS cluster, and can be thought of as the number of failures to tolerate, i.e., 1 or 2.
Note that, unlike the container Replication Factor, the cluster fault tolerance / redundancy factor setting has nothing to do with the extent store (guest data).
1.1.1. FT1: -
In FT1, each node maintains its own metadata and copies it to the next two nodes in the Cassandra ring.
In the image below, Node-B holds its own metadata, and copies are also held by Node-C and Node-D.
In the event of a node failure, at least two nodes with Node-B's metadata remain in production and the cluster remains stable.
For example, with 4 nodes and RF2, when one of the nodes fails AOS immediately starts replicating the data to return to a stable RF2 state. Since you have 4 nodes and likely enough resources to keep the workloads active, the system returns to a stable state with only 3 nodes.
This would not have happened if your cluster had only 3 nodes, or if the resources left on the 3 surviving nodes were not enough to keep the workloads fully operational.
1.1.2. FT2: -
In FT2, each node maintains its own metadata and copies it to the next four nodes in the Cassandra ring.
In the image below, Node-1B holds its own metadata, and copies are also held by Node-1C, Node-1D, Node-2B and Node-2A.
In the event of two failures, at least three nodes with Node-1B's metadata remain in production and the cluster remains stable.
1.2.1. With Redundancy Factor 2, the cluster keeps 3 copies of metadata (Cassandra) and Zookeeper configuration data, and 2 copies of guest data.
1.2.2. With Redundancy Factor 3, the cluster keeps 5 copies of metadata (Cassandra) and Zookeeper configuration data, and 3 copies of guest data.
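For quick reference, here is a minimal sketch of that mapping (illustrative Python only, not a Nutanix API; the COPIES table and describe() helper are made up for this example):

```python
# Minimal sketch of the cluster-level redundancy / fault-tolerance mapping
# described above. Purely illustrative; not a Nutanix API.
COPIES = {
    # redundancy factor: metadata/Zookeeper copies, data copies, failures tolerated
    2: {"metadata_copies": 3, "data_copies": 2, "failures_tolerated": 1},  # FT1
    3: {"metadata_copies": 5, "data_copies": 3, "failures_tolerated": 2},  # FT2
}

def describe(rf: int) -> str:
    c = COPIES[rf]
    return (f"RF{rf}: {c['metadata_copies']} metadata/Zookeeper copies, "
            f"{c['data_copies']} data copies, tolerates {c['failures_tolerated']} failure(s)")

for rf in (2, 3):
    print(describe(rf))
```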
Next, let's look at how guest VM and application data is protected and how that affects sizing.
Replication Factor is a storage container (datastore) setting that specifies the number of replica copies of extent (guest) data that are kept within the Acropolis Distributed Storage Fabric.
1.3.1. Container Replication Factor 2: -
A container set to Replication Factor 2 can protect extent data from at least one failure.
Guest data blocks are placed on the local node where the guest is running, and 1 replica copy of each block is distributed throughout the nodes in the cluster.
Since there is 1 replica copy of data, for sizing purposes assume the physical storage requirement is 2x the virtual machine logical usage.
Containers set with Replication Factor RF2 can run on a cluster with the cluster Fault Tolerance set as FT1 or FT2.
Loss of 1 node (even the node that the guest is running on) does not result in a loss of extent/guest data. The HA feature will restart the VM on another node and its data will be made available by the Application Mobility Fabric.
1.3.2. Container Replication Factor 3: -
Since there are 2 replica copies of data, for sizing purposes assume the physical storage requirement is 3x the virtual machine logical usage.
Containers set with Replication Factor RF3 can run on a cluster with the cluster Fault Tolerance set as FT2.
Loss of 2 nodes (even the node that the guest is running on) does not result in a loss of extent/guest data. The HA feature will restart the VM on another node and its data will be made available by the Application Mobility Fabric.
During discovery, talk to your customer about Fault Tolerance and Replication Factor, as these factor into sizing.
FT1 clusters require at least 4 nodes to be self-healing while also allowing full maintenance capabilities during an
N-1 failure.
FT2 clusters require at least 7 nodes to be self-healing while also allowing full maintenance capabilities during an
N-2 failure.
Containers set to Replication Factor 2 should be sized with 2x the logical VM storage.
Containers set to Replication Factor 3 should be sized with 3x the logical VM storage.
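As a rough planning aid, the sizing and minimum-node guidance above can be captured in a small sketch (illustrative only; the helper names and the 40 TiB example figure are made up):

```python
# Illustrative sizing sketch based on the guidance above: physical capacity is
# roughly RF x logical VM storage, and the minimum node counts for self-healing
# clusters are 4 (FT1) and 7 (FT2). Not a Nutanix sizing tool.
def physical_capacity_needed_tib(logical_vm_tib: float, replication_factor: int) -> float:
    """Estimate raw capacity needed for a given container replication factor."""
    return logical_vm_tib * replication_factor

MIN_NODES_SELF_HEALING = {1: 4, 2: 7}  # FT level -> minimum node count

logical = 40.0  # TiB of logical VM storage (example figure)
for rf, ft in ((2, 1), (3, 2)):
    print(f"RF{rf}/FT{ft}: ~{physical_capacity_needed_tib(logical, rf):.0f} TiB raw, "
          f">= {MIN_NODES_SELF_HEALING[ft]} nodes for self-healing")
```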
Together, a group of Nutanix nodes forms a distributed system (Nutanix cluster) responsible for providing the Prism and
AOS capabilities. All services and components are distributed across all CVMs in a cluster to provide for high-availability
and linear performance at scale.
The following figure shows an example of how these Nutanix nodes form a Nutanix cluster:
2.1 Nutanix Cluster - Distributed System
These techniques are applied to metadata and data alike. By ensuring metadata and data are distributed across all nodes and all disk devices, we can ensure the highest possible performance during normal data ingest and re-protection.
This enables our MapReduce framework (Curator) to leverage the full power of the cluster to perform activities concurrently. Sample activities include data re-protection, compression, erasure coding, deduplication, etc.
The Nutanix cluster is designed to accommodate and remediate failure. The system will transparently handle and
remediate the failure, continuing to operate as expected. The user will be alerted, but rather than being a critical time-
sensitive item, any remediation (e.g., replace a failed node) can be done on the admin’s schedule.
If you need to add additional resources to your Nutanix cluster, you can scale out linearly simply by adding new nodes.
With traditional 3-tier architecture, simply adding additional servers will not scale out your storage performance.
However, with a hyperconverged platform like Nutanix, when you scale out with new node(s) you’re scaling out:
The following figure shows how the % of work handled by each node drastically decreases as the cluster scales:
No artificial limits are imposed on the vDisk size on the AOS/stargate side. As of 4.6, the vDisk size is stored as a 64-
bit signed integer that stores the size in bytes. This means the theoretical maximum vDisk size can be 2^63-1 or 9E18
(9 Exabytes). Any limits below this value would be due to limitations on the client side, such as the maximum VMDK size on ESXi.
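A quick arithmetic check of that limit (2^63 - 1 bytes):

```python
# Quick check of the theoretical maximum vDisk size quoted above: a 64-bit
# signed integer holding a byte count tops out at 2**63 - 1.
max_bytes = 2**63 - 1
print(max_bytes)                                           # 9223372036854775807
print(f"~{max_bytes / 1e18:.2f} EB (decimal exabytes)")    # ~9.22 EB
print(f"~{max_bytes / 2**60:.0f} EiB (binary exbibytes)")  # ~8 EiB
```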
The following figure shows how these map between AOS and the hypervisor:
Write IO is deemed sequential when there is more than 1.5MB of outstanding write IO to a vDisk (as of 4.6). IOs meeting this criterion will bypass the OpLog and go directly to the Extent Store since they are already large chunks of aligned data and won't benefit from coalescing.
All other IOs, including those which can be large (e.g., >64K), will still be handled by the OpLog.
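A minimal sketch of that routing rule, assuming only the 1.5MB outstanding-write threshold quoted above (the function and names are illustrative, not AOS internals):

```python
# Minimal sketch of the write-path routing rule described above. The only
# input is the >1.5MB outstanding-write threshold quoted in the text;
# everything else about the real IO path is omitted.
SEQUENTIAL_OUTSTANDING_BYTES = int(1.5 * 1024 * 1024)

def route_write(outstanding_write_bytes: int) -> str:
    """Return where a write to a vDisk would land under this simplified rule."""
    if outstanding_write_bytes > SEQUENTIAL_OUTSTANDING_BYTES:
        return "Extent Store (treated as sequential, bypasses OpLog)"
    return "OpLog (random or small/medium IO, coalesced before draining)"

print(route_write(256 * 1024))       # -> OpLog
print(route_write(4 * 1024 * 1024))  # -> Extent Store
```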
Data is brought into the cache at a 4K granularity and all caching is done in real time (e.g., no delay or batch process is used to pull data into the cache).
Each CVM has its own local cache that it manages for the vDisk(s) it is hosting (e.g., VM(s) running on the same node).
When a vDisk is cloned (e.g., new clones, snapshots, etc.) each new vDisk has its own block map and the original vDisk is
marked as immutable. This allows us to ensure that each CVM can have its own cached copy of the base vDisk with cache
coherency.
In the event of an overwrite, that will be re-directed to a new extent in the VM's own block map. This ensures that there
will not be any cache corruption.
This architecture worked well except for traditional applications and workloads that had VMs with a single large vDisk. These VMs were not able to leverage the capabilities of AOS to the fullest. As of AOS 6.1, the vDisk controller was enhanced so that requests to a single vDisk are now distributed to multiple vDisk controllers by creating shards of the controller, each having its own thread. I/O distribution to the multiple controllers is done by a primary controller, so for external interaction this still looks like a single vDisk. This effectively shards the single vDisk, making it multi-threaded. This enhancement, along with other technologies discussed above such as Blockstore and AES, allows AOS to deliver consistent high performance at scale even for traditional applications that use a single vDisk.
AOS Data Protection
1.16. Availability Domains
Availability Domains are a key construct of the Nutanix Distributed Storage Fabric that gives it the ability to intelligently distribute or stripe metadata, configuration data, and extent/guest data across the disks in a node, nodes in a block, nodes in different blocks, or nodes in different racks.
Availability Domains (aka node/block/rack awareness) are a key construct for distributed systems to abide by when determining component and data placement. Nutanix refers to a “block” as the chassis which contains either one, two, or four server “nodes”, and a “rack” as a physical unit containing one or more “blocks”. NOTE: A minimum of 3 blocks must be utilized for block awareness to be activated; otherwise, node awareness will be used.
Nutanix currently supports the following levels of awareness:
Disk (always)
Node (always)
Block (as of AOS 4.5)
Rack (as of AOS 5.9)
Node Awareness Standard: -
Rack Configuration
Awareness can be broken into a few key focus areas:
Data (The VM data)
Metadata (Cassandra)
Configuration Data (Zookeeper)
1.16.1. Data
With AOS, data replicas will be written to other [blocks/racks] in the cluster to ensure that in the case of a [block/rack]
failure or planned downtime, the data remains available. This is true for both RF2 and RF3 scenarios, as well as in the
case of a [block/rack] failure. An easy comparison would be “node awareness”, where a replica would need to be
replicated to another node which will provide protection in the case of a node failure. Block and rack awareness further
enhances this by providing data availability assurances in the case of [block/rack] outages.
The following figure shows how the replica placement would work in a 3-block deployment:
Block/Rack Aware Replica Placement
In the case of a [block/rack] failure, [block/rack] awareness will be maintained (if possible) and the data will be
replicated to other [blocks/racks] within the cluster:
In summary it is preferred to leverage synchronous replication / metro clustering for the following reasons:
The same end result can be achieved with sync rep / metro clustering, avoiding any risks and keeping
isolated fault domains
If network connectivity goes down between the two locations in a non-supported "stretched" deployment,
one side will go down as quorum must be maintained (e.g., majority side will stay up). In the metro cluster
scenario, both sides can continue to operate independently.
Availability domain placement of data is best effort in skewed scenarios
Additional Latency / reduced network bandwidth between both sites can impact performance in the
"stretched" deployment
The max-node block has 4 nodes, which means the other 3 blocks should have 4x4 == 16 nodes. In this case it WOULD be block aware, as the remaining blocks have 18 nodes, which is above our minimum.
1.16.3.2. Zookeeper
Zookeeper Placement - Block/Rack Failure
When the [block/rack] comes back online, the
Zookeeper role would be transferred back to maintain [block/rack] awareness.
NOTE: Prior to 4.5, this migration was not automatic and had to be done manually.
1.17. Data Path Resiliency
Reliability and resiliency are key, if not the most important concepts within AOS or any primary storage platform.
Contrary to traditional architectures which are built around the idea that hardware will be reliable, Nutanix takes a
different approach: it expects hardware will eventually fail. By doing so, the system is designed to handle these failures
in an elegant and non-disruptive manner.
NOTE: That doesn’t mean the hardware quality isn’t there, just a concept shift. The Nutanix hardware and QA teams
undergo an exhaustive qualification and vetting process.
As mentioned in the prior sections metadata and data are protected using a RF which is based upon the cluster FT level.
As of 5.0 supported FT levels are FT1 and FT2 which correspond to metadata RF3 and data RF2, or metadata RF5 and
data RF3 respectively.
To learn more about how metadata is sharded refer to the prior ‘Scalable Metadata’ section. To learn more about how
data is protected refer to the prior ‘Data protection’ section.
In a normal state, cluster data layout will look similar to the following:
Data Path Resiliency - Normal State
As you can see the VM/vDisk data has 2 or 3 copies on disk which are distributed among the nodes and associated
storage devices.
Importance of Data Distribution
By ensuring metadata and data is distributed across all nodes and all disk devices
we can ensure the highest possible performance during normal data ingest and
re-protection.
As data is ingested into the system its primary and replica copies will be
distributed across the local and all other remote nodes. By doing so we can
eliminate any potential hot spots (e.g., a node or disk performing slowly) and
ensure a consistent write performance.
In the event of a disk or node failure where data must be re-protected, the full
power of the cluster can be used for the rebuild. In this event the scan of
metadata (to find out the data on the failed device(s) and where the replicas
exist) will be distributed evenly across all CVMs. Once the data replicas have been
found all healthy CVMs, disk devices (SSD+HDD), and host network uplinks can be
used concurrently to rebuild the data.
For example, in a 4-node cluster where a disk fails each CVM will handle 25% of
the metadata scan and data rebuild. In a 10-node cluster, each CVM will handle
10% of the metadata scan and data rebuild. In a 50-node cluster, each CVM will
handle 2% of the metadata scan and data rebuild.
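The per-node share is simply 1/N of the work; a trivial sketch reproducing the numbers quoted above:

```python
# Simple illustration of the point above: with N nodes participating in a
# rebuild, each CVM handles roughly 1/N of the metadata scan and data rebuild.
def rebuild_share_percent(node_count: int) -> float:
    return 100.0 / node_count

for nodes in (4, 10, 50):
    print(f"{nodes:>2} nodes -> ~{rebuild_share_percent(nodes):.0f}% of the rebuild per CVM")
```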
Key point: With Nutanix and by ensuring uniform distribution of data we can
ensure consistent write performance and far superior re-protection times. This
also applies to any cluster wide activity (e.g., erasure coding, compression,
deduplication, etc.)
Comparing this to other solutions where HA pairs are used or a single disk holds a
full copy of the data, they will face frontend performance issues if the mirrored
node/disk is under duress (facing heavy IO or resource constraints).
Also, in the event of a failure where data must be re-protected, they will be limited
by a single controller, a single node's disk resources and a single node's network
uplinks. When terabytes of data must be re-replicated this will be severely
constrained by the local node's disk and network bandwidth, increasing the time
the system is in a potential data loss state if another failure occurs.
1.18. Potential levels of failure
Being a distributed system, AOS is built to handle component, service, and CVM failures, which can be characterized on
a few levels:
Disk Failure
CVM “Failure”
Node Failure
When does a rebuild begin?
When there is an unplanned failure (in some cases we will proactively take things offline if they aren't working correctly)
we begin the rebuild process immediately.
Unlike some other vendors which wait 60 minutes to start rebuilding and only maintain a single copy during that period
(very risky and can lead to data loss if there's any sort of failure), we are not willing to take that risk at the sacrifice of
potentially higher storage utilization.
We can do this because of a) the granularity of our metadata, b) the ability to choose peers for write RF dynamically (while there is a failure, all new data, e.g., new writes/overwrites, maintains its configured redundancy), and c) the ability to handle things coming back online during a rebuild and re-admit the data once it has been validated. In this scenario data may be "over-replicated", in which case a Curator scan will kick off and remove the over-replicated copies.
1.18.1. Disk Failure
A disk failure can be characterized as just that, a disk which has either been removed, encounters a failure, or one that is
not responding or has I/O errors. When Stargate sees I/O errors or the device fails to respond within a certain threshold
it will mark the disk offline. Once that has occurred Hades will run S.M.A.R.T. and check the status of the device. If the
tests pass the disk will be marked online, if they fail it will remain offline. If Stargate marks a disk offline multiple times
(currently 3 times in an hour), Hades will stop marking the disk online even if S.M.A.R.T. tests pass.
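A minimal sketch of that "three offline marks within an hour" policy (class and method names are invented for illustration; this is not AOS/Hades code):

```python
# Illustrative sketch (not Hades/Stargate code) of the disk-offline policy
# described above: a disk marked offline 3 times within an hour stays offline
# even if S.M.A.R.T. tests pass; otherwise it may come back online after tests.
import time
from collections import deque

OFFLINE_LIMIT = 3
WINDOW_SECONDS = 3600

class DiskOfflineTracker:
    def __init__(self):
        self.offline_events = deque()  # timestamps of offline markings

    def mark_offline(self, now=None) -> str:
        now = time.time() if now is None else now
        self.offline_events.append(now)
        # Keep only events inside the sliding one-hour window.
        while self.offline_events and now - self.offline_events[0] > WINDOW_SECONDS:
            self.offline_events.popleft()
        if len(self.offline_events) >= OFFLINE_LIMIT:
            return "offline (kept offline even if S.M.A.R.T. passes)"
        return "offline pending S.M.A.R.T. check (back online if tests pass)"
```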
VM impact:
HA event: No
Failed I/Os: No
Latency: No impact
In the event of a disk failure, a Curator scan (MapReduce Framework) will occur immediately. It will scan the metadata
(Cassandra) to find the data previously hosted on the failed disk and the nodes / disks hosting the replicas.
Once it has found the data that needs to be “re-replicated”, it will distribute the replication tasks to the nodes throughout the cluster.
During this process a Drive Self-Test (DST) is started for the bad disk and SMART logs are monitored for errors.
The following figure shows an example disk failure and re-protection:
Data Path Resiliency - Disk Failure
An important thing to highlight here is given how Nutanix distributes data and replicas across all nodes / CVMs / disks;
all nodes / CVMs / disks will participate in the re-replication.
This substantially reduces the time required for re-protection, as the power of the full cluster can be utilized; the larger
the cluster, the faster the re-protection.
1.18.2. Node Failure
VM Impact:
HA event: Yes
Failed I/Os: No
Latency: No impact
In the event of a node failure, a VM HA event will occur restarting the VMs on other nodes throughout the virtualization
cluster. Once restarted, the VMs will continue to perform I/Os as usual which will be handled by their local CVMs.
Similar to the case of a disk failure above, a Curator scan will find the data previously hosted on the node and its
respective replicas. Once the replicas are found all nodes will participate in the reprotection.
Data Path Resiliency - Node Failure
In the event where the node remains down for a prolonged period of time (30 minutes as of 4.6), the down CVM will be
removed from the metadata ring. It will be joined back into the ring after it has been up and stable for a duration of
time.
Pro tip
Data resiliency state will be shown in Prism on the dashboard page.
You can also check the data resiliency state via the CLI (e.g., an ncli node status check).
These should always be up to date, however to refresh the data you can kick off a Curator partial scan.
1.19. Resilient Capacity
Resilient capacity is the storage capacity in a cluster that can be consumed at the lowest availability/failure domain while
maintaining the cluster’s ability to self-heal and recover to desired replication factor (RF) after FT failure(s) at the
configured availability/failure domain. So, in simple terms, Resilient Capacity = Total Cluster Capacity - Capacity needed
to rebuild from FT failure(s).
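A hedged sketch of that formula with made-up numbers (a real cluster also accounts for RF overhead, failure domain, and per-node skew, which this ignores):

```python
# Hedged sketch of the formula above: Resilient Capacity =
# Total Cluster Capacity - Capacity needed to rebuild from FT failure(s).
def resilient_capacity_tib(total_capacity_tib: float, rebuild_capacity_tib: float) -> float:
    return total_capacity_tib - rebuild_capacity_tib

# Example: a 4-node cluster with 25 TiB usable per node, node failure domain,
# FT1 -> losing one node means its 25 TiB must be rebuilt elsewhere.
print(resilient_capacity_tib(4 * 25.0, rebuild_capacity_tib=25.0))  # 75.0
```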
Starting with AOS 5.18, the resilient capacity is displayed in Prism Element storage summary widget with a gray line.
Thresholds can be set to warn end users when cluster usage is reaching resilient capacity. By default, that is set to 75%.
Prism can also show detailed storage utilization on a per-node basis, which helps administrators understand resiliency per node. This is useful in clusters which have a skewed storage distribution.
When cluster usage is greater than the resilient capacity for that cluster, the cluster might not be able to tolerate and recover from failures anymore. The cluster may still be able to recover from and tolerate a failure at a lower failure domain, since resilient capacity is calculated for the configured failure domain. For example, a cluster with a node failure domain may still be able to self-heal and recover from a disk failure but cannot self-heal and recover from a node failure.
It is highly recommended not to exceed the resilient capacity of a cluster under any circumstances, to ensure proper functioning of the cluster and maintain its ability to self-heal and recover from failures.
1.20. Capacity Optimization
The Nutanix platform incorporates a wide range of storage optimization technologies that work in concert to make
efficient use of available capacity for any workload. These technologies are intelligent and adaptive to workload
characteristics, eliminating the need for manual configuration and fine-tuning.
The following optimizations are leveraged:
Erasure Coding (EC-X)
Compression
Deduplication
More detail on how each of these features works can be found in the following sections.
The table describes which optimizations are applicable to workloads at a high-level:
Data Transform | Best suited Application(s) | Comments
Erasure Coding (EC-X) | Most; ideal for Nutanix Files/Objects | Provides higher availability with reduced overheads than traditional RF. No impact to normal write or read I/O performance. Does have some read overhead in the case of a disk/node/block failure where data must be decoded.
Inline Compression | All | No impact to random I/O; helps increase storage tier utilization. Benefits large or sequential I/O performance by reducing the data to replicate and read from disk.
Offline Compression | None | Given inline compression will compress only large or sequential writes inline and handle random or small I/Os post-process, inline compression should be used instead.
Dedupe | Full Clones in VDI, Persistent Desktops, P2V/V2V, Hyper-V (ODX) | Greater overall efficiency for data which wasn't cloned or created using efficient AOS clones.
Erasure Coding pairs perfectly with inline compression which will add to the storage savings.
1.20.2. Compression
For a visual explanation, you can watch the following video: LINK
The Nutanix Capacity Optimization Engine (COE) is responsible for performing data transformations to increase data
efficiency on disk. Currently compression is one of the key features of the COE to perform data optimization. AOS
provides both inline and offline flavors of compression to best suit the customer’s needs and type of data. As of 5.1,
offline compression is enabled by default.
Inline compression will compress sequential streams of data or large I/O sizes (>64K) when written to the Extent Store
(SSD + HDD). This includes data draining from OpLog as well as sequential data skipping it.
OpLog Compression
As of 5.0, the OpLog will now compress all incoming writes >4K that show good compression (Gflag:
vdisk_distributed_oplog_enable_compression). This will allow for a more efficient utilization of the OpLog
capacity and help drive sustained performance.
When drained from OpLog to the Extent Store the data will be decompressed, aligned and then re-
compressed at a 32K aligned unit size (as of 5.1).
This feature is on by default and no user configuration is necessary.
Offline compression will initially write the data as normal (in an un-compressed state) and then leverage the Curator
framework to compress the data cluster wide. When inline compression is enabled but the I/Os are random in nature,
the data will be written un-compressed in the OpLog, coalesced, and then compressed in memory before being written
to the Extent Store.
Nutanix leverages LZ4 and LZ4HC for data compression with AOS 5.0 and beyond. Prior to AOS 5.0, the Google Snappy compression library was leveraged, which provides good compression ratios with minimal computational overhead and extremely fast compression/decompression rates.
Normal data will be compressed using LZ4 which provides a very good blend between compression and performance.
For cold data, LZ4HC will be leveraged to provide an improved compression ratio.
Cold data is characterized into two main categories:
Regular data: No R/W access for 3 days (Gflag: curator_medium_compress_mutable_data_delay_secs)
Immutable data (snapshots): No R/W access for 1 day (Gflag:
curator_medium_compress_immutable_data_delay_secs)
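A small sketch of how that classification maps onto the two algorithms (thresholds mirror the defaults quoted above; the function is illustrative, not a Gflag interface):

```python
# Illustrative sketch of the cold-data classification above (delays mirror the
# quoted defaults; this is not how the Gflags are actually consumed).
REGULAR_DATA_DELAY_SECS = 3 * 24 * 3600     # regular data: no R/W access for 3 days
IMMUTABLE_DATA_DELAY_SECS = 1 * 24 * 3600   # immutable data (snapshots): 1 day

def compression_algorithm(seconds_since_access: int, immutable: bool) -> str:
    threshold = IMMUTABLE_DATA_DELAY_SECS if immutable else REGULAR_DATA_DELAY_SECS
    return "LZ4HC (cold data)" if seconds_since_access >= threshold else "LZ4 (normal data)"

print(compression_algorithm(2 * 24 * 3600, immutable=False))  # LZ4 (normal data)
print(compression_algorithm(2 * 24 * 3600, immutable=True))   # LZ4HC (cold data)
```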
The following figure shows an example of how inline compression interacts with the AOS write I/O path:
Inline Compression I/O Path
Pro tip
Almost always use inline compression (compression delay = 0) as it will only compress larger / sequential writes and not
impact random write performance.
This will also increase the usable size of the SSD tier increasing effective performance and allowing more data to sit in
the SSD tier. Also, for larger or sequential data that is written and compressed inline, the replication for RF will be
shipping the compressed data, further increasing performance since it is sending less data across the wire.
Inline compression also pairs perfectly with erasure coding.
For offline compression, all new write I/O is written in an un-compressed state and follows the normal AOS I/O path.
After the compression delay (configurable) is met, the data is eligible to become compressed. Compression can occur
anywhere in the Extent Store. Offline compression uses the Curator MapReduce framework and all nodes will perform
compression tasks. Compression tasks will be throttled by Chronos.
The following figure shows an example of how offline compression interacts with the AOS write I/O path:
1.20.3. Elastic Dedupe Engine
Pro tip
Use deduplication on your base images (you can manually fingerprint them using vdisk_manipulator) to take
advantage of the unified cache.
Fingerprinting is done during data ingest of data with an I/O size of 64K or greater (initial I/O or when draining from OpLog). AOS then looks at hashes/fingerprints of each 16KB chunk within a 1MB extent, and if it finds duplicates for more than 40% of the chunks, it dedupes the entire extent. This resulted in many deduped chunks with a reference count of 1 (no other duplicates) coming from the remaining ~60% of the extent, which consumed metadata unnecessarily.
With AOS 6.6, the algorithm was further enhanced such that within a 1MB extent, only chunks that have duplicates are marked for deduplication instead of the entire extent, reducing the metadata required. With AOS 6.6, changes were also made to the way dedupe metadata is stored. Before AOS 6.6, dedupe metadata was stored in the top-level vDisk block map, which resulted in dedupe metadata being copied when snapshots were taken. This caused metadata bloat. With AOS 6.6, that metadata is now stored in the extent group ID map, which is a level lower than the vDisk block map. Now when snapshots are taken of the vDisk, the dedupe metadata is not copied, preventing metadata bloat.
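To make the pre-6.6 vs. 6.6 behavior concrete, here is an illustrative sketch of the decision logic (SHA-1 stands in for the real fingerprinting; chunk and extent sizes follow the text; none of this is AOS code):

```python
# Illustrative sketch of the extent-level vs. chunk-level dedupe decision
# described above: 16KB chunks within a 1MB extent, 40% duplicate threshold
# before AOS 6.6, per-chunk marking from 6.6 on. SHA-1 stands in for the
# real fingerprinting; this is not AOS code.
import hashlib

CHUNK_SIZE = 16 * 1024
DUP_THRESHOLD = 0.40

def chunk_fingerprints(extent: bytes) -> list[str]:
    return [hashlib.sha1(extent[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(extent), CHUNK_SIZE)]

def dedupe_candidates(extent: bytes, known_fingerprints: set[str], post_6_6: bool) -> list[int]:
    fps = chunk_fingerprints(extent)
    dup_chunks = [i for i, fp in enumerate(fps) if fp in known_fingerprints]
    if post_6_6:
        return dup_chunks                      # only chunks with duplicates
    if dup_chunks and len(dup_chunks) / len(fps) > DUP_THRESHOLD:
        return list(range(len(fps)))           # pre-6.6: dedupe the whole extent
    return []
```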
Once the fingerprinting is done, a background process will remove the duplicate data using the AOS MapReduce
framework (Curator). For data that is being read, the data will be pulled into the AOS Unified Cache which is a multi-
tier/pool cache. Any subsequent requests for data having the same fingerprint will be pulled directly from the cache. To
learn more about the Unified Cache and pool structure, please refer to the Unified Cache sub-section in the I/O path
overview.
The following figure shows an example of how the Elastic Dedupe Engine interacts with the AOS I/O path:
Data Locality
Thresholds for Data Migration
Cache locality occurs in real time and will be determined based upon vDisk ownership. When a vDisk / VM moves from
one node to another the "ownership" of those vDisk(s) will transfer to the now local CVM. Once the ownership has
transferred the data can be cached locally in the Unified Cache. In the interim the cache will be wherever the
ownership is held (the now remote host). The previously hosting Stargate will relinquish the vDisk token when it sees
remote I/Os for 300+ seconds, at which point it will be taken by the local Stargate. Cache coherence is enforced as
ownership is required to cache the vDisk data.
Egroup locality is a sampled operation and an extent group will be migrated when the following occurs: "3 touches for
random or 10 touches for sequential within a 10-minute window where multiple reads every 10 second sampling
count as a single touch".
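A rough sketch of those two thresholds (illustrative only; the touch-counting here is simplified relative to the sampled behavior described above):

```python
# Rough sketch of the two locality thresholds quoted above (simplified: real
# egroup sampling counts at most one touch per 10-second sample).
VDISK_REMOTE_IO_SECS = 300          # vDisk ownership moves after 300+ s of remote IO
EGROUP_WINDOW_SECS = 600            # 10-minute observation window
TOUCHES_NEEDED = {"random": 3, "sequential": 10}

def should_transfer_vdisk_ownership(remote_io_duration_secs: float) -> bool:
    return remote_io_duration_secs > VDISK_REMOTE_IO_SECS

def should_migrate_egroup(touch_timestamps: list[float], access_pattern: str) -> bool:
    if not touch_timestamps:
        return False
    newest = max(touch_timestamps)
    recent = [t for t in touch_timestamps if newest - t <= EGROUP_WINDOW_SECS]
    return len(recent) >= TOUCHES_NEEDED[access_pattern]

print(should_transfer_vdisk_ownership(450))            # True
print(should_migrate_egroup([0, 120, 300], "random"))  # True (3 touches in 10 min)
```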
Shadow Clones
1.28. Storage Layers and Monitoring
The Nutanix platform monitors storage at multiple layers throughout the stack, ranging from the VM/Guest OS all the
way down to the physical disk devices. Knowing the various tiers and how these relate is important whenever
monitoring the solution and allows you to get full visibility of how the ops relate. The following figure shows the various
layers of where operations are monitored and the relative granularity which are explained below:
Storage Layers
1.28.1. Virtual Machine Layer
Key Role: Metrics reported by the hypervisor for the VM
Description: Virtual Machine or guest level metrics are pulled directly from the hypervisor and represent the
performance the VM is seeing and is indicative of the I/O performance the application is seeing.
When to use: When troubleshooting or looking for VM level detail
1.28.2. Hypervisor Layer
Key Role: Metrics reported by the Hypervisor(s)
Description: Hypervisor level metrics are pulled directly from the hypervisor and represent the most accurate
metrics the hypervisor(s) are seeing. This data can be viewed for one or more hypervisor node(s) or the
aggregate cluster. This layer will provide the most accurate data in terms of what performance the platform is
seeing and should be leveraged in most cases. In certain scenarios the hypervisor may combine or split
operations coming from VMs which can show the difference in metrics reported by the VM and hypervisor.
These numbers will also include cache hits served by the Nutanix CVMs.
When to use: Most common cases as this will provide the most detailed and valuable metrics.
1.28.3. Controller Layer
Key Role: Metrics reported by the Nutanix Controller(s)
Description: Controller level metrics are pulled directly from the Nutanix Controller VMs (e.g., the Stargate 2009 page) and represent what the Nutanix front-end is seeing from NFS/SMB/iSCSI, as well as any back-end operations (e.g., ILM, disk balancing, etc.). This data can be viewed for one or more Controller VM(s) or the aggregate cluster. The metrics seen by the Controller Layer should normally match those seen by the hypervisor layer, however they will include any backend operations (e.g., ILM, disk balancing). These numbers will also include cache hits served by memory. In certain cases, metrics like IOPS might not match, as the NFS/SMB/iSCSI client might split a large IO into multiple smaller IOs. However, metrics like bandwidth should match.
When to use: Similar to the hypervisor layer, can be used to show how much backend operation is taking place.
1.28.4. Disk Layer
Key Role: Metrics reported by the Disk Device(s)
Description: Disk level metrics are pulled directly from the physical disk devices (via the CVM) and represent
what the back-end is seeing. This includes data hitting the OpLog or Extent Store where an I/O is performed on
the disk. This data can be viewed for one or more disk(s), the disk(s) for a particular node, or the aggregate disks
in the cluster. In common cases, it is expected that the disk ops should match the number of incoming writes as
well as reads not served from the memory portion of the cache. Any reads being served by the memory portion
of the cache will not be counted here as the op is not hitting the disk device.
When to use: When looking to see how many ops are served from cache or hitting the disks.
Metric and Stat Retention
Metrics and time series data is stored locally for 90 days in Prism Element. For Prism Central and Insights, data can be
stored indefinitely (assuming capacity is available).