ECS 3.6.1 Monitoring Guide Rev1.1
Monitoring Guide
3.6.1
May 2021
Rev. 1.1
Notes, cautions, and warnings
NOTE: A NOTE indicates important information that helps you make better use of your product.
CAUTION: A CAUTION indicates either potential damage to hardware or loss of data and tells you how to avoid
the problem.
WARNING: A WARNING indicates a potential for property damage, personal injury, or death.
© 2021 Dell Inc. or its subsidiaries. All rights reserved. Dell, EMC, and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other
trademarks may be trademarks of their respective owners.
1
Monitor Basics
Monitor basics provides critical information about viewing the ECS portal dashboard and using the monitoring pages.
Topics:
• View the ECS Portal Dashboard
• Using monitoring pages
View requests
The Requests panel displays the total requests, successful requests, and failed requests.
Failed requests are organized by system error and user error. User failures are typically HTTP 400 errors. System failures are
typically HTTP 500 errors. Click Requests to see more request metrics.
Request statistics do not include replication traffic.
NOTE: In partial upgrade scenarios (for example, during a 3.4 to 3.6 upgrade), nodes on 3.4 pull data from the dashboard API, whereas nodes upgraded to 3.6 pull data from the flux API. This can result in inconsistent data display.
View capacity utilization
The Capacity Utilization panel displays the total, used, available, reserved, and percent full capacity.
NOTE: When the storage pool reaches 90% of its total capacity, it does not accept write requests and becomes a read-only system. A storage pool must have a minimum of four nodes and must have three or more nodes with more than 10% free capacity in order to allow writes. This reserved space is required to ensure that ECS does not run out of space while persisting system metadata. If these criteria are not met, the write fails. The ability of a storage pool to accept writes does not affect the ability of other pools to accept writes. For example, if you have a load balancer that detects a failed write, the load balancer can redirect the write to another VDC.
Capacity amounts are shown in gibibytes (GiB) and tebibytes (TiB). One GiB is approximately equal to 1.074 gigabytes (GB). One TiB is approximately equal to 1.1 terabytes (TB).
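These conversion factors can be verified with a short sketch. The constants follow the standard binary-prefix definitions; the function names are illustrative, not part of ECS:

```python
# Convert between binary (GiB/TiB) and decimal (GB/TB) capacity units,
# as used on the Capacity Utilization panel.

GIB = 2**30          # 1 GiB = 1,073,741,824 bytes
TIB = 2**40          # 1 TiB = 1,099,511,627,776 bytes
GB = 10**9
TB = 10**12

def gib_to_gb(gib: float) -> float:
    return gib * GIB / GB

def tib_to_tb(tib: float) -> float:
    return tib * TIB / TB

print(round(gib_to_gb(1), 3))   # 1.074 GB per GiB
print(round(tib_to_tb(1), 3))   # 1.1 TB per TiB
```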
The Used capacity indicates the amount of capacity that is in use. Click Capacity Utilization to see more capacity metrics.
The capacity metrics are available in the left menu.
View performance
The Performance panel displays how network read and write operations are currently performing, and the average read/write
performance statistics over the last 24 hours for the VDC.
Click Performance to see more comprehensive performance metrics.
NOTE:
● An SSD Cache Enabled label appears if the feature is enabled on the node. If Read Cache is disabled or the nodes do not have SSD disks, the SSD Cache Enabled label does not appear.
● In partial upgrade scenarios (for example, during a 3.4 to 3.6 upgrade), nodes on 3.4 pull data from the dashboard API, whereas nodes upgraded to 3.6 pull data from the flux API. This can result in inconsistent data display.
NOTE:
● If the data from failed disks has already been recovered and the failed disks are ready for replacement, they do not appear in the Node & Data Disks panel. Click Manage Disks under System Health to go to Maintenance, which indicates whether there are disks that are ready for physical replacement. Alternatively, access Maintenance using the left panel menu: Manage > Maintenance.
● The maximum number of connections per node is 1000.
View alerts
The Alerts panel displays a count of critical alerts and errors.
Click Alerts to see the full list of current alerts. Any Critical or Error alerts are linked to the Alerts tab on the Events page
where only the alerts with a severity of Critical or Error are filtered and displayed.
NOTE: Alerts can also be filtered by the Info and Warning severities.
View audits
Audits can be filtered only by date-time range and namespace.
Table navigation
Highlighted text in a table row indicates a link to a detail display. Selecting the link drills down to the next level of detail. On
drill-down displays, a path string shows your current location in the sequence of drill-down displays. This path string is called a
breadcrumb trail or breadcrumbs for short. Selecting any highlighted breadcrumb jumps up to the associated display.
On some monitoring displays, you can force a table to refresh with the latest data by clicking the Refresh icon.
Figure 3. Open Filter panel with date and time range selections
When the table has the Current filter applied, the latest values are displayed. When the table has a date-time range filter
applied, it displays the average value over that period.
History
When you select a History button, all available charts for that row are displayed below the table. You can hover over a chart
from left to right to see a vertical line that helps you find a specific date-time point on the chart. A pop-up display shows the
value and timestamp for that point.
The date-time scale is determined by the filter setting that has been configured. When the Current filter is selected, the charts
show data from the last 24 hours. History data is kept for 60 days.
In the history charts, when the Current filter is selected, if there is no available historical data, No Data displays.
Export icon
The Export icon enables you to export data from all the monitoring tables and graphs to PDF, DOC, XLS, and CSV formats for later consumption. To select the format and export the data, use the Export icon in the upper right of the menu bar on each table and graph.
The exported data can be used to get a longer term view on capacity usage and consumption trends.
2
Monitor Metering
Monitor metering provides critical information about viewing and using the monitoring pages in the ECS portal dashboard.
Topics:
• Monitor metering data
• Monitor capacity utilization
• Monitor system health
• Monitor transactions
• Monitor recovery status
• Monitor disk bandwidth
• Introduction to geo-replication monitoring
• Cloud hosted VDC monitoring
Steps
1. In the ECS Portal, select Monitor > Metering.
2. From the Date Time Range menu, select the period for which you want to see the metering data. Select Current to view
the current metering data. Select Custom to specify a custom date-time range.
Metering is not a real-time reporting activity but is performed as a background process and some delay in reporting can
occur. The longest delay is about 15 minutes. However, where the system is under heavy load, or is unstable, longer delays
can be seen. If you are encountering longer delays, contact ECS Customer Support.
If you select Custom, use the From and To calendars to choose the time period for which data will be displayed.
Metering data is kept for 30 days.
NOTE: The Current filter displays the latest available values. A date-time range filter displays average values over the
specified range.
3. Select the namespace for which you want to display metering data. To narrow the list of namespaces, type the first few
letters of the target namespace and click the magnifying glass icon.
If you are a Namespace Administrator, you will only be able to select your namespace.
4. Click the + icon next to each namespace you want to see object data for.
5. To see the data for a particular bucket, click the + icon next to each bucket for which you want to see data.
To narrow the list of buckets, type the first few letters of the target bucket and click the magnifying glass icon.
If you do not specify a bucket, the object metering data will be the totals for all buckets in the namespace.
6. Click Apply to display the metering data for the selected namespace and bucket for the specified time period.
NOTE: While all buckets in a geo-federation can be selected in metering, if a selected bucket is not associated in a
replication group to which the VDC that you are logged into belongs, metering information cannot be retrieved for that
bucket. In this case, after a wait, the bucket is listed as No data. To get the metering information for the bucket, log in
to the VDC that owns the bucket or any VDC that is part of the replication group to which the bucket belongs.
Depending on the Date Time Range selected, the attributes that are displayed on the Metering page may change.
If the Current option is selected, only the Namespace, Buckets, Bucket Tags, Total MPU Parts, Total MPU Size, Total Size, Object Count, and Last Updated attributes are displayed in the table. If Custom or any other time range is chosen, the Namespace, Buckets, Bucket Tags, Total MPU Parts, Total MPU Size, Total Size, Object Count, Objects Created, Objects Deleted, Write Traffic, and Read Traffic attributes are displayed in the table, and the Last Updated attribute is not displayed.
Metering data
Object metering data for a specified namespace, or a specified bucket within a namespace, can be obtained for a defined time
period at the ECS Portal Monitor > Metering page.
The metering information that is provided is shown in the following table:
NOTE: When you perform an update operation on an object, the metering services report the overwrite as both Objects Created and Objects Deleted. The deleted object is reported because of the expected OVERWRITE behavior of an object; however, no object is actually deleted.
NOTE:
● Metering is not a real-time reporting activity but is performed as a background process and some delay in reporting can
occur. The longest delay is about 15 minutes. However, where the system is under heavy load, or is unstable, longer
delays can be seen. If you are encountering longer delays, contact ECS Customer Support.
● To reflect the statistics for S3 and fan-out objects, a delay of 2 hours 15 min can occur.
NOTE: When there are many concurrent requests, ECS metering can ignore some requests so that they do not impact system performance. Hence, the Write Traffic value can show less than the actual write bandwidth.
Read-only system
When the storage pool reaches 90% of its total capacity, it does not accept write requests and becomes a read-only system. A storage pool must have a minimum of four nodes and must have three or more nodes with more than 10% free capacity in order to allow writes. This reserved space is required to ensure that ECS does not run out of space while persisting system metadata. If these criteria are not met, the write fails. The ability of a storage pool to accept writes does not affect the ability of other pools to accept writes. For example, if you have a load balancer that detects a failed write, the load balancer can redirect the write to another VDC.
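The write-eligibility rules above can be expressed as a small sketch. This illustrates only the criteria documented here (pool under 90% full, at least four nodes, at least three nodes with more than 10% free); the function and its parameters are hypothetical, not ECS's internal admission logic:

```python
# Sketch of the documented write-eligibility rules for a storage pool.
# Illustrative only: parameter names are hypothetical, not an ECS API.

def pool_accepts_writes(used_pct: float, node_free_pcts: list[float]) -> bool:
    if used_pct >= 90.0:                 # pool becomes read-only at 90% full
        return False
    if len(node_free_pcts) < 4:          # minimum of four nodes
        return False
    nodes_with_headroom = sum(1 for free in node_free_pcts if free > 10.0)
    return nodes_with_headroom >= 3      # three or more nodes with >10% free

print(pool_accepts_writes(85.0, [20.0, 15.0, 12.0, 5.0]))   # True
print(pool_accepts_writes(91.0, [20.0, 15.0, 12.0, 11.0]))  # False
```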
Capacity forecast
You can use the Capacity tab to monitor when the capacity is expected to reach 50% and 80%. The capacity forecast is based on the current usage pattern, shown as 1-day, 7-day, and 30-day usage trends. Capacity Forecast data is shown for the entire VDC, for an individual storage pool, or for individual nodes.
NOTE: A capacity ETA shown as N/A can be due to the following reasons:
1. There is not enough historical data for a forecast. At least two data points (1 hour apart) are required. This can happen when the ECS system is newly deployed. Click the History button at the VDC, storage pool, or node level to verify.
2. If the capacity has already passed the intended target, the ETA is set to 0.
3. The used capacity shows a downward trend for the specified time (for example, 7 days). Click the History button or get the history through the dashboard API to verify.
To see the capacity forecast data from the ECS Portal, select Monitor > Capacity Utilization > Capacity. The Capacity tab is the default.
To see the data about total capacity, used capacity, and available capacity, click History.
Capacity Forecast is calculated based on the total capacity and used capacity.
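One plausible way such an ETA could be derived is a linear extrapolation between history data points, mirroring the N/A rules listed above. The actual ECS forecast algorithm is not described here, so this is only an illustrative sketch with hypothetical names:

```python
# Illustrative linear extrapolation of a capacity ETA, mirroring the
# documented N/A rules: fewer than two data points, target already
# passed (ETA 0), or a downward trend yield no finite ETA.
# This is a sketch, not the actual ECS forecast algorithm.

def capacity_eta_hours(samples: list[tuple[float, float]], target_pct: float):
    """samples: (hours_elapsed, used_pct) pairs, oldest first."""
    if len(samples) < 2:
        return None                      # not enough historical data -> N/A
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if u1 >= target_pct:
        return 0.0                       # capacity passed the target -> ETA 0
    rate = (u1 - u0) / (t1 - t0)         # percent used per hour
    if rate <= 0:
        return None                      # downward or flat trend -> N/A
    return (target_pct - u1) / rate

print(capacity_eta_hours([(0, 40.0), (24, 46.0)], 50.0))  # 16.0 hours to 50%
print(capacity_eta_hours([(0, 60.0)], 50.0))              # None (one data point)
```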
Monitor capacity
You can use the Capacity tab to view capacity utilization data for:
● VDC (VDC capacity utilization)
● Storage Pools (Storage pool capacity utilization)
● Nodes (Node capacity utilization)
● Disks (Disk capacity utilization)
● Used Capacity (Monitor used capacity)
You can view summary storage usage data about total, used, available, and reserved storage capacity for storage pools and
nodes.
Reserved capacity is the approximately 10 percent of the total capacity that is reserved for failure handling and for performing
erasure encoding or XOR operations. Reserved capacity is not available for writing new data.
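The relationship between total, used, and reserved capacity can be sketched as follows. The 10% figure is approximate per the text above, and the formula is illustrative rather than the exact accounting ECS performs:

```python
# Sketch of how reserved capacity reduces the space available for new
# writes: roughly 10% of total capacity is held back for failure
# handling and erasure encoding or XOR operations. Illustrative only.

RESERVED_FRACTION = 0.10   # approximate, per the documentation

def available_for_writes(total_gib: float, used_gib: float) -> float:
    reserved = total_gib * RESERVED_FRACTION
    return max(total_gib - used_gib - reserved, 0.0)

print(available_for_writes(1000.0, 400.0))  # 500.0 GiB of usable headroom
```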
The tab opens with the Storage Pools capacity table displayed. To view capacity data for individual nodes, click the appropriate
link in the Nodes (Online) column to display the Nodes table. Click the appropriate link in the Disks (Online) column to view
capacity data for individual disks.
You can display average values over a selected date-time range or over a custom time range using the Filter drop-down menu.
The Current filter displays the latest available values and is the default filter value.
When the table has the Date Time Range filter set to Current (the default setting), the table displays the latest values and
the history graphs display values over the last 24-hour period. When the table has a Date Time Range filter applied (other than
Current), it displays the average value over that period.
● Actions: History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.
Storage pool capacity utilization
Table 3. Capacity utilization: storage pool
● Storage Pool: Name of the storage pool.
● Nodes (Online): Number of nodes in the storage pool followed by the number of those nodes online. Click this number to open: Node capacity utilization.
● Online Nodes with Sufficient Disk Space: Number of online nodes that have sufficient disk space to accept new data. If too many disks are too full to accept new data, the performance of the system may be impacted. NOTE: Does not appear if a filter other than Current is applied.
● Disks (Online): Number of disks in the storage pool followed by the number of those disks that are online.
● Total: Total capacity of the storage pool that is online. This is the total of the capacity that is already used and the capacity still free for allocation.
● Used: Used online capacity in the storage pool.
● Available (Reserved): Online capacity available for use, including the approximately 10% of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations. NOTE: If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.
● Actions: History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.
● Online Status: Indicates whether the node is online or offline. A check mark indicates that the node status is Good.
Table 4. Capacity utilization: node (continued)
● Actions: History provides a graphic display of the data. If the Current filter (default) is selected, the History button displays total, used, and available capacity for the last 24 hours. History data is kept for 60 days.
Storage usage is shown as color-coded bars, one color for the current VDC, and a different color for its storage pools. Tool tips
for each colored bar correspond to the status information in the numeric status line.
● Garbage Detected: View summary garbage collection data.
● Capacity Reclaimed: View data about storage capacity reclaimed by the garbage collection process.
Garbage Detected
Click the Virtual Data Center drop-down menu to view garbage detection data for the entire VDC or individual storage pools.
Capacity Reclaimed
Click the Filter button to set a filter for the reclamation data by VDC or storage pool over a date/time range.
Table 9. Erasure encoding metrics (continued)
● Coding Rate: The rate at which any current data waiting for erasure encoding is being processed.
● Est. Time to Complete: The estimated completion time extrapolated from the current erasure encoding rate.
● Actions: History provides a graphic display of the total coding data, total coded data, percent of data coded, and coding rate per second. History data is kept for 60 days. If the Current filter is selected, History displays the default history for the last 24 hours.
Monitor system health
You can monitor system health from the ECS Portal Monitor > System Health page.
The System Health page has the following tabs:
● Hardware Health: View data about the status of nodes and disks.
● Process Health: View data about the status of the NIC, CPU, and memory.
● Node Rebalancing: View data about the status of node rebalancing operations.
Steps
1. Select Monitor > System Health and select the Hardware Health tab.
By default the Offline Nodes subtab displays. This table may be empty if all nodes are online. Similarly, the Offline Data
Disks subtab may be empty if all disks are online.
2. Select the Offline Nodes and Offline Data Disks subtabs to view a summary.
3. Select the All Nodes and Data Disks subtab to drill down to nodes and disks.
4. Click the node name to drill down to its disk health page.
NOTE: The Slot Info value always matches the physical slot ID in ECS U-Series, C-Series, and D-Series Appliances.
This makes Slot Info useful for quickly locating a disk during disk replacement service. Some Certified Hardware
installations with ECS Software may not report useful or reliable data for Slot Info.
NOTE: Monitor the health of online and offline storage pool nodes and data disks. All data disks that belong to the
selected node are listed here. SSD Read Caches are not included.
Monitor process health
You can use the Process Health tab to obtain metrics that can help assess the health of the VDC, node, or node process.
NOTE: When you click Process Health, the Process Health - Overview dashboard opens in a new Grafana window.
The Process Health dashboards can also be accessed from Advanced Monitoring > expand Data Access Performance - Overview
● > Process Health - by Nodes
● > Process Health - Overview
● > Process Health - Process List by Node.
Table 12. ECS processes (continued)
● Head Service (headsvc): Manages object head protocols: S3, OpenStack Swift, EMC Atmos, CAS, and HDFS.
● Metering (metering): Manages the following tables: Metering Aggregate (MA) and Metering Raw (MR).
● Object Control Service (objcontrolsvc): Provides REST APIs for configuring the ECS cluster, managing ECS resources, and monitoring the system.
● Provision Service (provisionsvc): Manages the provisioning of storage resources and user access. It handles user management, authorization, and authentication for all provisioning requests, resource management, and multi-tenancy support.
● Resource Service (resourcesvc): Manages the Resource Table (RT), which handles replication groups, buckets, users, namespace information, and so on.
● Record Manager (rm): Manages the PR (Partition Record) table (journal region).
● Storage Service Manager (ssm): Manages the Storage Space (SS) tables, which contain disk block usage and disk-to-chunk mapping. Interacts with one or more Storage Servers and manages the active/free chunks on the corresponding servers. Directs I/O operations to the disks.
● Statistics Service (statsvc): Tracks various information on storage processes. These statistics can be used to monitor the system.
● VNest (vnest): Provides distributed synchronization and group services. A subset of data nodes will be group members responsible for serving the key/value requests. VNest services running on other nodes will listen for configuration updates and be ready to be added to the group.
See Advanced Monitoring, Process Health - by Nodes, Process Health - Overview, and Process Health - Process List by Node for details.
Prerequisites
Access the Node Rebalancing tab from the ECS Portal at Monitor > System Health > Node Rebalancing.
NOTE: When you click Node Rebalancing, the Node Rebalancing dashboard opens in a new Grafana window.
The Node Rebalancing dashboard can also be accessed from Advanced Monitoring > expand Data Access Performance -
Overview > Node Rebalancing.
See Advanced Monitoring and Node Rebalancing for details.
Monitor transactions
You can monitor requests and network performance for VDCs and nodes from the Monitor > Transactions page.
Access the Transactions tab from the ECS Portal at Monitor > Transactions.
NOTE: When you click Transactions, the Data Access Performance - Overview dashboard opens in a new Grafana window.
The Transactions data can also be accessed from Advanced Monitoring > Data Access Performance - Overview.
See Advanced Monitoring and Data Access Performance - Overview for details.
NOTE: When you click Recovery Status, the Recovery Status dashboard opens in a new Grafana window.
The Recovery Status dashboard can also be accessed from Advanced Monitoring > expand Data Access Performance -
Overview > Recovery Status.
See Advanced Monitoring for details.
NOTE: When you click Disk Bandwidth, the Disk Bandwidth - Overview dashboard opens in a new Grafana window.
Disk Bandwidth dashboards can also be accessed from Advanced Monitoring > expand Data Access Performance -
Overview
● > Disk Bandwidth - by Nodes
● > Disk Bandwidth - Overview.
See Advanced Monitoring, Disk Bandwidth - by Nodes and Disk Bandwidth - Overview for details.
Table 13. Rate and Chunks columns
● Replication Group: Lists the replication groups of which this VDC is a member. Click a replication group to see a table of remote VDCs in the replication group and their statistics. Click the Replication Groups link above the table to return to the default view.
● Write Traffic: The current rate of writes to all remote VDCs or an individual remote VDC in the replication group.
● Read Traffic: The current rate of reads to all remote VDCs or an individual remote VDC in the replication group.
● User Data Pending Replication: The total logical size of user data waiting for replication for the replication group or remote VDC.
● Metadata Pending Replication: The total logical size of metadata waiting for replication for the replication group or remote VDC.
● Data Pending XOR: The total logical size of all data waiting to be processed by the XOR compression algorithm in the local VDC for the replication group or remote VDC.
Table 15. Failover columns (continued)
● Data Pending XOR Decoding: Shows the count and total logical size of chunks waiting to be retrieved by the XOR compression scheme.
● Failover State: One of the following:
● BLIND_REPLAY_DONE
● REPLICATION_CHECK_DONE: The process that makes sure that all replication chunks are in an acceptable state and replication has completed successfully.
● CONSISTENCY_CHECK_DONE: The process that makes sure that all system metadata is fully consistent with other replicated data has completed successfully.
● ZONE_SYNC_DONE: The synchronization of the failed VDC has completed successfully.
● ZONE_BOOTSTRAP_DONE: The bootstrap process on the failed VDC has completed successfully.
● ZONE_FAILOVER_DONE: The failover process has completed successfully.
● Failover Progress: A percentage indicator for the overall status of the failover process.
The Cloud menu is not shown if the ECS system uses only on-premise sites.
Cloud topology
You can use the Cloud topology summary information to see how the ECS system is making use of hosted VDCs.
The Cloud > Topology page shows the hosted VDCs that are part of an ECS federated system, and shows the relationship
between the hosted VDC and any on-premise VDCs.
Replication Groups
The Replication Groups tab shows each replication group and provides traffic data for a VDC for each replication group that
it contributes to. A VDC might have a storage pool that is in more than one replication group, and this display allows you to see
the traffic associated with each replication group.
3
Monitoring Events: Audits and Alerts
Monitor events provides critical information about the available event monitoring messages (audit and alert) in the ECS Portal.
Topics:
• About event monitoring
• Monitor audit data
• Audit messages
• Monitor alerts
• Alert policy
• Acknowledge all alerts
• Alert messages
Steps
1. Select the Audit tab.
2. Optionally, select Filter.
3. Specify a Date Time Range and adjust the From and To fields and time fields. When creating a custom date-time range,
select Current Time to use the current date and time as the end of your range.
4. Select a Namespace.
5. Click Apply.
NOTE: The newest audit messages appear at the top of the table.
Steps
1. Select Alerts.
2. Optionally, click Filter.
3. Select your filters. The alerts filter adds filtering by Severity and Type, and an option to Show Acknowledged Alerts,
which retains the display of an alert even after it is acknowledged by the user. When creating a custom date-time range,
select Current Time to use the current date and time as the end of your range.
Alert types must be entered exactly as described in the following table:
System alert policies:
● System alert policies are precreated and exist in ECS during deployment.
● All the metrics have an associated system alert policy.
● System alert policies cannot be updated or deleted.
● System alert policies can be enabled/disabled.
● Alert is sent to the UI and all channels (SNMP, SYSLOG, and Secure Remote Services).
User-defined alert policies:
● You can create user-defined alert policies for the required metrics.
● Alert is sent to the UI and customer channels (SNMP and SYSLOG).
Steps
1. Select New Alert Policy.
2. Enter a unique policy name.
3. Use the metric type drop-down menu to select a metric type.
Metric Type is a grouping of statistics. It consists of:
● Btree Statistics
● CAS GC Statistics
● Geo Replication Statistics
● Metering Statistics
● Garbage Collection Statistics
● EKM
4. Use the metric name drop-down menu to select a metric name.
5. Select level.
a. To inspect metrics at the node level, select Node.
b. To inspect metrics at the VDC level, select VDC.
6. Select polling interval.
Polling Interval determines how frequently data is checked. Each polling interval yields one data point, which is compared against the specified condition; when the condition is met, an alert is triggered.
7. Select instances.
Instances specifies how many data points to check and how many of them must match the specified conditions to trigger an alert. For metrics where historical data is not available, only the latest data is used.
8. Select conditions.
You can set the threshold values and alert type with Conditions.
The alerts can be either a Warning Alert, Error Alert, or Critical Alert.
9. To add more conditions with multiple thresholds and with different alert levels, select Add Condition.
10. Click Save.
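The interaction of polling interval, instances, and conditions described in the steps above can be sketched as follows. This is an illustration of the documented behavior only; the function, its parameters, and the evaluation logic are hypothetical, not the actual ECS implementation:

```python
# Sketch of how an alert policy's "instances" setting might evaluate
# polled data points: of the last `check` values (one per polling
# interval), at least `match` must breach the threshold to trigger.
# Names and logic are illustrative, not an ECS API.

def alert_triggered(datapoints: list[float], threshold: float,
                    check: int, match: int) -> bool:
    recent = datapoints[-check:]                 # last `check` polling intervals
    breaches = sum(1 for v in recent if v > threshold)
    return breaches >= match

cpu_samples = [55.0, 82.0, 91.0, 88.0, 93.0]     # one value per polling interval
# Warning condition: CPU > 80% in at least 3 of the last 4 data points.
print(alert_triggered(cpu_samples, 80.0, check=4, match=3))  # True
```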
Steps
1. To acknowledge all alerts, click the Acknowledge All Alerts button.
a. To acknowledge a subset of all alerts, use the table filter to filter by a combination of date and time, severity, type, or
namespace, and then click Acknowledge All Alerts.
The bulk alert acknowledgment process runs in the background and may take a few minutes to complete. Only one bulk alert
acknowledgment can be processed at a time.
2. On the confirmation pop-up screen, click OK to initiate acknowledgment, or click Cancel to exit without acknowledging.
Clicking Acknowledge All Alerts initiates a background task to acknowledge all the matching alerts. The response shows either that the task was successfully initiated or that it failed.
To keep a record of the acknowledge all alerts request, a new informational alert of type Bulk Alert Ack will be generated
after the acknowledgment completes. Clear the filter and manually refresh the table.
Alert messages
List of the alert messages that ECS uses.
Alert message Severity labels have the following meanings:
● Critical: Messages about conditions that require immediate attention
● Error: Messages about error conditions that report either a physical failure or a software failure
● Warning: Messages about less than optimal conditions
● Info: Routine status messages
Capacity Error 997 Portal, API, Licensed Capacity The capacity of the Contact ECS
license Secure Entitlement system is greater than Remote Support
threshold Remote Exceeded Event was licensed.
Services, Trap,
Syslog
Chunk not Error 1004 Portal, API, chunkId {chunkId} Contact ECS
found Secure not found Remote Support
Remote
Services,
SNMP Trap,
Syslog
CPU Usage Warning 4001 Portal, API, CPU usage is $ If CPU usage percent Contact ECS
Percent SNMP Trap, {inspectorValue}% crosses the threshold Remote Support
Error 4002 Syslog crosses threshold $ specified then the alert
{thresholdValue}% is triggered.
Critical 4003
Data Error 1500 Portal, ESRS, Data Migration has Data migration has no Contact ECS
Migration SNMP Trap, no movement for $ progress for several Remote Support
Blocked Syslog, SMTP {configured} hours hours.
for a device and
level (default 6
hours).
NOTE: Ignore the severity as Warning, for the Data Migration Finished alert. The severity is supposed to be Info.
Data Warning 1501 Portal, ESRS, Data Migration is Data migration is Contact ECS
Migration SNMP Trap, complete for a complete. Remote Support
Finished Syslog, SMTP device and level.
Disabled CAS Info 1316 Portal, API, CAS Processing is ● CAS GC is Content Contact ECS
GC Secure paused. Addressable Storage Remote Support
Warning 1317 Remote Garbage Collection. representative
Services, ● CAS GC is disabled. to determine
Error 1318
Last Byte Warning 4013 Portal, API, Last Byte Latency If TTLB for write latency Contact ECS
Latency For SNMP Trap, for Write is $ crosses the threshold Remote Support
Write Error 4014 Syslog {inspectorValue}ms specified then the alert
is triggered.
Read latency is
1050 millisecond,
crosses threshold
1000 millisecond.
Write latency is
1500 millisecond,
crosses threshold
1000 millisecond.
Slow CAS Info 1308 Portal, API, CAS Processing CAS GC reference Contact ECS
GC Secure reference collection collection tasks are Remote Support
Reference Warning 1309 Remote speed is slow. lagging.
Collection Services,
Error 1310 SNMP, Trap,
Critical 1311 Syslog
Slow Journal Info 1304 Portal, API, Journal parsing Journal parsing speed is Contact ECS
Parsing Secure speed is slow. slow. Remote Support
Warning 1307 Remote
Services,
Error SNMP, Trap,
Critical Syslog
Alert: Space Usage Percent
Severity/ID: Warning 4005, Error 4006, Critical 4007
Channels: Portal, API, SNMP Trap, Syslog
Message: Disk space usage is ${inspectorValue}%, crosses threshold ${thresholdValue}%
Description: If the disk usage percent crosses the specified threshold, then the alert is triggered.
Action: Contact ECS Remote Support.
Alert: SSD Read Cache Capacity Failure
Severity/ID: Error 1392
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: SSD read cache capacity auto clean up failed >= ${inspectorValue} times on node ${node} after SSD capacity exceeded threshold for ${resourceName} process. Result: SSD Read cache on node ${node} is disabled.
Description: SSD read cache falls back to the memory cache after cleanup fails when capacity is full.
Action: Contact ECS Remote Support.
Alert: Low Life Remaining
Severity/ID: Info 2064
Channels: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Node SN={node SN}: Disk SN=${disk sn} in rack={rack}, node={fqdn}, slot={slot number} has life remaining below threshold {threshold level}. Disk Details: Type={SSD/NVMe}, Model={vendor model}, Size={disk size} GB, Firmware=${firmware version}
Description: NVMe SSD endurance level is less than the threshold level.
Alert: NVME_BAD_MEMORY_ERROR
Severity/ID: Error 1389
Channels: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: No memory to allocate to buffer for nvmeengine, node={publicip}, failedCount={count}
Description: Memory pool initiation failed, or memory allocation for a read failed.
Alert: NVME_DEVICE_INIT_FAILED_ERROR
Severity/ID: Error 1390
Channels: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: NVMe partition {uuid} initialization failed on target node
Description: NVMe device initiation failed.
Alert: Docker container paused
Severity/ID: Warning 2017
Channels: Portal, API, SNMP Trap, Syslog
Message: Container {containerName} has paused on node {fqdn}.
Description: Container paused.
Action: Contact ECS Remote Support.
Alert: Docker container running
Severity/ID: Info 2016
Channels: Portal, API, SNMP Trap, Syslog
Message: Container {containerName} is up on node {fqdn}.
Description: Container moved to running state.
Alert: Docker container stopped
Severity/ID: Error 2015
Channels: Portal, API, SNMP Trap, Syslog
Message: Container {containerName} has stopped on node {fqdn}.
Description: Container stopped.
Action: Contact ECS Remote Support.
Alert: Events cannot be delivered
Severity/ID: Error 2038
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Events cannot be delivered through {SMTP|ESRS} and are lost.
Action: Verify the configuration of the channel for which the alert is raised.
Alert: Firewall health is BAD or SUSPECT
Severity/ID: Bad 2051, Suspect 2052
Channels: Portal, API, Secure Remote Services, SNMP Trap, Syslog
Message: Firewall health is BAD! {reason} / Firewall health is SUSPECT! {reason}
Description: BAD: rules or IP sets do not exist, the system firewall is off, or the iptables or ipset utilities do not exist. SUSPECT: rules or IP sets do not exist; trying to recover.
Action: Contact ECS Remote Support.
Alert: Fabric agent suspect
Severity/ID: Error 2014
Channels: Portal, API, SNMP Trap, Syslog
Message: FabricAgent has suspected on node {fqdn}.
Description: Fabric agent health is suspect.
Alert: Net interface health down
Severity/ID: Critical 2023
Channels: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Net interface {$netInterfaceName}[ on node $FQDN] is down[ with IP address $IP].
Description: Fabric's net interface is down.
Action: Contact ECS Remote Support.
Alert: Net interface health up
Severity/ID: Info 2024
Channels: Portal, API, SNMP Trap, Syslog, Secure Remote Services
Message: Net interface {$netInterfaceName}[ on node $FQDN] is up[ with IP address $IP].
Description: Fabric's net interface is up.
Alert: Net interface permanent down
Severity/ID: Critical 2026
Channels: Portal, API, Secure Remote Services
Message: Net interface {$netInterfaceName}[ on node $FQDN] is permanently down[ with IP address $IP].
Description: The net interface has been down for at least 10 minutes.
Advanced Monitoring
Advanced Monitoring dashboards provide critical information about the ECS processes on the VDC you are logged in to.
The advanced monitoring dashboards are based on a time series database and are provided by Grafana, a well-known open-source analytics platform for time series data.
See the Grafana documentation for basic details of navigating Grafana dashboards.
Disk Bandwidth - Overview: You can use the Disk Bandwidth - Overview dashboard to monitor the disk usage metrics by read or write operations at the VDC level.
NOTE: In the Disk Bandwidth - Overview dashboard, the consistency checker metric shows data only for reads, not writes, because it is not relevant for writes.
Node Rebalancing: You can use the Node Rebalancing dashboard to monitor the status of data rebalancing operations when nodes are added to, or removed from, a cluster. Node rebalancing is enabled by default at installation. Contact your technical support representative to disable or reenable this feature.
48 Advanced Monitoring
Table 25. Advanced monitoring dashboards (continued)
Dashboard Description
Process Health - by Nodes: You can use the Process Health - by Nodes dashboard to monitor, for each node of the VDC, the use of the network interface, CPU, and available memory. The dashboard displays the latest values, and the history graphs display values in the selected range.
Process Health - Overview: You can use the Process Health - Overview dashboard to monitor the VDC use of the network interface, CPU, and available memory. The dashboard displays the latest average values, and the history graphs display values in the selected time range.
Process Health - Process List by Node: You can use the Process Health - Process List by Node dashboard to monitor processes' use of CPU, memory, average thread number, and last restart time in the selected time range. The dashboard displays the latest values in the selected time range.
Recovery Status: You can use the Recovery Status dashboard to monitor the data recovered by the system.
SSD Read Cache: You can use the SSD Read Cache dashboard to monitor total SSD disk capacity and the disk space that is used by the SSD read cache.
Tech Refresh: Data Migration: You can use the Tech Refresh: Data Migration dashboard to monitor the data migration off and on a node or cluster.
Top Buckets: You can use the Top Buckets dashboard to monitor the number of buckets with top utilization, based on total object size and count.
Table 26. Advanced monitoring dashboard fields (continued)
Dashboard Field Description
Field: System Failures
Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
Description: The number of data requests that failed due to hardware or service errors. System failures are failed requests that are associated with hardware or service errors (typically an HTTP error code of 5xx).

Field: User Failures
Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
Description: The number of data requests from all object heads that are classified as user failures. User failures are known error types originating from the object heads (typically an HTTP error code of 4xx).

Field: Failure % Rate
Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
Description: The percentage of failures for the VDC, namespace, nodes, or protocols.

Field: TPS (success/failure)
Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
Description: Rate of successful requests and failures per second.

Field: Bandwidth (read/write)
Dashboards: Data Access Performance - Overview, by Nodes, by Protocols
Description: Data access bandwidth of successful requests per second.

Field: Failed Requests/s by error type (user/system)
Dashboards: Data Access Performance - Overview, by Namespaces, by Nodes, by Protocols
Description: Rate of failed requests per second, split by error type (user/system).

Field: Latency
Dashboards: Data Access Performance - Overview, by Nodes, by Protocols; SSD Read Cache
Description: Latency of read/write requests.
Field: Successful request drill down
Dashboards: Data Access Performance - Overview, by Nodes
Description: Displays the rate of successful requests per second, by method, node, and protocol.

Field: Successful Requests/s by Method
Dashboards: Data Access Performance - Overview, by Nodes
Description: Rate of successful requests per second, by method.

Field: Successful Requests/s by Node
Dashboards: Data Access Performance - by Namespaces, by Nodes, by Protocols
Description: Rate of successful requests per second, by node.

Field: Successful Requests/s by Protocol
Dashboards: Data Access Performance - Overview, by Nodes
Description: Rate of successful requests per second, by protocol.

Field: Failures drill down
Dashboards: Data Access Performance - Overview, by Nodes
Description: Displays the rate of failed requests per second, by method, node, and protocol.

Field: Failed Requests/s by Method
Dashboards: Data Access Performance - Overview, by Nodes
Description: Rate of failed requests per second, by method.

Field: Failed Requests/s by Node
Dashboards: Data Access Performance - by Namespaces, by Nodes, by Protocols
Description: Rate of failed requests per second, by node.

Field: Failed Requests/s by Protocol
Dashboards: Data Access Performance - Overview, by Nodes
Description: Rate of failed requests per second, by protocol.

Field: Failed Requests/s by error code
Dashboards: Data Access Performance - Overview, by Nodes
Description: Rate of failed requests per second, by error code.

Field: Compare TPS of successful requests
Dashboards: Data Access Performance - by Nodes, by Namespaces, by Protocols
Description: Select multiple nodes and compare rates of successful requests per second.

Field: Compare TPS of failed requests
Dashboards: Data Access Performance - by Namespaces
Description: Select multiple nodes and compare rates of failed requests per second, by error type (user/system).

Field: Compare read bandwidth
Dashboards: Data Access Performance - by Nodes, by Protocols
Description: Select multiple nodes and compare data access bandwidth (read) of successful requests per second.
Field: Compare write bandwidth
Dashboards: Data Access Performance - by Nodes, by Protocols
Description: Select multiple nodes and compare data access bandwidth (write) of successful requests per second.

Field: Compare read latency
Dashboards: Data Access Performance - by Nodes, by Protocols
Description: Select multiple nodes and compare latency of read requests.

Field: Compare write latency
Dashboards: Data Access Performance - by Nodes, by Protocols
Description: Select multiple nodes and compare latency of write requests.

Field: Compare rate of failed requests/s
Dashboards: Data Access Performance - by Nodes, by Protocols
Description: Select multiple nodes and compare rates of failed requests per second, split by error type (user/system).

Field: Request drill down by nodes
Dashboards: Data Access Performance - by Namespaces
Description: Rate of requests per second, split by node.

Field: Read or Write
Dashboards: Disk Bandwidth - by Nodes, Overview
Description: Indicates whether the row describes read data or write data.

Field: Nodes
Dashboards: Disk Bandwidth - by Nodes, Overview
Description: The number of nodes in the VDC. You can click the nodes number to see the disk bandwidth metrics for each node. There is no Nodes column when you have drilled down into the Nodes display for a VDC.

Field: Total
Dashboards: Disk Bandwidth - by Nodes, Overview
Description: Total disk bandwidth that is used for either read or write operations.

Field: Hardware Recovery
Dashboards: Disk Bandwidth - by Nodes, Overview
Description: Rate at which disk bandwidth is used to recover data after a hardware failure.

Field: Erasure Encoding
Dashboards: Disk Bandwidth - by Nodes, Overview
Description: Rate at which disk bandwidth is used in system erasure coding operations.

Field: XOR
Dashboards: Disk Bandwidth - by Nodes, Overview
Description: Rate at which disk bandwidth is used in the XOR data protection operations of the system. XOR operations occur for systems with three or more sites (VDCs).

Field: Consistency Checker
Dashboards: Disk Bandwidth - by Nodes, Overview
Description: Rate at which disk bandwidth is used to check for inconsistencies between protected data and its replicas.

Field: Geo
Dashboards: Disk Bandwidth - by Nodes, Overview
Description: Rate at which disk bandwidth is used to support geo replication operations.

Field: User Traffic
Dashboards: Disk Bandwidth - by Nodes, Overview
Description: Rate at which disk bandwidth is used by object users.

Field: Data Rebalanced
Dashboards: Node Rebalancing
Description: Amount of data that has been rebalanced.

Field: Pending Rebalancing
Dashboards: Node Rebalancing
Description: Amount of data that is in the rebalance queue but has not been rebalanced yet.

Field: Rate of Rebalance (per day)
Dashboards: Node Rebalancing
Description: The incremental amount of data that was rebalanced during a specific time period. The default time period is one day.

Field: Process Restarts
Dashboards: Process Health - Process List by Node
Description: The last time the process restarted on the node in the selected time range. The maximum time range is 5 days because it is limited by the retention policy.
Field: Avg. NIC Bandwidth
Dashboards: Process Health - Overview
Description: Average bandwidth of the network interface controller hardware that is used by the selected VDC or node.

Field: NIC Bandwidth
Dashboards: Process Health - Process List by Node
Description: Bandwidth of the network interface controller hardware that is used by the selected VDC or node.

Field: Avg. CPU Usage
Dashboards: Process Health - Overview
Description: Average percentage of the CPU hardware that is used by the selected VDC or node.

Field: Avg. Memory Usage
Dashboards: Process Health - Overview
Description: Average usage of the aggregate memory available to the VDC or node.

Field: Relative NIC (%)
Dashboards: Process Health - by Nodes, Overview
Description: Percentage of the available bandwidth of the network interface controller hardware that is used by the selected VDC or node.

Field: Relative Memory (%)
Dashboards: Process Health - by Nodes, Overview, Process List by Node
Description: Percentage of the memory used relative to the memory available to the selected VDC or node.

Field: CPU Usage
Dashboards: Process Health - by Nodes, Process List by Node
Description: Percentage of the node's CPU used by the process. The list of processes that are tracked is not the complete list of processes running on the node. The sum of the CPU used by the processes is not equal to the CPU usage shown for the node.

Field: Memory Usage
Dashboards: Process Health - by Nodes
Description: The memory used by the process.

Field: Relative Memory (%)
Dashboards: Process Health - by Nodes, Overview, Process List by Node
Description: Percentage of the memory used relative to the memory available to the process.

Field: Avg. # Thread
Dashboards: Process Health - Process List by Node
Description: Average number of threads used by the process.

Field: Last Restart
Dashboards: Process Health - Process List by Node
Description: The last time the process restarted on the node.

Field: Host
Dashboards: Process Health - by Nodes
Description: -

Field: Process
Dashboards: Process Health - Process List by Node
Description: -

Field: Amount of Data to be Recovered
Dashboards: Recovery Status
Description: With the Current filter selected, this is the logical size of the data yet to be recovered.
● When a historical period is selected as the filter, Total Amount Data to be Recovered means the average amount of data pending recovery during the selected time.
● For example, if the first hourly snapshot in a historical time period showed 400 GB of data to be recovered and every other snapshot showed 0 GB waiting to be recovered, the value of this field would be 400 GB divided by the total number of hourly snapshots in the period.

Field: Disk Usage
Dashboards: SSD Read Cache
Description: SSD space used by the read cache.

Field: Disk Capacity
Dashboards: SSD Read Cache
Description: Total SSD disk capacity.

Field: Remaining Volume to Migrate
Dashboards: Tech Refresh: Data Migration
Description: This panel shows a graph of the remaining volume on source nodes.

Field: Migration Speed
Dashboards: Tech Refresh: Data Migration
Description: This panel shows a graph of the migration speed on source nodes.
Field: Data Migration Status
Dashboards: Tech Refresh: Data Migration
Description: Detailed status of migration on source nodes. Migration speed and predictions are calculated based on the last 1 hour of the currently selected time interval.

Field: Top Buckets by Size
Dashboards: Top Buckets
Description: Top used buckets by size.

Field: Top Buckets by Object Count
Dashboards: Top Buckets
Description: Top used buckets by object count.

Field: Time of Calculation
Dashboards: Top Buckets
Description: The time at which the displayed metrics of the Top Buckets dashboard were calculated.
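The historical averaging described for the Amount of Data to be Recovered field can be sketched as follows. The 24-snapshot window is an assumed example (one day of hourly snapshots), not an ECS default:

```python
# Historical "Amount of Data to be Recovered" is the average of the hourly
# snapshots in the selected period. The values below follow the example in
# the table: one snapshot shows 400 GB pending, every other snapshot 0 GB.
hourly_snapshots_gb = [400] + [0] * 23  # assumed 24-hour window

average_gb = sum(hourly_snapshots_gb) / len(hourly_snapshots_gb)
print(round(average_gb, 2))  # 400 GB / 24 snapshots ≈ 16.67
```

A single short recovery event therefore appears much smaller when averaged over a long historical window.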
View mode
Steps
1. To view a dashboard in the view mode, click the title of a dashboard, for example, TPS (success/failure) > View.
The dashboard opens in view mode or in full-screen mode.
2. Click the Back to dashboard icon to return to the dashboards view.
Export CSV
Steps
1. To export the dashboard data to .csv format, click the title of a dashboard, for example, TPS (success/failure) > More > Export CSV.
The Export CSV window opens.
You can customize the CSV output by modifying the Mode and Date Time Format attributes, and by selecting or clearing the Excel CSV Dialect attribute.
2. Click Export > Save to save the dashboard data in .csv format to your local storage.
Data Access Performance - by Namespaces
In the Data Access Performance - by Namespaces dashboard, you can monitor for namespaces:
● TPS (success/failure)
● Failed Requests/s by error type (user/system)
● Successful Requests/s by Node
● Failed Requests/s by Node
● Compare TPS of successful requests
● Compare TPS of failed requests
To view the Data Access Performance - by Namespaces dashboard in the ECS Portal, select Advanced Monitoring >
Related dashboards > Data Access Performance - by Namespaces.
All the namespace data is visible in the default view. To select a namespace, click the legend parameter for the namespace below the graph.
The Requests drill down by nodes panel shows the successful and failed requests by node.
Compare: select multiple namespaces to compare TPS of successful and failed requests.
Data Access Performance - by Nodes
In the Data Access Performance - by Nodes dashboard, you can monitor for nodes in a VDC:
● TPS (success/failure)
● Bandwidth (read/write)
● Failed Requests/s by error type (user/system)
● Latency
● Successful Requests/s by Method
● Successful Requests/s by Node
● Successful Requests/s by Protocol
● Failed Requests/s by Method
● Failed Requests/s by Node
● Failed Requests/s by Protocol
● Failed Requests/s by error code
● Compare TPS of successful requests
● Compare TPS of failed requests
● Compare read bandwidth
● Compare write bandwidth
● Compare read latency
● Compare write latency
To view the Data Access Performance - by Nodes dashboard in the ECS Portal, select Advanced Monitoring > Related
dashboards > Data Access Performance - by Nodes.
Data for all the nodes is visible in the default view. To select data for a node, click the legend parameter for the node below the graph.
The Successful requests drill down shows the successful requests by method, node, and protocol.
The Failures drill down shows the failed requests by method, node, protocol, and error code.
Compare: select multiple nodes to compare TPS of successful and failed requests, read/write bandwidth, and read/write latency.
Data Access Performance - by Protocols
In the Data Access Performance - by Protocols dashboard, based on the protocol, you can monitor:
● TPS (success/failure)
● Bandwidth (read/write)
● Failed Requests/s by error type (user/system)
● Latency
● Successful Requests/s by Node
● Failed Requests/s by Node
● Compare TPS of successful requests
● Compare TPS of failed requests
● Compare read bandwidth
● Compare write bandwidth
● Compare read latency
● Compare write latency
To view the Data Access Performance - by Protocols dashboard in the ECS Portal, select Advanced Monitoring > Related dashboards > Data Access Performance - by Protocols.
Data for all the protocols is visible in the default view. To select data for a protocol, click the legend parameter for the protocol below the graph.
The Requests drill down by nodes shows the successful and failed requests by node.
Compare: select multiple protocols to compare TPS of successful and failed requests, read/write bandwidth, and read/write latency.
Disk Bandwidth - by Nodes
You can use the Disk Bandwidth - by Nodes dashboard to monitor the disk usage metrics by read or write operations at the node level. The dashboard displays the latest values.
To view the Disk Bandwidth - by Nodes dashboard, click Advanced Monitoring > expand Data Access Performance -
Overview > Disk Bandwidth - by Nodes
Disk Bandwidth - Overview
You can use the Disk Bandwidth - Overview dashboard to monitor the disk usage metrics by read or write operations at the VDC level.
To view the Disk Bandwidth - Overview dashboard, click Advanced Monitoring > expand Data Access Performance -
Overview > Disk Bandwidth - Overview
Node Rebalancing
You can use the Node Rebalancing dashboard to monitor the status of data rebalancing operations when nodes are added to,
or removed from, a cluster. Node rebalancing is enabled by default at installation. Contact your customer support representative
to disable or re-enable this feature.
To view the Node Rebalancing dashboard, click Advanced Monitoring > expand Data Access Performance - Overview >
Node Rebalancing
A series of interactive graphs shows the amount of data rebalanced, the amount pending rebalancing, and the rate of rebalancing data in bytes over time.
Node rebalancing works only for new nodes that are added to the cluster.
Process Health - by Nodes
You can use the Process Health - by Nodes dashboard to monitor, for each node of the VDC, the use of the network interface, CPU, and available memory. The dashboard displays the latest values, and the history graphs display values in the selected range.
To view the Process Health - by Nodes dashboard, click Advanced Monitoring > expand Data Access Performance -
Overview > Process Health - by Nodes
Process Health - Overview
You can use the Process Health - Overview dashboard to monitor the VDC use of network interface, CPU, and available
memory. The dashboard displays the latest average values and the history graphs display values in the selected time range.
To view the Process Health - Overview dashboard, click Advanced Monitoring > expand Data Access Performance -
Overview > Process Health - Overview
Process Health - Process List by Node
You can use the Process Health - Process List by Node dashboard to monitor processes' use of CPU, memory, average thread number, and last restart time in the selected time range. The dashboard displays the latest values in the selected time range.
To view the Process Health - Process List by Node dashboard, click Advanced Monitoring > expand Data Access
Performance - Overview > Process Health - Process List by Node
Recovery Status
Top Buckets
ECS includes a mechanism in metering to calculate the number of buckets with top utilization, based on total object size and count.
Statistics of the buckets with top utilization for the system are displayed in monitoring dashboards. The number of buckets that are displayed on the monitoring dashboard is a configurable value.
To view the Top buckets dashboard, click Advanced Monitoring > expand Data Access Performance - Overview > Top
buckets.
Automatic Metering Reconstruction
Automatic metering reconstruction is a mechanism to reconstruct the metering statistics completely.
Metering is responsible for storing the statistics for utilization by namespace and bucket that is based on object size and
count. When an object is created in a bucket, then the statistics are reported to the metering service where the statistics
are aggregated and stored. Statistics are aggregated and mapped to the time which is the nearest lower multiple of five minutes. For example, objects that are created at 10:04:59 pm are mapped to 10:00:00 pm. The metering statistics are stored in time series format to provide a historical view of the statistics and to serve billing sample queries. The statistics are displayed in a time window.
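The five-minute mapping can be sketched as follows; the function name is illustrative, not part of the ECS API:

```python
from datetime import datetime, timedelta

def metering_window(ts: datetime) -> datetime:
    """Map a timestamp to the start of its five-minute aggregation window."""
    # Drop seconds/microseconds and round the minute down to a multiple of 5.
    return ts - timedelta(minutes=ts.minute % 5,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

# The document's example: an object created at 10:04:59 pm maps to 10:00:00 pm.
print(metering_window(datetime(2021, 5, 1, 22, 4, 59)))  # 2021-05-01 22:00:00
```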
As a result of logic errors in the metering implementation for blob service side operations, wrong statistics are reported to metering. Incorrect metering information gets compounded and remains inaccurate from that point forward. Automatic metering reconstruction is a mechanism to overcome the problem of erroneous statistics.
This feature is disabled in ECS 3.5.0.0. You have to enable it manually.
The automatic reconstruction is invoked in the following scenarios:
● During upgrade
● When the system recovers from a PSO (permanent site outage)
Flux API
The Flux API enables you to retrieve time series database data by sending REST queries using curl. You can get raw data from the fluxd service in a way similar to using the Dashboard API. You must get a token and provide the token in the requests.
Prerequisites
Requires one of the following roles:
● SYSTEM_ADMIN
● SYSTEM_MONITOR
Request payload examples
json:
{
"query": "from(bucket:\"monitoring_main\") |> range(start: -30m) |> filter(fn: (r) =>
r._measurement == \"statDataHead_performance_internal_transactions\")"
}
flux:
query=from(bucket: "monitoring_main")
|> range(start: -30m)
|> filter(fn: (r) => r._measurement == "statDataHead_performance_internal_transactions")
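Both payload forms carry the same Flux query. As a sketch, the query string and the JSON request body can be assembled programmatically before being sent with curl or any HTTP client; the helper function below is illustrative, not part of the ECS API:

```python
import json

def flux_query(bucket: str, measurement: str, start: str = "-30m") -> str:
    """Build a Flux query string matching the payload examples above."""
    return (f'from(bucket:"{bucket}") '
            f'|> range(start: {start}) '
            f'|> filter(fn: (r) => r._measurement == "{measurement}")')

query = flux_query("monitoring_main",
                   "statDataHead_performance_internal_transactions")
# JSON request body, as in the first payload example above:
print(json.dumps({"query": query}))
```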
Steps
1. Generate a token.
Token:
admin@ecs:/> echo $tok
X-SDS-AUTH-TOKEN:****
"dashboard"
],
[
"0",
"2020-03-10T09:54:31.207799855Z",
"2020-03-10T10:24:31.207799855Z",
"2020-03-10T10:06:43Z",
"1",
"failed_request_counter",
"statDataHead_performance_internal_transactions",
"ecs.lss.emc.com",
"28cd473e-ca45-4623-b30d-0481c548a650",
"statDataHead",
"dashboard"
],
CSV example
Database monitoring_main
Performance metrics in this database are raw; each is split by data node, that is, all have host and node_id tags.
Information:
Measurement names in this section are prefixed with the name of the ECS service that produces the measurement, for example blob, cm, georcv, or statDataHead.
For example,
blob_IO_Statistics_data_read
cm_IO_Statistics_data_write
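Since the service prefix is the first underscore-delimited token of the measurement name, it can be recovered mechanically; a minimal sketch:

```python
def service_of(measurement: str) -> str:
    """Return the ECS service prefix of a measurement name, per the naming scheme above."""
    # Split on the first underscore only; the remainder is the statistic name.
    return measurement.split("_", 1)[0]

print(service_of("blob_IO_Statistics_data_read"))  # blob
print(service_of("cm_IO_Statistics_data_write"))   # cm
```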
Measurement: blob_IO_Statistics_data_read
...
Tags: host, node_id, process, tag
Fields: read_CCTotal (float, bytes)
read_ECTotal (float, bytes)
read_GEOTotal (float, bytes)
read_RECOVERTotal (float, bytes)
read_USERTotal (float, bytes)
read_XORTotal (float, bytes)
Measurement: blob_IO_Statistics_data_write
...
Tags: host, node_id, process, tag
Fields: write_CCTotal (integer)
write_ECTotal (integer)
write_GEOTotal (integer)
write_RECOVERTotal (integer)
write_USERTotal (integer)
write_XORTotal (integer)
Measurement: blob_SSDReadCache_Stats
Tags: host, id, last, node_id, process
Fields: +Inf (integer)
0.0 (integer)
1000.0 (integer)
25000.0 (integer)
5000.0 (integer)
rocksdb_disk_capacity_failure_counter (integer)
rocksdb_disk_usage_counter_bytes (integer)
rocksdb_disk_usage_percentage_counter (integer)
ssd_capacity_counter_bytes (integer)
CM statistics
These statistics represent processes in the ECS CM service, such as BTree GC, chunk management, and erasure coding.
Measurement: cm_BTREE_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_candidate_garbage_btree_gc_level_0 (integer)
accumulated_candidate_garbage_btree_gc_level_1 (integer)
accumulated_detected_data_btree_level_0 (integer)
accumulated_detected_data_btree_level_1 (integer)
accumulated_reclaimed_data_btree_level_0 (integer)
accumulated_reclaimed_data_btree_level_1 (integer)
candidate_chunks_btree_gc_level_0 (integer)
candidate_chunks_btree_gc_level_1 (integer)
candidate_garbage_btree_gc_level_0 (integer)
candidate_garbage_btree_gc_level_1 (integer)
copy_candidate_chunks_btree_gc_level_0 (integer)
copy_candidate_chunks_btree_gc_level_1 (integer)
copy_completed_chunks_btree_gc_level_0 (integer)
copy_completed_chunks_btree_gc_level_1 (integer)
copy_waiting_chunks_btree_gc_level_0 (integer)
copy_waiting_chunks_btree_gc_level_1 (integer)
deleted_chunks_btree_level_0 (integer)
deleted_chunks_btree_level_1 (integer)
deleted_data_btree_level_0 (integer)
deleted_data_btree_level_1 (integer)
full_reclaimable_chunks_btree_gc_level_0 (integer)
full_reclaimable_chunks_btree_gc_level_1 (integer)
reclaimed_data_btree_level_0 (integer)
reclaimed_data_btree_level_1 (integer)
usage_between_0%_and_5%_chunks_btree_gc_level_0 (integer)
usage_between_0%_and_5%_chunks_btree_gc_level_1 (integer)
usage_between_10%_and_15%_chunks_btree_gc_level_0 (integer)
usage_between_10%_and_15%_chunks_btree_gc_level_1 (integer)
usage_between_5%_and_10%_chunks_btree_gc_level_0 (integer)
usage_between_5%_and_10%_chunks_btree_gc_level_1 (integer)
verification_waiting_chunks_btree_gc_level_0 (integer)
verification_waiting_chunks_btree_gc_level_1 (integer)
Measurement: cm_Chunk_Statistics
Tags: host, node_id, process, tag
Fields: chunks_copy (integer)
chunks_copy_active (integer)
chunks_copy_s0 (integer)
chunks_level_0_btree (integer)
chunks_level_0_btree_active (integer)
chunks_level_0_btree_active_index_page (integer)
chunks_level_0_btree_active_leaf_page (integer)
chunks_level_0_btree_index_page (integer)
chunks_level_0_btree_leaf_page (integer)
chunks_level_0_btree_s0 (integer)
chunks_level_0_btree_s0_index_page (integer)
chunks_level_0_btree_s0_leaf_page (integer)
chunks_level_0_journal (integer)
chunks_level_0_journal_active (integer)
chunks_level_0_journal_s0 (integer)
chunks_level_1_btree (integer)
chunks_level_1_btree_active (integer)
chunks_level_1_btree_active_index_page (integer)
chunks_level_1_btree_active_leaf_page (integer)
chunks_level_1_btree_index_page (integer)
chunks_level_1_btree_leaf_page (integer)
chunks_level_1_btree_s0 (integer)
chunks_level_1_btree_s0_index_page (integer)
chunks_level_1_btree_s0_leaf_page (integer)
chunks_level_1_journal (integer)
chunks_level_1_journal_active (integer)
chunks_level_1_journal_s0 (integer)
chunks_repo (integer)
chunks_repo_active (integer)
chunks_repo_s0 (integer)
chunks_typeII_ec_pending (integer)
chunks_typeI_ec_pending (integer)
chunks_undertransform_ec_pending (integer)
chunks_xor (integer)
data_copy (integer)
data_level_0_btree (integer)
data_level_0_btree_index_page (integer)
data_level_0_btree_leaf_page (integer)
data_level_0_journal (integer)
data_level_1_btree (integer)
data_level_1_btree_index_page (integer)
data_level_1_btree_leaf_page (integer)
data_level_1_journal (integer)
data_repo (integer)
data_repo_copy (integer)
data_xor (integer)
data_xor_shipped (integer)
Measurement: cm_EC_Statistics
Tags: host, node_id, process, tag
Fields: chunks_ec_encoded (integer)
chunks_ec_encoded_alive (integer)
data_ec_encoded (integer)
data_ec_encoded_alive (integer)
Measurement: cm_Geo_Replication_Statistics_Geo_Chunk_Cache
Tags: host, node_id, process, tag
Fields: Capacity_of_Cache (integer)
Number_of_Chunks (integer)
Measurement: cm_REPO_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_deleted_garbage_repo (integer)
accumulated_reclaimed_garbage_repo (integer)
deleted_chunks_repo (integer)
deleted_data_repo (integer)
ec_freed_slots (integer)
full_reclaimable_aligned_chunk (integer)
merge_copy_overhead_in_deleted_data_repo (integer)
merge_copy_overhead_in_reclaimed_data_repo (integer)
reclaimed_chunk_repo (integer)
reclaimed_data_repo (integer)
slots_waiting_shipping (integer)
slots_waiting_verification (integer)
total_ec_free_slots (integer)
Measurement: cm_Rebalance_Statistics
Tags: host, node_id, process, tag
Fields: bytes_rebalanced (integer)
bytes_rebalancing_failed (integer)
chunks_canceled (integer)
chunks_for_rebalancing (integer)
chunks_rebalanced (integer)
chunks_total (integer)
jobs_canceled (integer)
segments_for_rebalancing (integer)
segments_rebalanced (integer)
segments_rebalancing_failed (integer)
segments_total (integer)
Measurement: cm_Rebalance_Statistics_CoS
Tags: CoS, host, node_id, process, tag
Fields: bytes_rebalanced (integer)
bytes_rebalancing_failed (integer)
chunks_canceled (integer)
chunks_for_rebalancing (integer)
chunks_rebalanced (integer)
chunks_total (integer)
jobs_canceled (integer)
segments_for_rebalancing (integer)
segments_rebalanced (integer)
segments_rebalancing_failed (integer)
segments_total (integer)
Measurement: cm_Recover_Statistics
Tags: host, node_id, process, tag
Fields: chunks_to_recover (integer)
data_recovered (integer)
data_to_recover (integer)
Measurement: cm_Recover_Statistics_CoS
Tags: CoS, host, node_id, process, tag
Fields: chunks_to_recover (integer)
data_recovered (integer)
data_to_recover (integer)
SR statistics
These statistics come from the ECS SR service, which is responsible for space reclamation.
Measurement: sr_REPO_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_merge_copy_overhead_in_full_garbage (integer)
accumulated_total_repo_garbage (integer)
full_reclaimable_repo_chunk (integer)
garbage_in_partial_sr_tasks (integer)
garbage_in_repo_usage (integer)
merge_copy_overhead_in_full_garbage (integer)
merge_way_gc_processed_chunks (integer)
merge_way_gc_src_chunks (integer)
merge_way_gc_targeted_chunks (integer)
merge_way_gc_tasks (integer)
total_repo_garbage (integer)
usage_between_0%_and_33.3%_repo_chunk (integer)
usage_between_33.3%_and_50%_repo_chunk (integer)
usage_between_50%_and_66.7%_repo_chunk (integer)
SSM statistics
These statistics come from the ECS storage manager service (SSM).
Measurement: ssm_sstable_SSTable_SS
Tags: SS, SSTable, last, process, tag
Fields: allocatedSpace (integer)
availableFreeSpace (integer)
downDurationTotal (integer)
freeSpace (integer)
largeBlockAllocated (integer)
largeBlockAllocatedSize (integer)
largeBlockFreed (integer)
largeBlockFreedSize (integer)
pendingDurationTotal (integer)
pingerDurationTotal (integer)
smallBlockAllocated (integer)
smallBlockFreed (integer)
smallBlockFreedSize (integer)
smallBlockSize (integer)
state (string)
timeInStateTotal (integer)
totalSpace (integer)
upDurationTotal (integer)
Measurement: ssm_sstable_SSTable_SS_datamigration
Tags: SS, SSTable, last, process
Fields: status (integer)
totalCapacityToMigrate (integer)
Database monitoring_last
Measurement: blob_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: blob_Total_memory_and_disk_cache_size
Tags: Total_memory_and_disk_cache_size, host, last, node_id, process
Fields: Disk_cache_size (integer)
Memory_cache_size (integer)
Measurement: cm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: eventsvc_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: mm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: resource_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: rm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: sr_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: sr_Total_memory_and_disk_cache_size
Tags: Total_memory_and_disk_cache_size, host, last, node_id, process
Fields: Disk_cache_size (integer)
Memory_cache_size (integer)
Measurement: ssm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)
Measurement: dtquery_cmf
Tags: last, process
Fields: com.emc.ecs.chunk.gc.btree.enabled (integer)
com.emc.ecs.chunk.gc.btree.scanner.verification.enabled (integer)
com.emc.ecs.chunk.gc.repo.enabled (integer)
com.emc.ecs.chunk.gc.repo.verification.enabled (integer)
com.emc.ecs.chunk.rebalance.is_enabled (integer)
com.emc.ecs.objectgc.cas.enabled (integer)
com.emc.ecs.sensor.btree_sr_pending_mininum (integer)
com.emc.ecs.sensor.repo_sr_pending_mininum (integer)
Measurement: mm_topn_bucket_by_obj_count_place
Tags: last, place, process, tag
Fields: bucketName (string)
namespace (string)
value (integer)
Measurement: mm_topn_bucket_by_obj_size_place
Tags: last, place, process, tag
Fields: bucketName (string)
namespace (string)
value (integer)
Measurement: vnestStat_membership_ismember
Tags: host, ismember, last, node_id, process
Fields: is_leader (string)
Measurement: vnestStat_performance_latency_type
Tags: host, id, last, node_id, process, type
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
7999999.99999999 (integer)
825912.9477680004 (integer)
85266.52466135359 (integer)
8802.840841123942 (integer)
9.686250859269972 (integer)
908.7975284781536 (integer)
93.82345570870827 (integer)
Measurement: vnestStat_performance_transactions_from_type
Tags: from, host, last, node_id, process, type
Fields: failed_request_counter (integer)
succeed_request_counter (integer)
Database monitoring_op
Information:
Measurements listed in this section come from default Telegraf plugins; each measurement name equals its plugin name. Refer to the plugin documentation for more information. For example, documentation for the Telegraf "cpu" plugin is available at https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu.
Measurement: cpu
Tags: cpu, host, node_id, tag
Fields: usage_guest (float)
usage_guest_nice (float)
usage_idle (float)
usage_iowait (float)
usage_irq (float)
usage_nice (float)
usage_softirq (float)
usage_steal (float)
usage_system (float)
usage_user (float)
Measurement: disk
Tags: device, fstype, host, mode, node_id, path, tag
Fields: free (integer)
inodes_free (integer)
inodes_total (integer)
inodes_used (integer)
total (integer)
used (integer)
used_percent (float)
Measurement: diskio
Tags: ID_PART_ENTRY_UUID, SCSI_IDENT_SERIAL, SCSI_MODEL, SCSI_REVISION, SCSI_VENDOR,
host, name, node_id, tag
Fields: io_time (integer)
iops_in_progress (integer)
read_bytes (integer)
read_time (integer)
reads (integer)
weighted_io_time (integer)
write_bytes (integer)
write_time (integer)
writes (integer)
Measurement: linux_sysctl_fs
Tags: host, node_id, tag
Fields: aio-max-nr (integer)
aio-nr (integer)
dentry-age-limit (integer)
dentry-nr (integer)
dentry-unused-nr (integer)
dentry-want-pages (integer)
file-max (integer)
file-nr (integer)
inode-free-nr (integer)
inode-nr (integer)
inode-preshrink-nr (integer)
Measurement: mem
Tags: host, node_id, tag
Fields: active (integer)
available (integer)
available_percent (float)
buffered (integer)
cached (integer)
commit_limit (integer)
committed_as (integer)
dirty (integer)
free (integer)
high_free (integer)
high_total (integer)
huge_page_size (integer)
huge_pages_free (integer)
huge_pages_total (integer)
inactive (integer)
low_free (integer)
low_total (integer)
mapped (integer)
page_tables (integer)
shared (integer)
slab (integer)
swap_cached (integer)
swap_free (integer)
swap_total (integer)
total (integer)
used (integer)
used_percent (float)
vmalloc_chunk (integer)
vmalloc_total (integer)
vmalloc_used (integer)
wired (integer)
write_back (integer)
write_back_tmp (integer)
Measurement: net
Tags: host, interface, node_id, tag
Fields: bytes_recv (integer)
bytes_sent (integer)
bytes_sum (integer)
drop_in (integer)
drop_out (integer)
err_in (integer)
err_out (integer)
packets_recv (integer)
packets_sent (integer)
packets_sum (integer)
speed (integer)
utilization (integer)
Measurement: nstat
Tags: host, name, node_id, tag
Fields: IpExtInOctets (integer)
IpExtOutOctets (integer)
TcpInErrs (integer)
UdpInErrors (integer)
Measurement: processes
Tags: host, node_id, tag
Fields: blocked (integer)
dead (integer)
idle (integer)
paging (integer)
running (integer)
sleeping (integer)
stopped (integer)
total (integer)
total_threads (integer)
unknown (integer)
zombies (integer)
Measurement: procstat
Tags: host, node_id, process_name, tag, user
Fields: cpu_time (integer)
cpu_time_guest (float)
cpu_time_guest_nice (float)
cpu_time_idle (float)
cpu_time_iowait (float)
cpu_time_irq (float)
cpu_time_nice (float)
cpu_time_soft_irq (float)
cpu_time_steal (float)
cpu_time_stolen (float)
cpu_time_system (float)
cpu_time_user (float)
cpu_usage (float)
create_time (integer)
involuntary_context_switches (integer)
memory_data (integer)
memory_locked (integer)
memory_rss (integer)
memory_stack (integer)
memory_swap (integer)
memory_vms (integer)
nice_priority (integer)
num_fds (integer)
num_threads (integer)
pid (integer)
read_bytes (integer)
read_count (integer)
realtime_priority (integer)
rlimit_cpu_time_hard (integer)
rlimit_cpu_time_soft (integer)
rlimit_file_locks_hard (integer)
rlimit_file_locks_soft (integer)
rlimit_memory_data_hard (integer)
rlimit_memory_data_soft (integer)
rlimit_memory_locked_hard (integer)
rlimit_memory_locked_soft (integer)
rlimit_memory_rss_hard (integer)
rlimit_memory_rss_soft (integer)
rlimit_memory_stack_hard (integer)
rlimit_memory_stack_soft (integer)
rlimit_memory_vms_hard (integer)
rlimit_memory_vms_soft (integer)
rlimit_nice_priority_hard (integer)
rlimit_nice_priority_soft (integer)
rlimit_num_fds_hard (integer)
rlimit_num_fds_soft (integer)
rlimit_realtime_priority_hard (integer)
rlimit_realtime_priority_soft (integer)
rlimit_signals_pending_hard (integer)
rlimit_signals_pending_soft (integer)
signals_pending (integer)
voluntary_context_switches (integer)
write_bytes (integer)
write_count (integer)
Measurement: swap
Tags: host, node_id, tag
Fields: free (integer)
in (integer)
out (integer)
total (integer)
used (integer)
used_percent (float)
Measurement: system
Tags: host, node_id, tag
Fields: load1 (float)
load15 (float)
load5 (float)
n_cpus (integer)
n_users (integer)
uptime (integer)
uptime_format (string)
DT statistics
Measurement: dtquery_dt_dist_dt_node_id_type
Tags: dt_node_id, process, tag, type
Fields: count_i (integer)
Measurement: dtquery_dt_dist_host_dt_node_id
Tags: dt_node_id, process, tag
Fields: count_i (integer)
Measurement: dtquery_dt_dist_type_type
Tags: process, tag, type
Fields: count_i (integer)
Measurement: dtquery_dt_status
Tags: process, tag
Fields: total (integer)
unknown (integer)
unready (integer)
Measurement: dtquery_dt_status_detailed_type
Tags: process, tag, type
Fields: total (integer)
unknown (integer)
unready (integer)
Measurement: ecs_fabric_agent_dirstat_size_bytes
Tags: host, node_id, path, tag, url
Fields: gauge (float)
SR journal statistics
Measurement: sr_JournalParser_GC_RG_DT
Tags: DT, RG, last, process
Fields: majorMinorOfJournalRegion (string)
pendingChunks (integer)
timestampOfChunkRegion (string)
timestampOfJournalParserLastRun (string)
Measurement: sr_ObjectGC_CAS_RG
Tags: RG, last, process
Fields: STATUS (string)
Measurement: vnestStat_btree
Tags: cumulative_stats, host, level, node_id, tag
Fields: level_count (float)
page_count (float)
size_bytes (float)
Database monitoring_vdc
Metrics in this database are values calculated over the whole VDC, without reference to a particular data node.
Information:
The metrics below are aggregations over data nodes of the raw measurements used in the Grafana ECS UI.
Measurement: cq_disk_bandwidth
Tags: type_op ('read', 'write')
Fields: consistency_checker (float)
erasure_encoding (float)
geo (float)
hardware_recovery (float)
total (float)
user_traffic (float)
xor (float)
Measurement: cq_node_rebalancing_summary
Tags: none
Fields: data_rebalanced (integer)
pending_rebalance (integer)
Measurement: cq_process_health
Tags: none
Fields: cpu_used (float)
mem_used (float)
mem_used_percent (float)
nic_bytes (float)
nic_utilization (float)
Measurement: cq_recover_status_summary
Tags: none
Fields: data_recovered (integer)
data_to_recover (integer)
Database monitoring_main
Performance metrics in this database are raw; each is split by data node, that is, all have host and node_id tags.
Most integer fields are increasing counters, that is, values that only increase over time. Increasing counters restart from zero after a datahead service restart.
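Because these counters restart from zero, a naive difference between two samples can go negative across a service restart; a rate computation should treat a decrease as a reset. A minimal sketch (illustrative only, not part of ECS):

```python
def rate_per_second(samples):
    """Estimate a per-second rate from an increasing counter.

    `samples` is a list of (unix_timestamp, counter_value) pairs. When the
    counter drops (for example after a datahead service restart), the drop is
    treated as a restart from zero, so that interval contributes the new
    value alone instead of a negative difference.
    """
    total = 0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        total += (v1 - v0) if v1 >= v0 else v1  # reset: counter restarted from zero
    elapsed = samples[-1][0] - samples[0][0]
    return total / elapsed if elapsed > 0 else 0.0

# The counter restarts between the 2nd and 3rd samples:
# increases are 60, 20 (post-reset value), and 60 over 180 seconds.
print(rate_per_second([(0, 100), (60, 160), (120, 20), (180, 80)]))  # ≈ 0.778
```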
Measurement: statDataHead_performance_internal_error
Tags: host, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)
Measurement: statDataHead_performance_internal_error_code
Tags: code, host, node_id, process, tag
Fields: error_counter (integer)
Measurement: statDataHead_performance_internal_error_head
Tags: head, host, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)
Measurement: statDataHead_performance_internal_error_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)
Measurement: statDataHead_performance_internal_latency
Tags: host, id, node_id, process, tag
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
111.6295328521717 (integer)
12461.15260479408 (integer)
23.183877401213103 (integer)
2588.0054039994393 (integer)
4.814963904455889 (integer)
537.4921713544796 (integer)
59999.999999999985 (integer)
Measurement: statDataHead_performance_internal_latency_head
Tags: head, host, id, node_id, process, tag
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
111.6295328521717 (integer)
12461.15260479408 (integer)
23.183877401213103 (integer)
2588.0054039994393 (integer)
4.814963904455889 (integer)
537.4921713544796 (integer)
59999.999999999985 (integer)
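The numeric field names in these latency measurements look like upper bounds of histogram buckets, with +Inf as the final catch-all bucket. Assuming the counts are cumulative (each field counts observations at or below its bound, which is not stated in this guide), percentiles such as the p50/p99 fields in monitoring_vdc can be estimated from them. An illustrative sketch with made-up counts:

```python
def percentile_from_buckets(buckets, q):
    """Estimate the q-th quantile (0 < q <= 1) from cumulative histogram
    buckets: a dict mapping each upper bound to the cumulative count of
    observations at or below it, with "+Inf" mapped to float("inf")."""
    total = buckets[float("inf")]
    target = q * total
    for bound in sorted(b for b in buckets if b != float("inf")):
        if buckets[bound] >= target:
            return bound  # smallest bound covering the target quantile
    return float("inf")

# Hypothetical cumulative counts keyed by bucket bounds listed above.
counts = {0.0: 0, 1.0: 5, 23.183877401213103: 40, 111.6295328521717: 90,
          537.4921713544796: 99, float("inf"): 100}
print(percentile_from_buckets(counts, 0.50))  # 111.6295328521717
print(percentile_from_buckets(counts, 0.99))  # 537.4921713544796
```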
Measurement: statDataHead_performance_internal_throughput
Tags: host, node_id, process, tag
Fields: total_read_requests_size (integer)
total_write_requests_size (integer)
Measurement: statDataHead_performance_internal_throughput_head
Tags: head, host, node_id, process, tag
Fields: total_read_requests_size (integer)
total_write_requests_size (integer)
Measurement: statDataHead_performance_internal_transactions
Tags: host, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)
Measurement: statDataHead_performance_internal_transactions_head
Tags: head, host, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)
Measurement: statDataHead_performance_internal_transactions_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)
Measurement: statDataHead_performance_internal_transactions_method
Tags: host, method, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)
Database monitoring_vdc
Performance metrics in this database are values calculated over the whole VDC, without reference to a particular data node.
Most values are:
● Rates (number of requests per second): all measurements not ending in "_delta"
● Delta values (the increase of a counter since the previous timestamp): all measurements ending in "_delta"
● Downsampled values (aggregated to one point per day): all measurements ending in "_downsampled"
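The suffix convention above can be applied mechanically; a small helper (illustrative only, and following the "most values" rule, so latency percentiles are an exception) that maps a measurement name to how its values should be read:

```python
def value_semantics(name):
    """Describe how to read a monitoring_vdc measurement from its suffix."""
    parts = []
    if name.endswith("_downsampled"):
        parts.append("downsampled to one point per day")
        name = name[: -len("_downsampled")]
    if name.endswith("_delta"):
        parts.append("delta: counter increase since the previous timestamp")
    else:
        # Default per the rule above; percentile fields are an exception.
        parts.append("rate: requests per second")
    return "; ".join(reversed(parts))

print(value_semantics("cq_performance_transaction"))
print(value_semantics("cq_performance_error_delta"))
print(value_semantics("cq_performance_error_delta_downsampled"))
```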
Measurement: cq_performance_error
Tags: none
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_downsampled
Tags: none
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_code
Tags: code
Fields: error_counter (float)
Measurement: cq_performance_error_code_downsampled
Tags: code
Fields: error_counter (float)
Measurement: cq_performance_error_delta
Tags: none
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_delta_downsampled
Tags: none
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_head
Tags: head
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_head_downsampled
Tags: head
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_head_delta
Tags: head
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_head_delta_downsampled
Tags: head
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_ns
Tags: namespace
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_ns_downsampled
Tags: namespace
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_ns_delta
Tags: namespace
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_ns_delta_downsampled
Tags: namespace
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_latency
Tags: id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_latency_downsampled
Tags: id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_latency_head
Tags: head, id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_latency_head_downsampled
Tags: head, id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_throughput
Tags: none
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_throughput_downsampled
Tags: none
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_throughput_head
Tags: head
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_throughput_head_downsampled
Tags: head
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_transaction
Tags: none
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_downsampled
Tags: none
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_delta
Tags: none
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_delta_downsampled
Tags: none
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_head
Tags: head
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_head_downsampled
Tags: head
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_head_delta
Tags: head
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_head_delta_downsampled
Tags: head
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_method
Tags: method
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_method_downsampled
Tags: method
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_ns
Tags: namespace
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_ns_downsampled
Tags: namespace
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_ns_delta
Tags: namespace
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_ns_delta_downsampled
Tags: namespace
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Processes statistics
Dashboard API
GET /dashboard/nodes/{id}/processes
GET /dashboard/processes/{id}
Flux API
Database:
● monitoring_op
Measurement:
● procstat (detailed info on available fields and tags: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/procstat)
Fields:
● memory_rss: resident memory of a process (bytes)
● cpu_usage: CPU usage percentage for a process (percent used of a single CPU)
● num_threads: number of threads used by the process (integer)
Tags:
● process_name: valid process names:
○ nvmeengine
○ nvmetargetviewer
○ dtsm
○ rack-service-manager
○ rpcbind
○ blobsvc
○ cm
○ coordinatorsvc
○ dataheadsvc
○ dtquery
○ ecsportalsvc
○ eventsvc
○ georeceiver
○ metering
○ objcontrolsvc
○ resourcesvc
○ transformsvc
○ vnest
○ fluxd
○ influxd
○ throttler
○ grafana-server
○ dockerd
○ fabric-agent
○ fabric-lifecycle
○ fabric-registry
○ fabric-zookeeper
● host: hostname (FQDN)
● node_id: host ID
NOTE: Filter expressions can also match these tags directly, for example:
r.node_id == "330e4b8f-4491-4ec7-b816-7b10ac9c6abf"
r.process_name == "cm"
Example query:
from(bucket: "monitoring_op")
|> filter(fn: (r) => r._measurement == "procstat" and r._field == "memory_rss" and r.process_name == "vnest" and r.host == "host_name")
|> range(start: -24h)
|> keep(columns: ["_time", "_value", "process_name"])
Example output:
#datatype,string,long,dateTime:RFC3339,long,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,process_name
,,0,2019-08-15T13:05:00Z,2505809920,vnest
,,0,2019-08-15T13:10:00Z,2505887744,vnest
,,0,2019-08-15T13:15:00Z,2506014720,vnest
,,0,2019-08-15T13:20:01Z,2506010624,vnest
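The example output above is InfluxDB annotated CSV: the #datatype/#group/#default lines are annotations, and the first unprefixed row is the header. A small standard-library sketch of reading it, assuming this layout:

```python
import csv
import io

def parse_annotated_csv(text):
    """Parse InfluxDB annotated CSV into a list of row dicts, skipping the
    #datatype/#group/#default annotation lines and using the first
    unannotated record as the header."""
    rows = []
    header = None
    for record in csv.reader(io.StringIO(text)):
        if not record or record[0].startswith("#"):
            continue  # skip blank lines and annotation rows
        if header is None:
            header = record
            continue
        rows.append(dict(zip(header, record)))
    return rows

sample = """#datatype,string,long,dateTime:RFC3339,long,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,process_name
,,0,2019-08-15T13:05:00Z,2505809920,vnest
,,0,2019-08-15T13:10:00Z,2505887744,vnest
"""
for row in parse_annotated_csv(sample):
    print(row["_time"], row["_value"], row["process_name"])
```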
Nodes statistics
Dashboard API
GET /dashboard/nodes/{id}
Database:
● monitoring_op
Measurement:
● cpu (detailed info on available fields and tags: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu)
Fields:
● usage_idle: idle CPU usage (percent)
Tags:
● host: hostname (FQDN)
● node_id: host ID
Example query:
from(bucket: "monitoring_op")
|> filter(fn: (r) => r._measurement == "cpu" and r.cpu == "cpu-total" and r._field == "usage_idle" and r.host == "host_name")
|> range(start: -24h)
|> keep(columns: ["_time", "_value", "host"])
Example output:
#datatype,string,long,dateTime:RFC3339,double,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,host
,,0,2019-08-15T13:20:00Z,19.549454477395525,host_name
,,0,2019-08-15T13:25:00Z,17.920104933062728,host_name
,,0,2019-08-15T13:30:00Z,18.050788903551002,host_name
,,0,2019-08-15T13:35:00Z,19.801364027505095,host_name
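Because usage_idle is the only CPU field used here, utilization is its complement (100 - usage_idle); for example, averaging busy percent over queried samples (illustrative helper, not an ECS API):

```python
def average_cpu_busy(idle_samples):
    """Convert usage_idle percentages into average CPU busy percent."""
    return sum(100.0 - idle for idle in idle_samples) / len(idle_samples)

# usage_idle values rounded from the example output above.
samples = [19.55, 17.92, 18.05, 19.80]
print(round(average_cpu_busy(samples), 2))  # 81.17
```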
Measurement:
● mem (detailed info on available fields and tags: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem)
Fields:
● free: free memory on the host (bytes)
Tags:
● host: hostname (FQDN)
● node_id: host ID
Example query:
from(bucket: "monitoring_op")
|> filter(fn: (r) => r._measurement == "mem" and r._field == "free" and r.host == "host_name")
|> range(start: -24h)
|> keep(columns: ["_time", "_value", "host"])
Example output:
#datatype,string,long,dateTime:RFC3339,long,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,host
,,0,2019-08-15T14:10:00Z,3181088768,host_name
,,0,2019-08-15T14:15:00Z,2988388352,host_name
,,0,2019-08-15T14:20:00Z,3002994688,host_name
,,0,2019-08-15T14:25:00Z,3115741184,host_name
Performance statistics
Dashboard API
GET /dashboard/nodes/{id}
GET /dashboard/zones/localzone
GET /dashboard/zones/localzone/nodes
Dashboard APIs
Lists the APIs that are changed or deprecated.

Table 27. Alternative places to find removed data

Where replacement can be found: See Monitoring list of metrics: Non-Performance > Database monitoring_op > Node system level statistics.

Measurements: cpu, mem, net
Where replacement can be found: See Monitoring list of metrics: Non-Performance > Database monitoring_vdc.

Measurement: cq_node_rebalancing_summary
Where replacement can be found: See Monitoring list of metrics: Non-Performance > Database monitoring_last > Export of configuration framework values.

Measurement: dtquery_cmf
Where replacement can be found: For VDC metrics, see Monitoring list of metrics: Performance > Database monitoring_vdc. For Node metrics, see Monitoring list of metrics: Performance > Database monitoring_main.

Where replacement can be found: For VDC metrics, see Monitoring list of metrics: Non-Performance > Database monitoring_vdc. For Node metrics, see Monitoring list of metrics: Non-Performance > Data for ECS Service I/O Statistics.
5
Examining Service Logs
Describes the location and content of ECS service logs.
Topics:
• ECS service logs
The emcservice user cannot access service logs. When a node is locked using the platform lockdown feature, no user can access the service logs; only an administrator who has permission to access the node can access them.
● authsvc.log: Records information from the authentication service
● blobsvc*.log: Records aspects of the binary large object service (BLOB) service
● cassvc*.log: Records aspects of the CAS service
● coordinatorsvc.log: Records information from the coordinator service
● ecsportalsvc.log: Records information from the ECS Portal service
● eventsvc*.log: Records aspects of the event service. This information is available in the ECS Portal at Monitor >
Events
● hdfssvc*.log: Records aspects of the HDFS service
● objcontrolsvc.log: Records information from the object service
● objheadsvc*.log: Records aspects of the various object heads supported by the object service.
● provisionsvc*.log: Records aspects of the ECS provisioning service
● resourcesvc*.log: Records information related to global resources such as namespaces, buckets, and object users
● dataheadsvc-access.log: Records the aspects of the object heads supported by the object service, the file service
supported by HDFS, and the CAS service.