
ECS

Monitoring Guide
3.6.1

May 2021
Rev. 1.1
Notes, cautions, and warnings

NOTE: A NOTE indicates important information that helps you make better use of your product.

CAUTION: A CAUTION indicates either potential damage to hardware or loss of data and tells you how to avoid
the problem.

WARNING: A WARNING indicates a potential for property damage, personal injury, or death.

© 2021 Dell Inc. or its subsidiaries. All rights reserved. Dell, EMC, and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other
trademarks may be trademarks of their respective owners.
Contents

Figures..........................................................................................................................................5

Tables........................................................................................................................................... 6

Chapter 1: Monitor Basics..............................................................................................................7


View the ECS Portal Dashboard...................................................................................................................................... 7
Upper-right menu bar................................................................................................................................................... 7
View requests................................................................................................................................................................. 7
View capacity utilization...............................................................................................................................................8
View performance..........................................................................................................................................................8
View storage efficiency................................................................................................................................................8
View geo monitoring..................................................................................................................................................... 8
View node and disk health........................................................................................................................................... 8
View alerts....................................................................................................................................................................... 9
View audits...................................................................................................................................................................... 9
Using monitoring pages......................................................................................................................................................9
Table navigation............................................................................................................................................................. 9
Filter by date and time..................................................................................................................................................9
History.............................................................................................................................................................................10
Export icon..................................................................................................................................................................... 11

Chapter 2: Monitoring ECS.......................................................................................................... 12


Monitor metering data...................................................................................................................................................... 12
Metering data................................................................................................................................................................13
Monitor capacity utilization............................................................................................................................................. 14
Read-only system.........................................................................................................................................................14
Capacity forecast......................................................................................................................................................... 14
Monitor capacity...........................................................................................................................................................14
Monitor used capacity.................................................................................................................................................17
Monitor garbage collection data............................................................................................................................... 17
Monitor erasure encoding.......................................................................................................................................... 18
Monitor CAS processing.............................................................................................................................................19
Monitor system health..................................................................................................................................................... 20
Monitor hardware health........................................................................................................................................... 20
Monitor process health............................................................................................................................................... 21
Monitor node rebalancing status............................................................................................................................. 22
Monitor transactions........................................................................................................................................................ 22
Monitor recovery status.................................................................................................................................................. 23
Monitor disk bandwidth................................................................................................................................................... 23
Introduction to geo-replication monitoring................................................................................................................. 23
Monitor geo replication: Rate and Chunks............................................................................................................ 23
Monitor geo replication: Recovery Point Objective (RPO)............................................................................... 24
Monitor geo replication: Failover Processing........................................................................................................24
Monitor geo replication: Bootstrap Processing....................................................................................................25
Cloud hosted VDC monitoring........................................................................................................................................25

Cloud topology............................................................................................................................................................. 26
Cloud replication traffic............................................................................................................................................. 26

Chapter 3: Monitoring Events: Audits and Alerts......................................................................... 28


About event monitoring................................................................................................................................................... 28
Monitor audit data.............................................................................................................................................................28
Audit messages.................................................................................................................................................................. 29
Monitor alerts.....................................................................................................................................................................33
Alert policy.......................................................................................................................................................................... 34
New alert policy........................................................................................................................................................... 34
Acknowledge all alerts......................................................................................................................................................35
Alert messages...................................................................................................................................................................35

Chapter 4: Advanced Monitoring................................................................................................. 48


Advanced Monitoring....................................................................................................................................................... 48
View Advanced Monitoring Dashboards................................................................................................................ 48
Share Advanced Monitoring Dashboards.............................................................................................................. 58
Flux API................................................................................................................................................................................58
Monitoring list of metrics.......................................................................................................................................... 60
Monitoring list of metrics: Non-Performance...................................................................................................... 60
Monitoring list of metrics: Performance................................................................................................................ 70
Flux API replacements for deprecated dashboard API.......................................................................................74
Dashboard APIs.................................................................................................................................................................. 77

Chapter 5: Examining Service Logs..............................................................................................79


ECS service logs................................................................................................................................................................ 79

Part I: Document feedback.......................................................................................................... 80

Figures

1 Upper-right menu bar................................................................................................................................................7


2 Refresh icon................................................................................................................................................................ 9
3 Open Filter panel with date and time range selections................................................................................... 10
4 History chart with active cursor........................................................................................................................... 10
5 Export icons................................................................................................................................................................11

Tables

1 Bucket and namespace metering......................................................................................................................... 13


2 Capacity utilization: VDC........................................................................................................................................ 15
3 Capacity utilization: storage pool......................................................................................................................... 16
4 Capacity utilization: node....................................................................................................................................... 16
5 Capacity utilization: disk..........................................................................................................................................17
6 Used capacity............................................................................................................................................................ 17
7 Garbage collection: garbage detected................................................................................................................ 18
8 Garbage collection: capacity reclaimed...............................................................................................................18
9 Erasure encoding metrics....................................................................................................................................... 18
10 CAS processing metrics..........................................................................................................................................19
11 VDC, node, and process health metrics..............................................................................................................21
12 ECS processes.......................................................................................................................................................... 21
13 Rate and Chunks columns..................................................................................................................................... 24
14 RPO columns............................................................................................................................................................ 24
15 Failover columns...................................................................................................................................................... 24
16 Bootstrap Processing columns.............................................................................................................................25
17 Replication traffic by VDC..................................................................................................................................... 26
18 Replication traffic by replication group.............................................................................................................. 27
19 ECS audit messages................................................................................................................................................29
20 Alert types................................................................................................................................................................. 33
21 ESRS dial home types.............................................................................................................................................33
22 ECS Object alert messages................................................................................................................................... 35
23 ECS fabric alert messages.....................................................................................................................................42
24 Secure Remote Services alert messages........................................................................................................... 47
25 Advanced monitoring dashboards........................................................................................................................48
26 Advanced monitoring dashboard fields.............................................................................................................. 49
27 Alternative places to find removed data............................................................................................................ 77
28 APIs removed in ECS 3.5.0....................................................................................................................................78

1
Monitor Basics
Monitor basics provides critical information about viewing the ECS portal dashboard and using the monitoring pages.
Topics:
• View the ECS Portal Dashboard
• Using monitoring pages

View the ECS Portal Dashboard


The ECS Portal Dashboard provides critical information about the ECS processes on the VDC you are currently logged in to.
The Dashboard is the first page you see after you log in. The title of each panel (box) links to the portal monitoring page that
shows more detail for the monitoring area.

Upper-right menu bar


The upper-right menu bar appears on each ECS Portal page.

Figure 1. Upper-right menu bar

Menu items include the following icons and menus:


1. The Alert icon displays a number that indicates how many unacknowledged alerts are pending for the current VDC. The
number displays 99+ if there are more than 99 alerts. You can click the Alert icon to see the Alert menu, which shows the
five most recent alerts for the current VDC.
2. The Help icon brings up the online documentation for the current portal page.
3. The Guide icon brings up the Getting Started Task Checklist.
4. The VDC menu displays the name of the current VDC. If your AD or LDAP credentials allow you to access more than one
VDC, you can switch the portal view to the other VDCs without entering your credentials.
5. The User menu displays the current user and allows you to log out. The User menu displays the last login time for the user.

View requests
The Requests panel displays the total requests, successful requests, and failed requests.
Failed requests are organized by system error and user error. User failures are typically HTTP 400 errors. System failures are
typically HTTP 500 errors. Click Requests to see more request metrics.
Request statistics do not include replication traffic.
NOTE: For partial upgrade scenarios (for example, during a 3.4 to 3.6 upgrade), nodes on 3.4 pull data from the dashboard API, whereas nodes upgraded to 3.6 pull data from the Flux API. This may result in inconsistent display of data.

View capacity utilization
The Capacity Utilization panel displays the total, used, available, reserved, and percent full capacity.
NOTE: When the storage pool reaches 90% of its total capacity, it does not accept write requests and it becomes a
read-only system. A storage pool must have a minimum of four nodes and must have three or more nodes with more than
10% free capacity in order to allow writes. This reserved space is required to ensure that ECS does not run out of space
while persisting system metadata. If these criteria are not met, the write fails. The ability of a storage pool to accept writes
does not affect the ability of other pools to accept writes. For example, if you have a load balancer that detects a failed
write, the load balancer can redirect the write to another VDC.
Capacity amounts are shown in gibibytes (GiB) and tebibytes (TiB). One GiB is approximately equal to 1.074 gigabytes (GB). One
TiB is approximately equal to 1.1 terabytes (TB).
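These conversions follow from binary versus decimal units; a quick illustrative check:

```python
# Binary (IEC) units are powers of 2; decimal (SI) units are powers of 10.
GIB = 2**30   # bytes in 1 GiB
TIB = 2**40   # bytes in 1 TiB
GB = 10**9    # bytes in 1 GB
TB = 10**12   # bytes in 1 TB

print(GIB / GB)  # ~1.074, so 1 GiB is about 1.074 GB
print(TIB / TB)  # ~1.100, so 1 TiB is about 1.1 TB
```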
The Used capacity indicates the amount of capacity that is in use. Click Capacity Utilization to see more capacity metrics.
The capacity metrics are available in the left menu.

View performance
The Performance panel displays how network read and write operations are currently performing, and the average read/write
performance statistics over the last 24 hours for the VDC.
Click Performance to see more comprehensive performance metrics.
NOTE:
● An SSD Cache Enabled label appears if the feature is enabled on the node. If Read Cache is disabled or the nodes do not have SSD disks, no SSD Cache Enabled label appears.
● For partial upgrade scenarios (for example, during a 3.4 to 3.6 upgrade), nodes on 3.4 pull data from the dashboard API, whereas nodes upgraded to 3.6 pull data from the Flux API. This may result in inconsistent display of data.

View storage efficiency


The Storage Efficiency panel displays the efficiency of the erasure coding (EC) process.
The chart shows the progress of the current EC process, and the other values show the total amount of data that is subject to
EC, the amount of EC data waiting for the EC process, and the current rate of the EC process. Click Storage Efficiency to see
more storage efficiency metrics.

View geo monitoring


The Geo Monitoring panel displays how much data from the local VDC is waiting for geo-replication, and the rate of the
replication.
Recovery Point Objective (RPO) refers to the point in time in the past to which you can recover. The value is the oldest data at
risk of being lost if a local VDC fails before replication is complete. Failover Progress shows the progress of any active failover
that is occurring in the federation involving the local VDC. Bootstrap Progress shows the progress of any active process to
add a new VDC to the federation. Click Geo Monitoring to see more geo-replication metrics.

View node and disk health


The Node & Data Disks panel displays the health status of disks and nodes.
A green check mark beside the node or disk number indicates the number of nodes or disks in good health. A red x indicates bad
health. Click Node & Data Disks to see more hardware health metrics. If the number of bad disks or nodes is a number other
than zero, clicking the count takes you to the corresponding Hardware Health tab (Offline Data Disks or Offline Nodes) on
the System Health page.

NOTE:

● If the data from failed disks has already been recovered and the failed disks are ready for replacement, they do not show in the Node & Data Disks panel. Click Manage Disks under System Health to go to Maintenance, which indicates whether there are disks that are ready for physical replacement. Alternatively, access Maintenance from the left panel menu: Manage > Maintenance.
● The maximum number of connections per node is 1000.

View alerts
The Alerts panel displays a count of critical alerts and errors.
Click Alerts to see the full list of current alerts. Any Critical or Error alerts are linked to the Alerts tab on the Events page
where only the alerts with a severity of Critical or Error are filtered and displayed.

NOTE: Alerts can also be filtered by the Info and Warning severities.

View audits
Audits can be filtered only by date-time range and namespace.

Using monitoring pages


Introduces the basic techniques for using monitoring pages in the ECS Portal.
The ECS Portal monitoring pages share a set of common interactions as described in the following sections:

Table navigation
Highlighted text in a table row indicates a link to a detail display. Selecting the link drills down to the next level of detail. On
drill-down displays, a path string shows your current location in the sequence of drill-down displays. This path string is called a
breadcrumb trail or breadcrumbs for short. Selecting any highlighted breadcrumb jumps up to the associated display.
On some monitoring displays, you can force a table to refresh with the latest data by clicking the Refresh icon.

Figure 2. Refresh icon

Filter by date and time


The standard monitoring filter enables you to narrow results by date and time. It is available on several monitoring pages. Some
pages have more filter types, described on those pages.
You can select a Date Time Range predefined value (in hours, weeks, or months) or select Custom to specify a From and To
date and time. For the To value, you can select the current time. After selecting a Date Time Range, click Apply. The Filter
panel closes and the page content updates. When closed, the Filter panel shows a summary of the applied filter settings and
provides a Clear Filter command and a Refresh symbol.
If you want the Filter panel to stay open, click the Pin icon before you click Apply.

Figure 3. Open Filter panel with date and time range selections

When the table has the Current filter applied, the latest values are displayed. When the table has a date-time range filter
applied, it displays the average value over that period.

History
When you select a History button, all available charts for that row are displayed below the table. You can hover over a chart
from left to right to see a vertical line that helps you find a specific date-time point on the chart. A pop-up display shows the
value and timestamp for that point.
The date-time scale is determined by the filter setting that has been configured. When the Current filter is selected, the charts
show data from the last 24 hours. History data is kept for 60 days.

Figure 4. History chart with active cursor

In the history charts, when the Current filter is selected, if there is no available historical data, No Data displays.

Export icon
The Export icon enables you to export data from all the monitoring tables and graphs to PDF, DOC, Excel, and CSV formats for later consumption. To select the format and export the data, use the Export icon in the upper right of the menu bar on each table and graph.
The exported data can be used to get a longer term view on capacity usage and consumption trends.

Figure 5. Export icons

2
Monitoring ECS
Monitoring ECS provides critical information about monitoring metering data, capacity utilization, system health, transactions, recovery status, disk bandwidth, geo-replication, and cloud hosted VDCs in the ECS Portal.
Topics:
• Monitor metering data
• Monitor capacity utilization
• Monitor system health
• Monitor transactions
• Monitor recovery status
• Monitor disk bandwidth
• Introduction to geo-replication monitoring
• Cloud hosted VDC monitoring

Monitor metering data


You can display metering data for namespaces, or buckets within namespaces, for a specified time period.

About this task


The available metering data is detailed in Metering data.
Using the ECS Management REST API you can retrieve data programmatically with custom clients.
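For example, here is a minimal sketch of such a client in Python. The management port (4443), the /login call, and the /object/billing/... paths are assumptions based on commonly documented ECS Management REST API usage; verify them against the API reference for your release.

```python
# Illustrative sketch only: pull namespace metering (billing) data from the
# ECS Management REST API. Endpoint paths, port, and parameters are assumptions;
# confirm them against the ECS Management REST API reference for your release.
import requests

ECS_MGMT = "https://ecs.example.com:4443"   # hypothetical management endpoint
USER, PASSWORD = "sysadmin", "changeme"     # hypothetical credentials

# Authenticate: a successful /login returns a token in the X-SDS-AUTH-TOKEN header.
login = requests.get(f"{ECS_MGMT}/login", auth=(USER, PASSWORD), verify=False)
login.raise_for_status()
token = login.headers["X-SDS-AUTH-TOKEN"]

# Request metering data for a namespace, asking for a JSON response.
headers = {"X-SDS-AUTH-TOKEN": token, "Accept": "application/json"}
resp = requests.get(
    f"{ECS_MGMT}/object/billing/namespace/my-namespace/info",
    params={"sizeunit": "GB"},
    headers=headers,
    verify=False,
)
resp.raise_for_status()
print(resp.json())   # total size, object count, and related metering fields
```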

Steps
1. In the ECS Portal, select Monitor > Metering.
2. From the Date Time Range menu, select the period for which you want to see the metering data. Select Current to view
the current metering data. Select Custom to specify a custom date-time range.
Metering is not a real-time reporting activity but is performed as a background process and some delay in reporting can
occur. The longest delay is about 15 minutes. However, where the system is under heavy load, or is unstable, longer delays
can be seen. If you are encountering longer delays, contact ECS Customer Support.
If you select Custom, use the From and To calendars to choose the time period for which data will be displayed.
Metering data is kept for 30 days.
NOTE: The Current filter displays the latest available values. A date-time range filter displays average values over the
specified range.

3. Select the namespace for which you want to display metering data. To narrow the list of namespaces, type the first few
letters of the target namespace and click the magnifying glass icon.
If you are a Namespace Administrator, you will only be able to select your namespace.
4. Click the + icon next to each namespace you want to see object data for.
5. To see the data for a particular bucket, click the + icon next to each bucket for which you want to see data.
To narrow the list of buckets, type the first few letters of the target bucket and click the magnifying glass icon.
If you do not specify a bucket, the object metering data will be the totals for all buckets in the namespace.
6. Click Apply to display the metering data for the selected namespace and bucket for the specified time period.
NOTE: While all buckets in a geo-federation can be selected in metering, if a selected bucket does not belong to a replication group that includes the VDC you are logged in to, metering information cannot be retrieved for that bucket. In this case, after a wait, the bucket is listed as No data. To get the metering information for the bucket, log in to the VDC that owns the bucket or any VDC that is part of the replication group to which the bucket belongs.

Depending on the Date Time Range selected, the attributes that are displayed in the Metering Page may change.
If the Current option is selected, only the Namespace, Buckets, Bucket Tags, Total MPU Parts, Total MPU Size, Total Size,
Object Count, and Last Updated attributes are displayed in the table. If Custom or any other time range is chosen,
the Namespace, Buckets, Bucket Tags, Total MPU Parts, Total MPU Size, Total Size, Object Count, Objects Created,
Objects Deleted, Write Traffic and Read Traffic attributes are displayed in the table and the Last Updated attribute is
not displayed.

Metering data
Object metering data for a specified namespace, or a specified bucket within a namespace, can be obtained for a defined time
period at the ECS Portal Monitor > Metering page.
The metering information that is provided is shown in the following table:

Table 1. Bucket and namespace metering


Attribute Description
Namespace Namespace selected.
Buckets Bucket selected for which the metering data applies. If blank, the data is for all
buckets in the namespace.
Bucket Tags Lists any name=value bucket tags associated with the bucket.
Total MPU Parts The number of MPU parts that have been created and not used as part of a
complete MPU operation.
Total MPU Size The total disk size occupied by MPU parts that have been created and not used as
part of a complete MPU operation.
Total Size Total size of the objects that are stored in the selected namespace or bucket at the
end time that is specified in the filter. If the size is less than 1 GB, then the portal
displays 0GB.
Object Count Number of objects that are associated with the selected namespace or bucket at
the end time that is specified in the filter.
Last Updated If the Current filter is selected, Last Updated displays the time until which
metering data can be considered consistent. This can help you determine any
delay in reported metering stats. The metering stats may include some data on
the operations that are performed after the last updated time.
Objects Created Number of objects that are created in the selected namespace or bucket in the
time period.
Objects Deleted Number of objects that are deleted from the selected namespace or bucket in the
time period.
Write Traffic Total of incoming object data (writes) for the selected namespace or bucket during
the specified period. Values are displayed in a size unit that is based on the size of
the data.
Read Traffic Total of outgoing object data (reads) for the selected namespace or bucket during
the specified period. Values are displayed in a size unit that is based on the size of
the data.

NOTE: When you perform an update operation on an object, the metering services report the overwrite as both Objects Created and Objects Deleted. The deleted object is reported because of the expected OVERWRITE behavior of an object; however, no object is actually deleted.

NOTE:
● Metering is not a real-time reporting activity but is performed as a background process and some delay in reporting can
occur. The longest delay is about 15 minutes. However, where the system is under heavy load, or is unstable, longer
delays can be seen. If you are encountering longer delays, contact ECS Customer Support.

● A delay of 2 hours 15 minutes can occur before the statistics for S3 and fan-out objects are reflected.

NOTE: When there are many concurrent requests, ECS metering can ignore some requests so that they do not impact system performance. Hence, the Write Traffic value can show less than the actual write bandwidth.

Monitor capacity utilization


You can monitor capacity utilization from the ECS Portal Monitor > Capacity Utilization page. You can monitor the capacity
utilization of storage pools, nodes and the entire VDC.
The Capacity Utilization page has the following tabs:
● Capacity: View summary data about the total, used, available, and reserved storage capacity of storage pools and nodes
● Used Capacity: View data about the used capacity for the VDC and storage pools
● Garbage Collection: View data about garbage detected, recovered capacity, capacity that is pending reclamation, and
capacity that cannot be reclaimed
● Erasure Encoding: View erasure-encoded data in a local storage pool, data that is pending erasure encoding, and the current
erasure encoding rate and estimated completion time
● CAS Processing: View garbage data collection for CAS (Content Addressable Storage) buckets.
Tables showing capacity usage data display in each of the tabs. You can drill down into the nodes and to individual disks by
selecting the appropriate link in each table. Each row has an associated History display that enables you to see how the data
has changed over time. To graphically display how capacity has changed over time, select History for the storage pool, node, or
disk that you are interested in. History data is kept for 30 days.
See Using monitoring pages for information about navigating the tables.

Read-only system
When the storage pool reaches 90% of its total capacity, it does not accept write requests and it becomes a read-only system.
A storage pool must have a minimum of four nodes and must have three or more nodes with more than 10% free capacity in
order to allow writes. This reserved space is required to ensure that ECS does not run out of space while persisting system
metadata. If these criteria are not met, the write fails. The ability of a storage pool to accept writes does not affect the ability of
other pools to accept writes. For example, if you have a load balancer that detects a failed write, the load balancer can redirect
the write to another VDC.
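The write-acceptance rule can be restated as a simple predicate. The following sketch only restates the conditions described above for illustration; it is not how ECS itself implements the check.

```python
# Illustrative restatement of the write-acceptance rule described above; not ECS internals.
def pool_accepts_writes(free_fraction_per_node, pool_used_fraction):
    """free_fraction_per_node: free-capacity fraction for each node in the storage pool."""
    if pool_used_fraction >= 0.90:       # pool at 90% of total capacity becomes read-only
        return False
    if len(free_fraction_per_node) < 4:  # a minimum of four nodes is required
        return False
    nodes_with_headroom = sum(1 for f in free_fraction_per_node if f > 0.10)
    return nodes_with_headroom >= 3      # three or more nodes need more than 10% free

# Example: four nodes, but only two have more than 10% free capacity -> writes fail.
print(pool_accepts_writes([0.25, 0.12, 0.08, 0.05], pool_used_fraction=0.75))  # False
```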

Capacity forecast
You can use the Capacity tab to monitor when the capacity is expected to reach 50% and 80%. The capacity forecast is based on the current usage pattern, shown as 1-day, 7-day, and 30-day usage trends. Capacity Forecast data is shown either for the entire VDC, for an individual storage pool, or for nodes.
NOTE: A capacity ETA shown as N/A could be due to the following reasons:
1. There is not enough historical data for a forecast. At least two data points (1 hour apart) are required. This can happen when the ECS system is newly deployed. Click the History button at the VDC, storage pool, or node level to verify.
2. If capacity has already passed the intended target, the ETA is set to 0.
3. The used capacity shows a downward trend for the specified time (for example, 7 days). Click the History button or get the history through the dashboard API to verify.
To see the capacity forecast data from the ECS Portal, select Monitor > Capacity Utilization > Capacity. The Capacity tab is the default.
To see the data about total capacity, used capacity, and available capacity, click History.
Capacity Forecast is calculated based on the total capacity and used capacity.
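As a rough illustration of trend-based forecasting (not the exact algorithm that ECS uses), an ETA toward a target utilization can be extrapolated from historical capacity samples:

```python
# Illustrative linear extrapolation of when used capacity reaches a target percentage.
# Not the ECS algorithm; ECS bases its forecast on 1-day, 7-day, and 30-day usage trends.
def eta_hours_to_target(samples, total_capacity, target_fraction):
    """samples: (hour, used_capacity) pairs in chronological order; at least two required."""
    if len(samples) < 2:
        return None                          # not enough history -> ETA shows N/A
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if u1 >= total_capacity * target_fraction:
        return 0                             # target already passed -> ETA is 0
    rate = (u1 - u0) / (t1 - t0)             # capacity units consumed per hour
    if rate <= 0:
        return None                          # usage trending down -> ETA shows N/A
    return (total_capacity * target_fraction - u1) / rate

# Example: a 100 TiB pool that used 40 TiB a week ago and 46 TiB now -> hours until 80% full.
print(eta_hours_to_target([(0, 40), (168, 46)], total_capacity=100, target_fraction=0.80))
```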

Monitor capacity
You can use the Capacity tab to view capacity utilization data for:

● VDC (VDC capacity utilization)
● Storage Pools (Storage pool capacity utilization)
● Nodes (Node capacity utilization)
● Disks (Disk capacity utilization)
● Used Capacity (Monitor used capacity)
You can view summary storage usage data about total, used, available, and reserved storage capacity for storage pools and
nodes.
Reserved capacity is the approximately 10 percent of the total capacity that is reserved for failure handling and for performing
erasure encoding or XOR operations. Reserved capacity is not available for writing new data.
The tab opens with the Storage Pools capacity table displayed. To view capacity data for individual nodes, click the appropriate
link in the Nodes (Online) column to display the Nodes table. Click the appropriate link in the Disks (Online) column to view
capacity data for individual disks.
You can display average values over a selected date-time range or over a custom time range using the Filter drop-down menu.
The Current filter displays the latest available values and is the default filter value.
When the table has the Date Time Range filter set to Current (the default setting), the table displays the latest values and
the history graphs display values over the last 24-hour period. When the table has a Date Time Range filter applied (other than
Current), it displays the average value over that period.

VDC capacity utilization


Table 2. Capacity utilization: VDC
Attribute Description
VDC Name of the VDC.
Per 1 Day Trend 50% Forecasts when VDC capacity is expected to reach 50%, based on the 1-day usage trend.
Per 7 Day Trend 50% Forecasts when VDC capacity is expected to reach 50%, based on the 7-day usage trend.
Per 30 Day Trend 50% Forecasts when VDC capacity is expected to reach 50%, based on the 30-day usage trend.
Per 1 Day Trend 80% Forecasts when VDC capacity is expected to reach 80%, based on the 1-day usage trend.
Per 7 Day Trend 80% Forecasts when VDC capacity is expected to reach 80%, based on the 7-day usage trend.
Per 30 Day Trend 80% Forecasts when VDC capacity is expected to reach 80%, based on the 30-day usage trend.
Total Total capacity of the VDC that is online. This is the total of the capacity that is
already used and the capacity still free for allocation.
Used Used online capacity in the VDC.
Available (Reserved) Online capacity available for use, including the approximately 10% of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations.
NOTE: If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.

Actions History provides a graphic display of the data. If the Current filter (default) is
selected, the History button displays total, used, and available capacity for the last
24 hours. History data is kept for 60 days.

Storage pool capacity utilization
Table 3. Capacity utilization: storage pool
Attribute Description
Storage Pool Name of the storage pool.
Nodes (Online) Number of nodes in the storage pool followed by the number of those nodes online.
Click this number to open: Node capacity utilization.
Online Nodes with Sufficient Disk Space Number of online nodes that have sufficient disk space to accept new data. If too many disks are too full to accept new data, the performance of the system may be impacted.
NOTE: Does not appear if a filter other than Current is applied.

Disks (Online) Number of disks in the storage pool followed by the number of those disks that are
online.
Total Total capacity of the storage pool that is online. This is the total of the capacity
that is already used and the capacity still free for allocation.
Used Used online capacity in the storage pool.
Available (Reserved) Online capacity available for use, including the approximately 10% of the total capacity that is reserved for failure handling and for performing erasure encoding or XOR operations.
NOTE: If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.

Actions History provides a graphic display of the data. If the Current filter (default) is
selected, the History button displays total, used, and available capacity for the last
24 hours. History data is kept for 60 days.

Node capacity utilization


Table 4. Capacity utilization: node
Attribute Description
Nodes Fully qualified domain name (FQDN) of the node.
Disks (Online) Number of disks that are associated with the node followed by the number of those
disks that are online. Click disk number to open: Disk capacity utilization
Total Total online capacity provided by the online disks within the node. This is the total
of the capacity that is already used and the capacity still free for allocation.
Used Online capacity used within the node.
Available (Reserved) Remaining online capacity available in the node including reserved capacity.
NOTE: If the Current filter is applied, Available (Reserved) displays. If a filter other than Current is applied, only Available displays.

Offline Total capacity of the node that is offline.
NOTE: Displays only if the Current filter is applied.

Online Status Indicates whether the node is online or offline. A check mark indicates that the
node status is Good.

Actions History provides a graphic display of the data. If the Current filter (default) is
selected, the History button displays total, used, and available capacity for the last
24 hours. History data is kept for 60 days.

Disk capacity utilization


Table 5. Capacity utilization: disk
Attribute Description
Disks Disk identifier.
Total Total capacity provided by the disk.
Used Capacity used on the disk.
Available Remaining capacity available on the disk.
Online Status Indicates whether the disk is online or offline. The check mark indicates that the
disk status is Good.
Actions History provides a graphic display of the data. If the Current filter (default) is
selected, the History button displays total, used, and available capacity for the last
24 hours. History data is kept for 60 days.

Monitor used capacity


You can use the Used Capacity tab to view the used storage capacity for the current VDC and for each storage pool in the
VDC.

Table 6. Used capacity


Storage use Description
User Data The capacity that is used for the repository chunks representing data uploaded by ECS
users.
System Metadata The capacity that is used by the ECS processes that track and describe the data in the
system.
Protection Overhead The combined overhead of triple mirroring and erasure coding for all user data, system
metadata, and geo data protection chunks protected locally.
Geo Cache The capacity used to cache chunks that are accessed locally but not stored locally.
Geo Copy The capacity that is used for Geo-replication chunks stored on the current VDC.
Garbage The capacity used by data that is no longer in use.

Storage usage is shown as color-coded bars, one color for the current VDC, and a different color for its storage pools. Tool tips
for each colored bar correspond to the status information in the numeric status line.

Monitor garbage collection data


You can use the Garbage Collection tab to monitor garbage collection data for the entire VDC or for individual storage pools.
Use the Virtual Data Center drop-down menu to select the storage type: Virtual Data Center or Storage Pool. Virtual Data
Center is the default.
Garbage collection is enabled by default at installation. Contact your customer support representative to disable or reenable this
feature.
The Garbage Collection page has the following subtabs:

● Garbage Detected: View summary garbage collection data.
● Capacity Reclaimed: View data about storage capacity reclaimed by the garbage collection process.

Garbage Detected
Click the Virtual Data Center drop-down menu to view garbage detection data for the entire VDC or individual storage pools.

Table 7. Garbage collection: garbage detected


Attribute Description
Storage Type The VDC or storage pool for which to view garbage collection data.
Total Garbage Detected The amount of reclaimable storage capacity detected on the system.
Capacity Reclaimed The amount of storage capacity reclaimed by the garbage collection process.
Capacity Pending Reclamation The amount of storage capacity that is identified as reclaimable but not reclaimed
yet.
UnReclaimable Garbage The amount of storage capacity that cannot be reclaimed currently.

Capacity Reclaimed
Click the Filter button to set a filter for the reclamation data by VDC or storage pool over a date/time range.

Table 8. Garbage collection: capacity reclaimed


Attribute Description
Storage Type The VDC or storage pool for which to view capacity reclaimed data.
Capacity Reclaimed The amount of storage capacity recovered following garbage collection.
User Data Reclaimed The amount of user data recovered.
System Metadata Reclaimed The amount of system metadata recovered.
Actions History provides a graphic display of the data. If the Current filter (default) is
selected, the History button displays the total reclaimed capacity for the last 24
hours. History data is kept for 60 days.

Monitor erasure encoding


You can use the Erasure Encoding tab to monitor the total user data and erasure encoded data in a local storage pool. It also
shows the current encoding rate and the estimated completion time.
You can display average values over a selected date-time range or over a custom time range using the Filter drop-down menu.
The Current filter displays the latest available values and is the default filter value.

Table 9. Erasure encoding metrics


Column Description
Storage Pool The storage pools from the current VDC.
Total Coding Data The total logical size of all data chunks in the storage pool which are subject to
erasure encoding.
Total Coded Data The total logical size of all erasure-encoded chunks in the storage pool.
Coded (%) The percent of data in the storage pool that is erasure encoded. Percent values
display with three decimal places in the history chart for accurate plotting. Percent
values display with two decimal places in the table, consistent with the format of
the other values in the table.

Coding Rate The rate at which any current data waiting for erasure encoding is being processed.
Est. Time to Complete The estimated completion time extrapolated from the current erasure encoding
rate.
Actions ● History provides a graphic display of the total coding data, total coded data,
percent of data coded, and coding rate per second. History data is kept for 60
days.
● If the Current filter is selected, History displays default history for the last 24
hours.

Monitor CAS processing


You can use the CAS Processing tab to monitor unused CAS (Content Addressable Storage) objects in CAS buckets within a
selected namespace over a specified time range. The unused CAS objects that are monitored by ECS include unreferenced blobs
and expired reflections.
In Centera terminology, there are three types of CAS objects: blob, clip, and reflection.
● Blob: CAS data objects are called blobs (binary large objects). Blobs store data. Blobs can be referenced by data objects
of a different type called clips. A blob is referenced by its Content Address (CA) that is stored in the Content Description
File (CDF) that references the blob. The logical combination of a CDF and a Blob is called a Clip. The hash of a CDF is the
Clip-ID. There can be multiple Clips for the same Blob with different CDFs (different metadata but with same user data,
single instance storage). When blobs are not referenced by live clips, these unreferenced blobs become garbage data.
● C-Clip: Combination of a CDF and its related blobs
● Reflection: CDF of a deleted C-Clip. A reflection is created after the deletion of a C-Clip and provides an audit trail for
each deleted C-Clip. Reflections may have expiration times. (If there is no configured expiration time for a reflection, the
reflection is never deleted.)
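A minimal conceptual sketch of these relationships (not ECS's internal data model) may help clarify what counts as an unreferenced blob:

```python
# Conceptual sketch of CAS object relationships; not ECS's internal data model.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Blob:                       # user data, addressed by its Content Address (CA)
    content_address: str

@dataclass
class CDF:                        # Content Description File: metadata plus blob references
    clip_id: str                  # hash of the CDF
    blob_refs: List[str]          # Content Addresses of the blobs it references

@dataclass
class Reflection:                 # CDF left behind when a C-Clip is deleted (audit trail)
    clip_id: str
    expiration_time: Optional[float] = None   # None: the reflection is never deleted

def unreferenced_blobs(blobs: List[Blob], live_cdfs: List[CDF]) -> List[Blob]:
    """Blobs that no live clip references become garbage data."""
    live_refs = {ca for cdf in live_cdfs for ca in cdf.blob_refs}
    return [b for b in blobs if b.content_address not in live_refs]
```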
Click the Filter drop-down menu to select a namespace containing CAS buckets and to set a date/time range to view the
number and size of unreferenced blobs and expired reflections in CAS buckets.
Important: For ECS systems with existing CAS data that upgrade to 3.2.1, there is a CAS garbage data bootstrap process
that is automatically triggered post upgrade. The bootstrap process builds necessary references over the existing CAS data
and can require a significant amount of time depending on the amount of existing CAS data. During the bootstrap process,
the unreferenced blob and reflection values will not change on the CAS Processing page. For example, you see zero for the Unreferenced Blob Data Detected and Unreferenced Blobs Detected values. The values will not change until after the
bootstrap process is complete. If you see that the values do not change over an extended period, call customer support.
When you search for a namespace (using the Search... option at the bottom of the list of namespaces in the Namespace
drop-down field), the search functionality is based on prefixes only. For example, a search for fin returns finance-
namespace-dev, while a search for dev would return nothing.

Table 10. CAS processing metrics


Attribute Description
Bucket The name of the bucket containing CAS data.
Unreferenced Blob Data Detected The amount of unreferenced blob data in the bucket (in bytes).
Unreferenced Blobs Detected The number of unreferenced blobs in the bucket.
Reflection Data Detected The amount of reflection data in the bucket (in bytes).
Reflections Expired The number of expired reflections in the bucket.
Actions History provides a graphic display of the unreferenced blob and reflection data. If
the Current filter (default) is selected, the History button displays the data for
the last 24 hours. History data is kept for 60 days.

Monitor system health
You can monitor system health from the ECS Portal Monitor > System Health page.
The System Health page has the following tabs:
● Hardware Health: View data about the status of nodes and disks.
● Process Health: View data about the status of the NIC, CPU, and memory.
● Node Rebalancing: View data about the status of node rebalancing operations.

Monitor hardware health


You can use the Hardware Health tab to obtain the health of disks and nodes.

About this task


The Hardware Health tab is accessed from the ECS Portal at Monitor > System Health > Hardware Health. The following
states describe hardware health:
● Good: The node is in normal operating condition.
● Suspect: Either the node is transitioning from good to bad because of decreasing hardware metrics, or there is a problem
with a lower-level hardware component, or the hardware is not detectable by the system because of connectivity problems.
● Bad: The node needs replacement.
Disk states have the following meanings:
● Good: The system is reading from and writing to the disk.
● Suspect: The system no longer writes to the disk but reads from it. Swarms of suspect disks are likely caused by
connectivity problems at a node. These disks transition back to Good when the connectivity issues clear up.
● Bad: The system neither reads from nor writes to the disk. Replace the disk. Once a disk has been identified as bad by the
ECS system, it cannot be reused anywhere in the ECS system. Because of ECS data protection, when a disk fails, copies
of the data that was once on the disk are re-created on other disks in the system. A bad disk only represents a loss of
capacity to the system, not a loss of data. When the disk is replaced, no data is restored to the new disk.
It becomes raw capacity for the system.
● Missing: The disk is a known disk that is unreachable. The disk may be transitioning between states, disconnected, or pulled.
● Removed: The disk is one that the system has completed recovery on and removed from the storage engine's list of valid
disks. A history of all removed disks is displayed in the ECS UI.
● Not Accessible: If a node is not accessible, then all its disks have this status. It indicates that the actual status of the disk is
not available to ECS.
NOTE: The Current filter displays the latest available values. A date-time range filter displays average values over the
specified range. Value data is kept for 60 days.

Steps
1. Select Monitor > System Health and select the Hardware Health tab.
By default the Offline Nodes subtab displays. This table may be empty if all nodes are online. Similarly, the Offline Data
Disks subtab may be empty if all disks are online.
2. Select the Offline Nodes and Offline Data Disks subtabs to view a summary.
3. Select the All Nodes and Data Disks subtab to drill down to nodes and disks.
4. Click the node name to drill down to its disk health page.
NOTE: The Slot Info value always matches the physical slot ID in ECS U-Series, C-Series, and D-Series Appliances.
This makes Slot Info useful for quickly locating a disk during disk replacement service. Some Certified Hardware
installations with ECS Software may not report useful or reliable data for Slot Info.

NOTE: Monitor the health of online and offline storage pool nodes and data disks. All data disks that belong to the
selected node are listed here. SSD Read Caches are not included.

Monitor process health
You can use the Process Health tab to obtain metrics that can help assess the health of the VDC, node, or node process.

About this task


The Process Health tab is accessed from the ECS Portal at Monitor > System Health > Process Health.

NOTE: When you click Process Health, the Process Health - Overview dashboard opens in a new Grafana window.

Process Health dashboards can also be accessed from Advanced Monitoring by expanding Data Access Performance - Overview:
● Process Health - by Nodes
● Process Health - Overview
● Process Health - Process List by Node

Table 11. VDC, node, and process health metrics


Metric label Level Description
Avg. NIC Bandwidth VDC and Node Average bandwidth of the network interface controller
hardware that is used by the selected VDC or node.
Avg. CPU Usage (%) VDC and Node Average percentage of the CPU hardware that is used by
the selected VDC or node.
Avg. Memory Usage VDC and Node Average usage of the aggregate memory available to the
VDC or node.
Relative NIC (%) VDC and Node Percentage of the available bandwidth of the network
interface controller hardware that is used by the selected
VDC or node.
Relative Memory (%) VDC and Node Percentage of the memory used relative to the memory
available to the selected VDC or node.
CPU Usage Process Percentage of the node's CPU used by the process. The
list of processes that are tracked is not the complete list
of processes running on the node. The sum of the CPU
used by the processes is not equal to the CPU usage
shown for the node.
Memory Usage Process The memory used by the process.
Relative Memory (%) Process Percentage of the memory used relative to the memory
available to the process.
Avg. # Thread Process Average number of threads used by the process.
Last Restart Process The last time the process restarted on the node.

Table 12. ECS processes


Process Description
Blob Service (blobsvc) Manages the following tables: Object (OB), Listing (LS), and Repo Chunk
Reference (RR).
Chunk Manager (cm) Manages the following tables: Chunk (CT), Btree Reference (BR). Provides
the logic to handle various events based on the chunk's current state and
decide which state to transition to next.
Directory Table Query (dtquery) Provides REST APIs to get Directory Table (DT) details.
GeoReceiver (georeceiver) Receives requests for chunks in the current VDC that are not owned by the
current VDC (secondary chunks). It then requests Chunk Manager to start
an operation to track the copy chunk creation and select three replicas. The
GeoReceiver process then writes the datastream to the three instances. On
successful completion, it directs Chunk Manager to commit the copy chunk.


NOTE: GeoReceiver connections running on different public networks


must be made configurable. GeoReceiver thread pool being unlimited,
leads to high amount of TCP connections and service restarts causing
DU.

Head Service (headsvc) Manages object head protocols: S3, OpenStack Swift, EMC Atmos, CAS, and
HDFS.
Metering (metering) Manages the following tables: Metering Aggregate (MA) and Metering Raw
(MR).
Object Control Service (objcontrolsvc) Provides REST APIs for configuring the ECS cluster, managing ECS
resources, and monitoring the system.
Provision Service (provisionsvc) Manages the provisioning of storage resources and user access. It handles
user management, authorization, and authentication for all provisioning
requests, resource management, and multi-tenancy support.
Resource Service (resourcesvc) Manages the following tables: Resource Table (RT) which handles replication
groups, buckets, users, namespace information and so on.
Record Manager (rm) Manages PR (Partition Record) table (journal region).
Storage Service Manager (ssm) Manages the following tables: Storage Space (SS) which contain disk block
usage and disk to chunk mapping. Interacts with one or more Storage
Servers and manages the active/free chunks on the corresponding servers.
Directs I/O operations to the disks.
Statistics Service (statsvc) Tracks various information on storage processes. These statistics can be
used to monitor the system.
VNest (vnest) Provides distributed synchronization and group services. A subset of data
nodes will be group members responsible for serving the key/value requests.
VNest services running on other nodes will listen for configuration updates
and be ready to be added to the group.

See Advanced Monitoring, Process Health - by Nodes, Process Health - Overview, and Process Health - Process List by Node
for details.

Monitor node rebalancing status


Use the Node Rebalancing tab to monitor the status of data rebalancing operations when nodes are added to, or removed
from, a cluster. Node rebalancing is enabled by default at installation. Contact your customer support representative to disable
or re-enable this feature.

Prerequisites
Access the Node Rebalancing tab from the ECS Portal at Monitor > System Health > Node Rebalancing.

NOTE: When you click Node Rebalancing, the Node Rebalancing dashboard opens in a new Grafana window.

The Node Rebalancing dashboard can also be accessed from Advanced Monitoring by expanding Data Access Performance -
Overview and selecting Node Rebalancing.
See Advanced Monitoring and Node Rebalancing for details.

Monitor transactions
You can monitor requests and network performance for VDCs and nodes from the Monitor > Transactions page.
Access the Transactions tab from the ECS Portal at Monitor > Transactions.

NOTE: When you click Transactions, the Data Access Performance - Overview dashboard opens in a new Grafana
window.
The Transactions data can also be accessed from Advanced Monitoring > Data Access Performance - Overview.
See Advanced Monitoring and Data Access Performance - Overview for details.

Monitor recovery status


You can use the Recovery Status page to monitor the data recovered by the system.

About this task


Recovery is the process of rebuilding data after any local condition that results in bad data (chunks). The Recovery Status
page is accessed from the ECS Portal at Monitor > Recovery Status.

NOTE: When you click Recovery Status, the Recovery Status dashboard opens in a new Grafana window.

The Recovery Status dashboard can also be accessed from Advanced Monitoring by expanding Data Access Performance -
Overview and selecting Recovery Status.
See Advanced Monitoring for details.

Monitor disk bandwidth


You can use the Disk Bandwidth page to monitor the disk usage metrics at the VDC or individual node level.

About this task


The Disk Bandwidth page is accessed from the ECS Portal at Monitor > Disk Bandwidth.

NOTE: When you click Disk Bandwidth, the Disk Bandwidth - Overview dashboard opens in a new Grafana window.

Disk Bandwidth dashboards can also be accessed from Advanced Monitoring by expanding Data Access Performance - Overview and selecting:
● Disk Bandwidth - by Nodes
● Disk Bandwidth - Overview
See Advanced Monitoring, Disk Bandwidth - by Nodes and Disk Bandwidth - Overview for details.

Introduction to geo-replication monitoring


You can use the Geo Replication page to monitor the replication of data across the VDCs that make up a replication group.
The Geo Replication page is accessed from the ECS Portal at Monitor > Geo Replication and provides four tabs:
● Rate and Chunks
● Recovery Point Objective (RPO)
● Failover Processing
● Bootstrap Processing

Monitor geo replication: Rate and Chunks


You can use the Rate and Chunks tab to obtain metrics about the network traffic for geo-replication and the chunks waiting
for replication by a replication group or remote VDC.
The Rate and Chunks tab is accessed from the ECS Portal at Monitor > Geo Replication > Rate and Chunks.

Table 13. Rate and Chunks columns
Column Description
Replication Group Lists the replication groups of which this VDC is a member. Click a replication
group to see a table of remote VDCs in the replication group and their
statistics. Click the Replication Groups link above the table to return to the
default view.
Write Traffic The current rate of writes to all remote VDCs or individual remote VDC in the
replication group.
Read Traffic The current rate of reads to all remote VDCs or individual remote VDC in the
replication group.
User Data Pending Replication The total logical size of user data waiting for replication for the replication
group or remote VDC.
Metadata Pending Replication The total logical size of metadata waiting for replication for the replication
group or remote VDC.
Data Pending XOR The total logical size of all data waiting to be processed by the XOR
compression algorithm in the local VDC for the replication group or remote
VDC.

Monitor geo replication: Recovery Point Objective (RPO)


You can use the RPO tab to view the recovery point objective for a replication group and its remote VDCs. The RPO refers to
the point in time in the past to which you can recover. The value presented is the oldest data at risk of being lost if a local VDC
fails before replication is complete.
The RPO tab is accessed from the ECS Portal at Monitor > Geo Replication > RPO.

Table 14. RPO columns


Column Description
Remote Replication Group\Remote VDC At the VDC level, lists all remote replication groups of which the local VDC is
a member. At the replication group level, this column lists the remote VDCs
in the replication group.
Overall RPO The recent time period for which data might be lost in the event of a local
VDC failure.

Monitor geo replication: Failover Processing


You can use the Failover Processing tab to view the metrics on the process to re-replicate data following permanent failure of
a remote VDC.
The Failover Processing tab is accessed from the ECS Portal at Monitor > Geo Replication > Failover Processing.

Table 15. Failover columns


Field Description
Replication Group Lists the replication groups that the local VDC is a member of.
Failed VDC Identifies a failed VDC that is part of the replication group.
User Data Pending Re-replication When a VDC fails, user data chunks replicated to the failed VDC have to be
re-replicated to a different VDC. This field reports the logical size of all user
data (repository) chunks waiting re-replication to a different VDC.
Metadata Pending Re-replication When a VDC fails, metadata chunks replicated to the failed VDC have to
be re-replicated to a different VDC. This field reports the logical size of all
metadata chunks waiting re-replication to a different VDC.

Data Pending XOR Decoding Shows the count and total logical size of chunks waiting to be retrieved by
the XOR compression scheme.
Failover State ● BLIND_REPLAY_DONE
● REPLICATION_CHECK_DONE: The process that makes sure that all
replication chunks are in an acceptable state and replication has
completed successfully.
● CONSISTENCY_CHECK_DONE: The process that makes sure that all
system metadata is fully consistent with other replicated data and has
completed successfully.
● ZONE_SYNC_DONE: The synchronization of the failed VDC has
completed successfully.
● ZONE_BOOTSTRAP_DONE: The bootstrap process on the failed VDC
has completed successfully.
● ZONE_FAILOVER_DONE: The failover process has completed
successfully.
Failover Progress A percentage indicator for the overall status of the failover process.

Monitor geo replication: Bootstrap Processing


You can use the Bootstrap Processing tab to monitor the copying of user data and metadata to a VDC that has been added to
a replication group.
The Bootstrap Processing tab is accessed from the ECS Portal at Monitor > Geo Replication > Bootstrap Processing.

Table 16. Bootstrap Processing columns


Column Description
Replication Group This column provides the list of replication groups of which the local VDC is
a member and that are adding new VDCs. Each row provides metrics for the
specified replication group.
Added VDC The VDC being added to the specified replication group.
User Data Pending Replication The logical size of all user data (repository) chunks waiting for replication to
the new VDC.
Metadata Pending Replication The logical size of all system metadata waiting for replication to the new
VDC.
Bootstrap State The bootstrap state. Can be:
● BTreeScan
● ReplicateBTree
● ReplicateBTreeMarker
● ReplicateJournal
● Done
Bootstrap Progress (%) The completion percent of the entire bootstrap process.

Cloud hosted VDC monitoring


ECS can identify whether a site is hosted or on-premise, and the ECS Management REST API can retrieve information about
the utilization and performance of hosted sites.
Where an ECS system includes a hosted site, the ECS Portal displays a top-level Cloud menu that enables administrators to
see how the hosted sites are used as part of replication groups and to view the traffic to and from the hosted site in terms
of bandwidth utilization and latency. The portal also shows the traffic to and from on-premise sites so that it can be compared
with hosted site traffic.

The Cloud menu is not shown if the ECS system uses only on-premise sites.
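
The utilization and performance information behind the Cloud pages can also be retrieved programmatically through the ECS Management REST API. The following minimal sketch assumes the usual ECS management login flow (HTTPS on port 4443, with the session token returned in the X-SDS-AUTH-TOKEN response header); the host name, credentials, and the resource path queried are placeholders, so verify the exact endpoint in the ECS Management REST API reference for your release.

# Minimal sketch: authenticate to the ECS Management REST API and issue a GET.
# The endpoint host, credentials, and resource path below are placeholders;
# confirm the exact paths in the ECS Management REST API reference.
import requests

MGMT = "https://ecs.example.com:4443"            # hypothetical management endpoint
USER, PASSWORD = "monitoring_admin", "ChangeMe"  # hypothetical credentials

# Log in with HTTP Basic authentication; the session token comes back as a header.
login = requests.get(f"{MGMT}/login", auth=(USER, PASSWORD), verify=False)
login.raise_for_status()
token = login.headers["X-SDS-AUTH-TOKEN"]

# Reuse the token on subsequent management calls, for example a dashboard query
# for the local VDC (the path shown is illustrative).
headers = {"X-SDS-AUTH-TOKEN": token, "Accept": "application/json"}
resp = requests.get(f"{MGMT}/dashboard/zones/localzone", headers=headers, verify=False)
resp.raise_for_status()
print(resp.json())

Self-signed certificates are common on the management endpoint, which is why the sketch disables verification; in production, supply a CA bundle instead.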

Cloud topology
You can use the Cloud topology summary information to see how the ECS system is making use of hosted VDCs.
The Cloud > Topology page shows the hosted VDCs that are part of an ECS federated system, and shows the relationship
between the hosted VDC and any on-premise VDCs.

Cloud Hosted VDCs


The Cloud Hosted VDCs table shows the hosted VDCs that are present in the ECS system. Currently ECS supports a single
hosted site.

Related On-Premise VDCs


The Related On-Premise VDCs table shows the on-premise VDCs that are part of the ECS federation.

Related Replication Groups


The Related Replication Groups table shows the replication groups that contain a storage pool contributed by a selected hosted
VDC. The Hosted VDC is selected in the Cloud Hosted VDC table.
A primary use case for using a hosted VDC is the Passive configuration in which the hosted VDC provides a site for replication
data but cannot be used as an active site by users. However, where the active operation of the hosted VDC is allowed, the
hosted VDC can be included in replication groups where the type is Passive.
The table shows the replication group type and the VDC storage pools that are contributing to the replication group, at least one
of which will be a hosted VDC.

Cloud replication traffic


You can use the cloud replication traffic information to see the performance of hosted VDCs and compare it with on-premise
VDCs.
The Cloud > Replication page shows replication traffic by VDC and by replication group.
NOTE: The Current filter displays the latest available values. A date-time range filter displays average values over the
specified range.

Virtual Data Centers


The Virtual Data Centers tab shows each VDC, whether hosted or on-premise, and provides aggregated traffic figures for all
replication groups associated with a VDC.

Table 17. Replication traffic by VDC


Attribute Description
Read Latency The average latency in milliseconds for reads from all replication groups associated
with the selected VDC.
Write Latency The average latency in milliseconds for writes to all replication groups associated
with the selected VDC.
Read Bandwidth The bandwidth utilized by reads from all replication groups associated with the
selected VDC.
Write Bandwidth The bandwidth utilized by writes from all replication groups associated with the
selected VDC.

Replication Groups
The Replication Groups tab shows each replication group and provides traffic data for a VDC for each replication group that
it contributes to. A VDC might have a storage pool that is in more than one replication group, and this display allows you to see
the traffic associated with each replication group.

Table 18. Replication traffic by replication group


Attribute Description
Read Latency The average latency in milliseconds for reads from the selected VDC that relate to
the specified replication group.
Write Latency The average latency in milliseconds for writes to the selected VDC that relate to
the specified replication group.
Read Bandwidth The bandwidth utilized by reads from the selected VDC that relate to the
specified replication group.
Write Bandwidth The bandwidth utilized by writes to the selected VDC that relate to the specified
replication group.

3
Monitoring Events: Audits and Alerts
Monitoring events provides critical information about the available event monitoring messages (audit and alert) in the ECS Portal.
Topics:
• About event monitoring
• Monitor audit data
• Audit messages
• Monitor alerts
• Alert policy
• Acknowledge all alerts
• Alert messages

About event monitoring


You can view the available event monitoring messages (audit and alert) from the ECS Portal.
The Monitor > Events page has two tabs:
● Audit: All activity by users working with the portal, the ECS REST APIs, and the ECS CLI. Other audit types include upgrade
activities.
● Alerts: Alerts raised by the ECS system.
Event data through the ECS Portal is limited to 30 days. If you need to keep event data for longer periods, consider using the
ViPR SRM product.

Monitor audit data


Use the Monitor > Events > Audit tab to view and manage audit data.

About this task


See the List of audit messages.

Steps
1. Select the Audit tab.
2. Optionally, select Filter.
3. Specify a Date Time Range and adjust the From and To fields and time fields. When creating a custom date-time range,
select Current Time to use the current date and time as the end of your range.
4. Select a Namespace.
5. Click Apply.
NOTE: The newest audit messages appear at the top of the table.
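
If you need to keep audit data beyond what the portal retains, the same records can be pulled through the ECS Management REST API and archived elsewhere. The sketch below is illustrative only: it reuses an X-SDS-AUTH-TOKEN obtained from the management login and assumes an events listing endpoint that accepts start time, end time, and namespace query parameters; confirm the exact path, parameter names, and response element names in the ECS Management REST API reference for your release.

# Illustrative sketch: list audit events for a namespace and date-time range.
# The endpoint path, parameter names, and response layout are assumptions to
# be verified against the ECS Management REST API reference.
import requests

MGMT = "https://ecs.example.com:4443"                    # hypothetical endpoint
HEADERS = {
    "X-SDS-AUTH-TOKEN": "<token obtained from /login>",  # see the login sketch earlier
    "Accept": "application/json",
}

params = {
    "start_time": "2021-05-01T00:00",   # beginning of the date-time range
    "end_time": "2021-05-02T00:00",     # end of the date-time range
    "namespace": "ns1",                 # namespace filter, as in the portal
}
resp = requests.get(f"{MGMT}/vdc/events", headers=HEADERS, params=params, verify=False)
resp.raise_for_status()
print(resp.json())   # newest audit messages appear first, as in the portal table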



Audit messages
List of the audit messages generated by ECS.

Table 19. ECS audit messages


Service Audit item Audit message
Alert sent_alert Alert \"${alertMessage}\" with symptom code $
{symptomCode} triggered
Auth Provider new_authentication_provider_added New authentication provider ${resourceId} added
Auth Provider authentication_provider_deleted Authentication provider ${resourceId} deleted
Auth Provider authentication_provider_updated Existing Authentication provider ${resourceId} updated
Bucket bucket_created Bucket ${resourceId} has been created
Bucket bucket_deleted Bucket ${resourceId} has been deleted
Bucket bucket_updated Bucket ${resourceId} has been updated
Bucket bucket_ACL_set Bucket ${resourceId} ACLs have changed
Bucket bucket_owner_changed Owner of ${resourceId} bucket has changed
Bucket bucket_versioning_set Versioning has been enabled on ${resourceId} bucket
Bucket bucket_versioning_unset Versioning has been suspended on ${resourceId} bucket
Bucket bucket_versioning_source_set Bucket ${resourceId} versioning source set
Bucket bucket_metadata_set Metadata on ${resourceId} bucket has been changed
Bucket bucket_head_metadata_set Bucket ${resourceId} head metadata set
Bucket bucket_expiration_policy_set Bucket ${resourceId} expiration policy has updated
Bucket bucket_expiration_policy_deleted Bucket ${resourceId} expiration policy has been deleted
Bucket bucket_cors_config_set Bucket ${resourceId} CORS rules have been changed
Bucket bucket_cors_config_deleted Bucket ${resourceId} CORS rules have been deleted
Bucket notification_size_exceeded_on_bucket Notification size has been exceeded on ${resourceId} bucket
Bucket block_size_exceeded_on_bucket Block size has been exceeded on ${resourceId} bucket
Bucket bucket_set_quota Bucket ${resourceId} quota has been updated with
notification size as ${notificationSize} and block size as $
{blockSize}
Bucket bucket_policy_created Bucket ${resourceId} policy has been created
Bucket bucket_policy_updated Bucket ${resourceId} policy has been updated
Bucket bucket_policy_deleted Bucket ${resourceId} policy has been deleted
Cluster cluster_set Cluster id ${resourceId} has been set
Fabric None InstallerServiceOperation[kind=
INSTALLER_SERVICE_
OPERATION,
host=${hostName},
timestamp=${timestamp},
operationType=${operation},
args=${arguments of operation},
status=SUCCEEDED,
fqdn=${fqdn of host},
version=${installer version}]

Fabric None NodeMaintenanceMode[kind=
NodeMaintenanceMode,
timestamp=${timestamp},
agentId=${agendId},
fqdn=${fqdn},
status=${MaintenanceStatus}]

License user_added_license License ${resourceId} has been added


License managed_capacity_exceeded Managed capacity has exceeded licensed ${resourceId}
capacity
License license_expired License ${resourceId} has expired
Local user domain_group_mapping_created Domain group ${resourceId} to ${roles} role(s) mapping is
added
Local user domain_group_mapping_created_no_roles Domain group ${resourceId} without role mappings is added
Local user domain_group_mapping_updated Domain group ${resourceId} roles mapping is changed to $
{roles} role(s)
Local user domain_group_mapping_updated_no_roles All roles of domain group ${resourceId} mapping have been
removed
Local user domain_user_mapping_created Domain user ${resourceId} to ${roles} role(s) mapping is
added
Local user domain_user_mapping_created_no_roles Domain user ${resourceId} without role mappings is added
Local user domain_user_mapping_deleted Domain user ${resourceId} mapping is removed
Local user domain_user_mapping_updated Domain user ${resourceId} role mapping is changed to $
{roles} role(s)
Local user domain_user_mapping_updated_no_roles All roles of domain user ${resourceId} mapping have been
removed
Local user local_user_created Management user ${resourceId} with ${roles} role(s)has
been created
Local user local_user_created_no_roles Management user ${resourceId} without roles has been
created
Local user local_user_deleted Management user ${resourceId} has been deleted
Local user local_user_password_changed Credential of management user ${resourceId} has changed
Local user local_user_updated Roles of management user ${resourceId} have been changed
to ${roles}
Local user local_user_roles_updated_no_roles All roles of management user ${resourceId} have been
removed
Locked vdc_lock_successful VDC lock was successful
Locked vdc_lock_failed VDC lock failed
Locked node_lock_successful Lock successful for node ${resourceId}
Locked node_lock_failed Lock failed for node ${resourceId}
Locked node_unlock_successful Unlock successful for node ${resourceId}
Locked node_unlock_failed Unlock failed for node ${resourceId}
Login login_successful User ${resourceId} logged in successfully
Login login_failed User ${resourceId} failed to login

Login user_token_logout User logged out token ${resourceId}
Login user_logout All user tokens have logged out
Namespace block_size_exceeded_on_namespace Block size has been exceeded on ${resourceId} namespace
Namespace namespace_admin_group_mappings_updated Namespace ${resourceId} admin group mappings updated to
following groups: ${groups}
Namespace namespace_admin_group_mappings_updated Namespace ${resourceId} admin groups mappings updated to
_no_groups an empty list
Namespace namespace_admin_user_mappings_updated Namespace ${resourceId} admin mappings updated to
following users: ${admins}
Namespace namespace_admin_user_mappings_updated_ Namespace ${resourceId} admin mappings updated to an
no_admins empty list
Namespace namespace_created Namespace ${resourceId} has been created
Namespace namespace_deleted Namespace ${resourceId} has been deleted
Namespace namespace_updated Namespace ${resourceId} has been updated
Namespace notification_size_exceeded_on_namespace Notification size has been exceeded on ${resourceId}
namespace
NFS ugmapping_created ${type} mapping ${ugMappingName} --> ${resourceId} has
been created
NFS ugmapping_deleted ${type} mapping ${ugMappingName} --> ${resourceId} has
been deleted
NFS export_created Export with export path ${exportPath} has been created
NFS export_deleted Export with export path ${exportPath} has been deleted
NFS export_updated Export with export path ${exportPath} has been updated
Replication replication_group_created Replication Group ${resourceId} has been created
Group
Replication replication_group_updated Replication Group ${resourceId} has been updated
Group
Security command_exec_insufficient_permission Attempt to execute a command ${command} from ${host}
without right permissions
SNMP snmp_v2_target_created SNMP target ${snmpTarget} with Community '$
{community}' is added
SNMP snmp_v3_target_created SNMP target ${snmpTarget} with Username '$
{username}', Authentication(${authProtocol}) and Privacy($
{privProtocol})
SNMP snmp_target_deleted SNMP target ${snmpTarget} is deleted
SNMP snmp_engineid_updated SNMP agent EngineID is set to ${engineId}
SNMP snmp_v2_target_updated SNMP target ${oldSnmpTarget} is updated as $
{newSnmpTarget} with Community string ${community}
SNMP snmp_v3_target_updated SNMP target ${oldSnmpTarget} is updated
as ${newSnmpTarget} with Username $
{username}, Authentication(${authProtocol}) and Privacy($
{privProtocol})
Storage Pool storage_pool_created Storage Pool ${resourceId} has been created
Storage Pool storage_pool_deleted Storage Pool ${resourceId} has been deleted

Storage Pool storage_pool_updated Storage Pool ${resourceId} has been updated
Syslog syslog_server_added Syslog server ${protocol}://${host}:${port} with severity $
{severity} is added into the configuration
Syslog syslog_server_updated Syslog server ${old_protocol}://${old_host}:${old_port} is
updated to ${protocol}://${host}:${port} with severity $
{severity} in the configuration
Syslog syslog_server_deleted Syslog server ${protocol}://${host}:${port} is removed from
the configuration
Transformation transformation_created_message Transformation created
Transformation transformation_updated_message Transformation updated
Transformation transformation_pre_check_started_message Transformation precheck started
Transformation transformation_enumeration_started_messag Transformation enumeration started
e
Transformation transformation_indexing_started_message Transformation indexing started
Transformation transformation_migration_started_message Transformation migration started
Transformation transformation_recovery_migration_started_ Transformation recovery migration started
message
Transformation transformation_reconciliation_started_messa Transformation reconciliation started
ge
Transformation transformation_sources_updated_message Transformation sources updated
Transformation transformation_deleted_message Transformation deleted
Transformation transformation_retried_message Transformation %s retried
Transformation transformation_canceled_message Transformation %s canceled
Transformation transformation_profile_mappings_updated_m Transformation profile mappings updated
essage
User change_password_failed User ${resourceId} failed to change password, reason: $
{reason}
User user_created Object user ${resourceId} has been created
User user_deleted Object user ${resourceId} has been deleted
User user_set_password New password has been set for object user ${resourceId}
User user_delete_password Password has been deleted for object user ${resourceId}
User user_set_metadata New metadata has been set for object user ${resourceId}
User user_locked Object user ${resourceId} has been locked
User user_unlocked Object user ${resourceId} has been unlocked
User user_set_user_tag User Tag has been set for object user ${resourceId}
User user_delete_user_tag User Tag has been deleted for object user ${resourceId}



Monitor alerts
You can use the Monitor > Events > Alerts tab to view and manage system alerts.

About this task


See the list of alert messages.
Alert message Severity labels have the following meanings:
● Critical: Messages about conditions that require immediate attention
● Error: Messages about error conditions that report either a physical failure or a software failure
● Warning: Messages about less than optimal conditions
● Info: Routine status messages

Steps
1. Select Alerts.
2. Optionally, click Filter.
3. Select your filters. The alerts filter adds filtering by Severity and Type, and an option to Show Acknowledged Alerts,
which retains the display of an alert even after it is acknowledged by the user. When creating a custom date-time range,
select Current Time to use the current date and time as the end of your range.
Alert types must be entered exactly as described in the following table:

Table 20. Alert types


Alert Type (type exactly as Description
shown)
Fabric Raised when system issues are detected.
Geo Raised for geo-replication alerts.
License Raised for license, capacity, or capacity entitlement exceeded alerts.
Notify Raised for miscellaneous alerts.
Quota Raised when soft or hard quota limits are exceeded (SoftQuotaLimitExceeded or
HardQuotaLimitExceeded) for a bucket or for a namespace.
RPO Raised when the recovery point objective (RPO) is greater than the RPO threshold.
Capacity Alerting Raised when the remaining capacity of the storage pool reaches a set threshold.
Capacity License Threshold Raised if the system capacity is greater than the licensed capacity.
CHUNK_NOT_FOUND Raised when chunk data is not found.
DTSTATUS_RECENT_FAILURE Raised when the status of a data table is bad.

Table 21. ESRS dial home types


Alert Type (type exactly as Description
shown)
TestDialHome Raised to test that ESRS connections can be established and that the call home
functionality works.
4. Select a Namespace.
5. Click Apply.
6. Next to each event, click the Acknowledge Alert button to acknowledge and dismiss the message. Messages that
have previously been acknowledged will display when the Show Acknowledged Alerts filter option is selected, but the
Acknowledge Alert button will not be displayed for these rows.
7. You can click the Description of an alert, when it is formatted as a link, to be taken to a relevant page in the portal.
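
Alerts can also be collected for external processing, for example to page an on-call team only for Critical and Error severities. The sketch below is illustrative: the alerts endpoint path, its query behavior, and the response element and field names are assumptions to be checked against the ECS Management REST API reference; the severity values mirror the labels described above.

# Illustrative sketch: poll alerts and keep only the higher severities.
# The endpoint path and the response element/field names are assumptions;
# verify them in the ECS Management REST API reference before relying on this.
import requests

MGMT = "https://ecs.example.com:4443"                    # hypothetical endpoint
HEADERS = {
    "X-SDS-AUTH-TOKEN": "<token obtained from /login>",
    "Accept": "application/json",
}

resp = requests.get(f"{MGMT}/vdc/alerts", headers=HEADERS, verify=False)
resp.raise_for_status()
payload = resp.json()

# The element holding the alert list and the severity casing may differ by release.
for alert in payload.get("alert", []):
    if alert.get("severity") in ("CRITICAL", "ERROR", "Critical", "Error"):
        print(alert.get("severity"), alert.get("type"), alert.get("description"))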



Alert policy
Alert policies are created to alert about metrics, and are triggered when the specified conditions are met. Alert policies are
created per VDC.
You can use the Settings > Alerts Policy page to view alert policies.
There are two types of alert policy:

System alert ● System alert policies are precreated and exist in ECS during deployment.
policies ● All the metrics have an associated system alert policy.
● System alert policies cannot be updated or deleted.
● System alert policies can be enabled/disabled.
● Alert is sent to the UI and all channels (SNMP, SYSLOG, and Secure Remote Services).

User-defined ● You can create User-defined alert policies for the required metrics.
alert policies ● Alert is sent to the UI and customer channels (SNMP and SYSLOG).

New alert policy


You can use the Settings > Alerts Policy > New Alert Policy tab to create user-defined alert policies.

Steps
1. Select New Alert Policy.
2. Give a unique policy name.
3. Use the metric type drop-down menu to select a metric type.
Metric Type is a grouping of statistics. It consists of:
● Btree Statistics
● CAS GC Statistics
● Geo Replication Statistics
● Metering Statistics
● Garbage Collection Statistics
● EKM
4. Use the metric name drop-down menu to select a metric name.
5. Select level.
a. To inspect metrics at the node level, select Node.
b. To inspect metrics at the VDC level, select VDC.
6. Select polling interval.
Polling Interval determines how frequently the data is checked. Each polling interval gives one data point, which is
compared against the specified condition; when the condition is met, an alert is triggered.
7. Select instances.
Instances describe how many data points to check and how many should match the specified conditions to trigger an alert.
For metrics where historical data is not available, only the latest data is used.

8. Select conditions.
You can set the threshold values and alert type with Conditions.
The alerts can be either a Warning Alert, Error Alert, or Critical Alert.

9. To add more conditions with multiple thresholds and with different alert levels, select Add Condition.
10. Click Save.
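
Taken together, a user-defined alert policy is a named rule that samples one metric at the chosen level on every polling interval and raises an alert once enough of the sampled data points breach a condition. The sketch below only illustrates how those fields relate to one another; the field names are hypothetical and do not represent an ECS API payload or configuration file. The example thresholds echo the metering latency thresholds shown in the alert tables later in this chapter.

# Illustrative only: how the New Alert Policy fields fit together.
# These field names are hypothetical; this is not an ECS API payload.
alert_policy = {
    "name": "vdc-read-latency-watch",        # unique policy name (step 2)
    "metric_type": "Metering Statistics",    # grouping of statistics (step 3)
    "metric_name": "Read latency (ms)",      # metric within that group (step 4)
    "level": "VDC",                          # Node or VDC (step 5)
    "polling_interval_minutes": 5,           # how often data is checked (step 6)
    "instances": {"datapoints": 6, "matches_required": 4},  # step 7
    "conditions": [                          # thresholds and alert types (steps 8-9)
        {"threshold_ms": 250, "alert_type": "Warning"},
        {"threshold_ms": 500, "alert_type": "Error"},
        {"threshold_ms": 1000, "alert_type": "Critical"},
    ],
}

def evaluate(samples_ms, policy):
    """Return the most severe alert type whose threshold is breached often enough."""
    required = policy["instances"]["matches_required"]
    for condition in reversed(policy["conditions"]):       # check Critical first
        breaches = sum(1 for s in samples_ms if s > condition["threshold_ms"])
        if breaches >= required:
            return condition["alert_type"]
    return None

# Six polled data points; five exceed 250 ms, so a Warning alert would fire.
print(evaluate([260, 270, 255, 900, 240, 1200], alert_policy))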



Acknowledge all alerts
Alerts can be acknowledged individually or in bulk using the Acknowledge All Alerts button. You can acknowledge all the
alerts or acknowledge a subset of them using filters.

About this task


You can use the Monitor > Events > Alerts tab to acknowledge alerts.

Steps
1. To acknowledge all alerts, click the Acknowledge All Alerts button.
a. To acknowledge a subset of all alerts, use the table filter to filter by a combination of date and time, severity, type, or
namespace, and then click Acknowledge All Alerts.
The bulk alert acknowledgment process runs in the background and may take a few minutes to complete. Only one bulk alert
acknowledgment can be processed at a time.
2. On the confirmation pop-up screen, click OK to initiate acknowledgment, or click Cancel to exit without acknowledging.
Clicking Acknowledge All Alerts initiates a background task that acknowledges all the matching alerts. The response
shows either that the task was successfully initiated or that it failed.

To keep a record of the acknowledge all alerts request, a new informational alert of type Bulk Alert Ack will be generated
after the acknowledgment completes. Clear the filter and manually refresh the table.

Alert messages
List of the alert messages that ECS uses.
Alert message Severity labels have the following meanings:
● Critical: Messages about conditions that require immediate attention
● Error: Messages about error conditions that report either a physical failure or a software failure
● Warning: Messages about less than optimal conditions
● Info: Routine status messages

Table 22. ECS Object alert messages


Alert Severity Symptom Sent to... Message Description Action
code
Object Warning 1020 Portal, API, Object %s in bucket The alert will be Contact ECS
version limit SNMP Trap, %s is approaching generated when the user Remote Support
warning Syslog the per object writes to an object with
version limit of a version count .
%d. Current version ● >80% of the version
count is above %d. limit on user channel
● >90% of the version
limit on Both
channels
Btree chunk Warning 1321 Portal, API, System metadata Event trigger source Contact ECS
level GC Secure garbage ● Example: Reclaimed Remote Support
Remote reclamation Btree Garbage is
Services, throughput is too less than 10% of
SNMP Trap, slow to catch the remaining BTree
Syslog up with garbage garbage as BTree GC
detection. is slow at Chunk
reclamation.
● This condition has
persisted for last
7 days, leading to
creation of this alert.

● Derived it from
formula:
Full_Garbage > 1TB,
and
Garbage_Detected_
Rate -
Garbage_Chunk_Rec
laim_Rate > 100GB
Btree disk Warning 1325 Portal, API, Capacity free-up Event trigger source Contact ECS
level GC Secure throughput is too ● Example: Reclaimed Remote Support.
Remote slow to catch Btree Garbage is less
Services, up with system than 10% of the Full
SNMP Trap, metadata garbage garbage, as BTree
Syslog reclamation. GC is slow at disk
level reclamation.
● This condition has
persisted for last
7 days, leading to
creation of this alert.
● Derived from
formula: if
Garbage_Pending_D
elete > 1TB, and
Garbage_Chunk_Rec
laim_Rate -
Garbage_Capacity_R
eclaim_Rate > 100GB
Btree partial Warning 1329 Portal, API, Partial GC for Event trigger source Contact ECS
GC Secure system metadata is ● Example: Rate of Remote Support.
Remote too slow. Btree Partial GC
Services, conversion to full
SNMP Trap, Garbage is less
Syslog than 10% of the
Partial GC eligible for
Conversion.
● Btree partial GC
works too slow
to convert partial
garbage into full
garbage.
● This condition has
persisted for last
7 days, leading to
creation of this alert.
● Derived from
formula : If
Partial_Eligible_Garb
age > 1TB, and
Partial_To_Full_Conv
ert_Rate < 100GB
Bucket hard Error 1006 Portal, API, HardQuotaLimitExc Contact ECS
quota SNMP Trap, eeded: bucket Remote Support
Syslog {bucket_name}
Bucket soft Warning 1008 Portal, API, SoftQuotaLimitExce Contact ECS
quota SNMP Trap, eded: bucket Remote Support
Syslog {bucket_name}

Capacity Warning 1111 Portal, API, Storage pool The severity of the Contact ECS
alerting SNMP Trap, {Storage pool} has alert depends on how Remote Support
Error 1112 Syslog {id}% remaining close the remaining
capacity meeting storage pool capacity
Critical 1113 threshold of {id}%. is to reaching the
configured threshold.
Capacity alerting is not
set by default: set
capacity alerts to receive
them. You can set them
by editing an existing
storage pool or when
you create a storage
pool.
Capacity Warning 1100 Portal, API, Used Capacity of The configured threshold Contact ECS
exceeded SNMP Trap, the VDC exceeded is set at 80% of the Remote Support
threshold Syslog configured Used Capacity of the representative
threshold, current VDC by default. to determine
usage is {usage}%. CAUTION: If the the appropriate
used capacity solution.
reaches 90%, you
cannot write or
modify object data.

Capacity Error 997 Portal, API, Licensed Capacity The capacity of the Contact ECS
license Secure Entitlement system is greater than Remote Support
threshold Remote Exceeded Event was licensed.
Services, Trap,
Syslog
Chunk not Error 1004 Portal, API, chunkId {chunkId} Contact ECS
found Secure not found Remote Support
Remote
Services,
SNMP Trap,
Syslog
CPU Usage Warning 4001 Portal, API, CPU usage is $ If CPU usage percent Contact ECS
Percent SNMP Trap, {inspectorValue}% crosses the threshold Remote Support
Error 4002 Syslog crosses threshold $ specified then the alert
{thresholdValue}% is triggered.
Critical 4003

Data Error 1500 Portal, ESRS, Data Migration has Data migration has no Contact ECS
Migration SNMP Trap, no movement for $ progress for several Remote Support
Blocked Syslog, SMTP {configured} hours hours.
for a device and
level (default 6
hours).

NOTE: For the Data Migration Finished alert, ignore the Warning severity. The severity is supposed to be Info.

Data Warning 1501 Portal, ESRS, Data Migration is Data migration is Contact ECS
Migration SNMP Trap, complete for a complete. Remote Support
Finished Syslog, SMTP device and level.
Disabled CAS Info 1316 Portal, API, CAS Processing is ● CAS GC is Content Contact ECS
GC Secure paused. Addressable Storage Remote Support
Warning 1317 Remote Garbage Collection. representative
Services, ● CAS GC is disabled. to determine
Error 1318


Critical 1319 SNMP Trap, the appropriate


Syslog solution.
Disk Info 2031 Portal, API, Entering Disk was unmounted
unmounted SNMP Trap, maintenance mode
Syslog on node {fqdn}
DT init failure Error 3001 Portal, API, ● There are at ● DT is a directory Contact ECS
Secure least {number} table. Remote Support
Remote DT(s) being ● The default value is
Services, unready/ set at 8 DTs for this
SNMP Trap, unhealthy in last alert to trigger.
Syslog {number}
rounds of DT
status check for
{dt_category}
● DT {directoryId}
failed to initialize
due to
{error_type}
EKM Server Warning 1361 Portal, API, ● The server Contact ECS
Certificate Secure certificate for Remote Support
Expiry Error 1362 Remote EKM server
Services, expires in 30
SNMP Trap, days. Renew the
Syslog certificate.
● The server
certificate for
EKM server
expires in 7
days. Renew the
certificate.
EKM Server Warning 1369 Portal, API, The EKM server Contact ECS
Connection Secure is not responding. Remote Support
Status Error 1370 Remote Ensure that the
Services, server is connected.
SNMP Trap,
Syslog
First Byte Warning 4009 Portal, API, First Byte Latency If TTFB for read latency Contact ECS
Latency For SNMP Trap, for Read is $ crosses the threshold Remote Support
Read Error 4010 Syslog {inspectorValue}ms specified then the alert
crosses threshold $ is triggered.
4011 {thresholdValue}ms

GC Status Warning 1345 Portal, API, Space reclamation Contact ECS


Secure for user data/ Remote Support.
Remote system metadata is
Services, disabled.
SNMP Trap,
Syslog Make sure it
is disabled for
temporary purpose,
and re-enable it
when ready.

Last Byte Warning 4013 Portal, API, Last Byte Latency If TTLB for write latency Contact ECS
Latency For SNMP Trap, for Write is $ crosses the threshold Remote Support
Write Error 4014 Syslog {inspectorValue}ms specified then the alert
is triggered.


Critical 4015 crosses threshold $


{thresholdValue}ms
License Info 998 Portal, API, Expiration event
expiration Secure
Remote
Services,
SNMP Trap,
Syslog
License Info 100 Portal, API, Registration Event
registration Secure
Remote
Services,
SNMP Trap,
Syslog
Memory Warning 1349 Portal, API, For cm process Contact ECS
outside Btree Secure memory of X bytes Remote Support
writes cache Remote is allocated outside
Services, Btree write cache
SNMP Trap, on node <Node IP>.
Syslog
Metering Warning 1205 Portal, API, Read latency is Contact ECS
read latency Secure 300 millisecond, Remote Support.
Error 1206 Remote crosses threshold
Services, 250 millisecond.
Critical 1207 SNMP Trap,
Syslog Read latency is
505 millisecond,
crosses threshold
500 millisecond.

Read latency is
1050 millisecond,
crosses threshold
1000 millisecond.

Metering Warning 1205 Portal, API, Write latency is Contact ECS


write latency Secure 300 millisecond, Remote Support.
Error 1206 Remote crosses threshold
Services, 250 millisecond.
Critical 1207 SNMP Trap,
Syslog Write latency is
555 millisecond,
crosses threshold
500 millisecond.

Write latency is
1500 millisecond,
crosses threshold
1000 millisecond.

Monitoring Critical 4016 Portal, API, Data recorded in Contact ECS


Health Secure TSDB is lagging Remote Support
4017 Remote by {thresholdValue}
Services, mins on node x.x.x.x
4018 SNMP Trap,
Syslog

Namespace Error 1005 Portal, API, HardQuotaLimitExc Contact ECS
hard quota SNMP Trap, eeded: Namespace Remote Support
Syslog {namespace}
Namespace Warning 1009 Portal, API, SoftQuotaLimitExce Contact ECS
soft quota SNMP Trap, eded: Namespace Remote Support
Syslog {namespace}
Notification Any Any User-defined Custom message that is
message. defined and provided by
the user.
Process Error 1354 Portal, API, Memory table size Contact ECS
memory Secure for blob process is Remote Support.
table free Remote X % less than the
space Services, specified threshold
percent SNMP Trap, of Y % on <node
Syslog IP>.
Repo chunk Warning 1333 Portal, API, User garbage Event trigger source Contact ECS
level GC Secure collection ● Example: Repo Chunk Remote Support.
Remote throughput is too reclamation rate is
Services, slow to catch less than 10% of the
SNMP Trap, up with garbage remaining garbage.
Syslog detection. ● This condition has
persisted for last
7 days, leading to
creation of this alert.
● Derived from
formula:
Full_Garbage > 10TB,
and
Garbage_Detected_
Rate -
Garbage_Chunk_Rec
laim_Rate > 100GB
Repo disk Warning 1337 Portal, API, Capacity free-up Event trigger source Contact ECS
level GC Secure throughput is too ● Example: Repo disk Remote Support.
Remote slow to catch up level GC reclamation
Services, with user garbage rate is less than 10 %
SNMP Trap, collection. of Garbage pending
Syslog delete at disk level.
● This condition has
persisted for last
7 days, leading to
creation of this alert.
● Derived from
formula: If
Garbage_Pending_D
elete > 10TB, and
Garbage_Chunk_Rec
laim_Rate -
Garbage_Capacity_R
eclaim_Rate > 100GB
Repo partial Warning 1341 Portal, API, Partial GC for user Event trigger source Contact ECS
GC Secure garbage is too slow. ● Example: Repo Partial Remote Support.
Remote repo GC works too
Services, slow to convert

SNMP Trap, partial garbage into
Syslog full garbage.
● This condition has
persisted for last
7 days, leading to
creation of this alert.
● Derived from
formula: If
Partial_Eligible_Garb
age > 10TB, and
Partial_To_Full_Conv
ert_Rate < 100GB
RPO Warning 1012 Portal, API, RPO for replication The recovery point Contact ECS
Secure group {RG} is {HH} objective (RPO) is Remote Support
Remote hour {SS} seconds greater than the RPO
Services, Trap, greater than {HH} threshold. The default
Syslog hour threshold set. value is one hour.
Slow CAS Info 1312 Portal, API, CAS Processing CAS GC cleanup tasks Contact ECS
GC Object Secure object cleanup are lagging. Remote Support
Cleanup Warning 1314 Remote speed is slow.
Services,
Error 1315 SNMP, Trap,
Critical Syslog

Slow CAS Info 1308 Portal, API, CAS Processing CAS GC reference Contact ECS
GC Secure reference collection collection tasks are Remote Support
Reference Warning 1309 Remote speed is slow. lagging.
Collection Services,
Error 1310 SNMP, Trap,
Critical 1311 Syslog

Slow Journal Info 1304 Portal, API, Journal parsing Journal parsing speed is Contact ECS
Parsing Secure speed is slow. slow. Remote Support
Warning 1307 Remote
Services,
Error SNMP, Trap,
Critical Syslog

Space Usage Warning 4005 Portal, API, Disk space usage is If Disk usage percent Contact ECS
Percent SNMP, Trap, ${inspectorValue}% crosses the threshold Remote Support
Error 4006 Syslog crosses threshold $ specified then the alert
{thresholdValue}% is triggered.
Critical 4007

SSD Read Error 1392 Portal, API, SSD read cache SSD read cache fall back Contact ECS
Cache Secure capacity auto clean to memory cache after Remote Support
Capacity Remote up failed >= clean up failed when
Failure Services, ${inspectorValue} capacity full.
SNMP, Trap, times on node $
Syslog {node} after SSD
capacity exceeded
threshold for
${resourceName}
process. Result:
SSD Read cache
on node ${node} is
disabled.



Table 23. ECS fabric alert messages
Alert Severity Symptom Sent to... Message Description Action
code
Cluster CA Critical 2070 Portal, API, Cluster CA ● Alert will be sent
certificate SNMP Trap, certificate will when days between
will expire Syslog, Secure expire in {days} certificate expiration
Remote days. date and current date
Services are less or equal to
60, and greater or
equal to 14, and days
between last sent
and current date are
greater or equal to 7.
● Alert will be sent
when days between
certificate expiration
date and current date
are less or equal to
14, and days between
last sent and current
date are greater or
equal to 1.
● Alert will be sent
when days between
certificate expiration
date and current date
are less or equal to
0, and days between
last sent and current
date are greater or
equal to 1.
NOTE: All of the above conditions are checked once a day.

Low Life Info 2064 Portal, API, Node SN={node NVMe SSD endurance
Remaining SNMP Trap, SN}: Disk level is less than
Syslog, Secure SN=${disk sn} threshold level.
Remote in rack={rack},
Services node={fqdn},
slot={slot number}
has life remaining
below threshold
{threshold level} .
Disk Details:
Type={SSD/
NVMe},
Model={vendor
model},
Size={disk size}
GB, Firmware=$
{firmware version}
NVME_BAD Error 1389 Portal, API, No memory to Memory pool initiation
_MEMORY_ SNMP Trap, allocate to buffer failed or memory alloc
ERROR Syslog, Secure for nvmeengine, for read.
Remote node={publicip},
Services failedCount={count
}
NVME_DEVI Error 1390 Portal, API, NVMe partition Nvme device initiation
CE_INIT_FAI SNMP Trap, {uuid} initialization failed.
LED_ERROR Syslog, Secure on target node

Remote {privateip} failed,
Services host
node={publicip},
failedCount={count
}
NVME_PRIV Error 1393 Portal, API, nvmeengine cannot Raise alert when
ATE_IP_UN SNMP Trap, acquire the ip for private.4 ip is not
AVAILABLE_ Syslog, Secure network interface available.
ERROR Remote private.4 on node
Services = {node_name}
[Ref_ID:
NvmePrivateIp]
NVME_BIND Error 1395 Portal, API, failed to bind Failed to bind for
_FAILED_ER SNMP Trap, restserver, restserver, suspect port
ROR Syslog, Secure listenPort={port}, occupied by other
Remote private.4={ip}, process.
Services failedCount={count
}
NVME_CLEA Error 1396 Portal, API, Target Delete incomplete target
N_CONFIG_ SNMP Trap, configuration delete configuration failed,
FAILED_ERR Syslog, Secure failed for partition = suspect permission or
OR Remote {targetuuid} in driver issue.
Services {directory path}
node={publicip},
failedCount={count
}
NVME_ABN Error 1397 Portal, API, nvmetargetviewer nvmetargetviewr could
ORMAL_TAR SNMP Trap, could not refresh not refresh target from
GETVIEWER Syslog, Secure target from engine, engine, check engine and
_ERROR Remote node={publicip}, json dump file status.
Services failedCount={count
}
NVME_POR Warning 1398 Portal, API, nvme target config NVME target Contact ECS
T_RANGE_E SNMP Trap, port range reached configuration port range Remote Support
XHAUSTED_ Syslog, Secure max, reached maximum.
WARNING Remote node={publicip},
Services failedCount={count
}
NVME_TAR Error 1399 Portal, API, Failed to configure Failed to configure
GET_CONFI SNMP Trap, NVMe-oF target for NVMe-oF target
G_FAILED_E Syslog, Secure partition {uuid}, for partition {uuid},
RROR Remote node={nodepublicip node={nodepublicip}.
Services }, reason={reason},
failedCount={count
}
NVME_TAR Error 1400 Portal, API, Failed to error Failed to error handle.
GET_ERROR SNMP Trap, handle NVMe-oF
_HANDLE_F Syslog, Secure target, partition
AILED_ERRO Remote {uuid},
R Services node={nodepublicip
},
failedCount={count
}

NVME_TAR Error 1401 Portal, API, Nvmetargetviewer Nvmetargetviewer
GET_VIEWE SNMP Trap, cannot report any cannot report any
R_PARTITIO Syslog, Secure partition, partition.
N_FAILED_E Remote node={nodepublicip
RROR Services }
failedCount={count
}
Disk Ready Info 2061 Portal, API, Node SN={node Disk with
for SNMP Trap, sn} Disk SN=${disk SUSPECT/BAD health is
Replacement Syslog, Secure sn} in rack={rack}, stopped using by object
Remote node={fqdn}, service, is unmounted
Services slot={slot number} and is ready to be
is ready replaced.
for replacement.
Disk Details:
Type={disk type},
Model={vendor
model},
Size={disk size}
GB, Firmware=$
{firmware version}.
Disk Failed Error 2062 Portal, API, Node SN={node sn} Disk started to have
Replace SNMP Trap, Disk SUSPECT/BAD health,
Process Syslog, Secure SN={diskSerialNum Fabric started process to
Remote ber} in rack={rack}, remove that disk from
Services node={fqdn}, usage, but something
slot={slot} cannot went wrong.
be removed. Disk
Details: Type= {disk
type} ,
Model={Vendor
Model}, Size={size}
GB,
Firmware={firmwar
e}, reason: {reason}
Disk Missing Error 2063 Portal, API, Disk Fabric could not detect
SNMP Trap, SN={diskSerialNum an assigned disk or its
Syslog, Secure ber} in rack={rack}, partition.
Remote node={fqdn},
Services slot={slot} is
missing. Disk
Details:
Type={HDD/SSD},
Model={Vendor
Model}, Size={size}
GB,
Firmware={firmwar
e}, reason: {reason}
Disk Info 2065 Portal, API, Node SN={Node Disk has been replaced
Successfully SNMP Trap, SN}: Disk successfully.
Replaced Syslog, Secure SN={diskSerialNum
Remote ber} on
Services rack={rack},
node={fqdn},
slot={slot number}
has been
successfully
replaced. Disk

Details: Type={disk
type},
Model={vendor
model}, Size={disk
size} GB,
Firmware={firmwar
e version}.
Disk added Info 2019 Portal, API, Disk Disk was added.
SNMP Trap, {diskSerialNumber}
Syslog on node {fqdn} was
added.
Disk failure Critical 2002 Portal, API, Disk SN= Health of disk that is
SNMP Trap, {diskSerialNumber} changed to BAD.
Syslog on rack={rack},
node= {fqdn} ,
slot={slot number}
has FAILED. Disk
Details: Type={disk
type}, Model='{VID
PID}', Size='{disk
size} GB',
Firmware={firmwar
e version}"
Disk good Info 2025 Portal, API, Disk Disk was revived.
SNMP Trap, {diskSerialNumber}
Syslog on node {fqdn} was
revived.
Disk Info 2035 Portal, API, Disk Disk was mounted.
mounted SNMP Trap, {diskSerialNumber}
Syslog on node {fqdn} has
mounted.
Disk removed Info 2020 Portal, API, Disk Disk was removed.
SNMP Trap, {diskSerialNumber}
Syslog on node {fqdn} was
removed.
Disk suspect Error 2003 Portal, API, Disk SN= Health of disk that is
SNMP Trap, {diskSerialNumber} changed to SUSPECT.
Syslog on rack={rack},
node= {fqdn} ,
slot={slot number}
has SUSPECTED.
Disk Details:
Type={disk type},
Model='{VID PID}',
Size='{disk size}
GB',
Firmware={firmwar
e version}"
Disk Warning 2036 Portal, API, Disk Disk was unmounted.
unmounted SNMP Trap, {diskSerialNumber}
Syslog on node {fqdn} has
unmounted.
Docker Critical 2022 Portal, API, Container Configure script
container SNMP Trap, {containerName} returned nonzero exit
Syslog, Secure configuration has code.

configuration Remote failed on node The configure script is
failure Services {fqdn} with exit provided by object and
code {exitCode} called by fabric on object
{happenedOn}. container start-up. It is
only applicable for the
object container.

Docker Warning 2017 Portal, API, Container Container paused Contact ECS
container SNMP Trap, {containerName} Remote Support
paused Syslog has paused on node
{fqdn}.
Docker Info 2016 Portal, API, Container Container moved to
container SNMP Trap, {containerName} is running state.
running Syslog up on node {fqdn}.
Docker Error 2015 Portal, API, Container Container stopped Contact ECS
container SNMP Trap, {containerName} Remote Support
stopped Syslog has stopped on
node {fqdn}.
Events Error 2038 Portal, API, Events cannot be Verify configuration of
cannot be Secure delivered through the channel for which
delivered. Remote {SMTP|ESRS} and the alert is.
Services, lost.
SNMP Trap,
Syslog
Firewall Bad 2051 Portal, API, Firewall health is Rules or ip sets do not Contact ECS
health is BAD Secure BAD! {reason} exist, system firewall is Remote Support
or SUSPECT Suspect 2052 Remote off, ip tables or ip set
Services, Firewall health is utils do not exist.
SNMP Trap, SUSPECT! {reason}
Syslog Rules or ip sets do not
exist, trying to recover.

Fabric agent Error 2014 Portal, API, FabricAgent has Fabric agent health is
suspect SNMP Trap, suspected on node suspect.
Syslog {fqdn}.
Net interface Critical 2023 Portal, API, Net interface Fabric's net interface is Contact ECS
health down SNMP Trap, {$netInterfaceNam down. Remote Support
Syslog, Secure e}[ on node
Remote $FQDN] is
Services down[ with IP
address $IP]".
Net interface Info 2024 Portal, API, Net interface Fabric's net interface is
health up SNMP Trap, {$netInterfaceNam up.
Syslog, Secure e}[ on node
Remote $FQDN] is up[ with
Services IP address $IP]".
Net interface Critical 2026 Portal, API, Net interface Net interface is down for
permanent Secure {$netInterfaceNam at least 10 minutes.
down Remote e}[ on node
Services $FQDN] is
permanently
down[ with IP
address $IP].

Net interface Info 2027 Portal, API, Net interface's Fabric's net interface IP
IP address SNMP Trap, {netInterfaceName} address changed
updated Syslog IP address on node
{fqdn} was changed
to {newIpAddress}.
Node failure Critical 2006 Portal, API, Node {fqdn} has Node is not reachable for
SNMP Trap, failed. 30 minutes.
Syslog, Secure
Remote
Services
Node up Info 2018 Portal, API, Node {fqdn} is up. Node moved to 'up'
SNMP Trap, state after it was down
Syslog for at least 15 minutes.
Root file Warning 2039 Portal, API, Thresholds ● Threshold between Contact ECS
system filling SNMP Trap, exceeded, usable 15G and 10G triggers Remote Support
on node Critical 2042 Syslog, Secure space on root fs warning.
Remote <BYTES> are less ● Threshold Less than
Services than threshold for 10G of free space
<LEVEL> level on results in Critical
node <NODE> alert.
Slot Critical 2021 Portal, API, Container Container stopped/
permanent SNMP Trap, {containerName} is paused or not started at
down Syslog, Secure permanently down all for at least 10 minutes
Remote on node {fqdn}.
Services
Service Critical 2011 Portal, API, Service Health Service failed
failure Syslog, Secure Failure Event
Remote
Services

Table 24. Secure Remote Services alert messages


Alert Severity Symptom code Sent to... Description
TestDialHome N/A TestDialHome Secure Remote Tests that Secure Remote Services connections can
Services be established and that the call home functionality
works.



4
Advanced Monitoring
Advanced Monitoring dashboards provide critical information about the ECS processes on the VDC you are logged in to.
Topics:
• Advanced Monitoring
• Flux API
• Dashboard APIs

Advanced Monitoring
Advanced Monitoring dashboards provide critical information about the ECS processes on the VDC you are logged in to.
The advanced monitoring dashboards are based on a time series database and are provided by Grafana, a well-known
open-source time series analytics platform.
Refer to the Grafana documentation for basic details about navigating Grafana dashboards.
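
In addition to the interactive dashboards, related node and VDC metrics are exposed over REST (see the Dashboard APIs topic later in this chapter). The short sketch below assumes the standard management login flow on port 4443 and uses an illustrative dashboard path; confirm the exact paths in the Dashboard APIs section or the ECS Management REST API reference for your release.

# Minimal sketch: periodically sample per-node dashboard metrics over the
# management API. The path below is illustrative; see the Dashboard APIs topic
# for the exact paths supported by your release.
import time
import requests

MGMT = "https://ecs.example.com:4443"        # hypothetical management endpoint
HEADERS = {
    "X-SDS-AUTH-TOKEN": "<token obtained from /login>",
    "Accept": "application/json",
}

for _ in range(3):                           # a few sample polls
    resp = requests.get(f"{MGMT}/dashboard/zones/localzone/nodes",
                        headers=HEADERS, verify=False)
    resp.raise_for_status()
    print(resp.json())                       # forward to an external monitoring system here
    time.sleep(60)                           # align with the dashboard refresh cadence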

View Advanced Monitoring Dashboards


To view the advanced monitoring dashboards in the ECS Portal, select Advanced Monitoring. The Data Access Performance -
Overview dashboard is the default.

Table 25. Advanced monitoring dashboards


Dashboard Description
Data Access Performance - Overview You can use the Data Access Performance - Overview dashboard to
monitor VDC data.
Data Access Performance - by Namespaces You can use the Data Access Performance - by Namespaces dashboard
to monitor performance data for individual namespace or group of
Namespaces.
Data Access Performance - by Nodes You can use the Data Access Performance - by Nodes dashboard to see
performance data for individual node or group of nodes in a VDC.
Data Access Performance - by Protocols You can use the Data Access Performance - by Protocols dashboard to
see performance data for each supported protocol (S3, ATMOS, SWIFT) or
set of protocols.
Disk Bandwidth - by Nodes You can use the Disk Bandwidth - by Nodes dashboard to monitor the
disk usage metrics by read or write operations at the node level. The
dashboard displays the latest values.
NOTE: For the Disk Bandwidth - by Nodes dashboard, the consistency checker
metric shows data only for read operations; it is not relevant for writes.

Disk Bandwidth - Overview You can use the Disk Bandwidth - Overview dashboard to monitor the
disk usage metrics by read or write operations at the VDC level.
NOTE: For the Disk Bandwidth - Overview dashboard, the consistency checker
metric shows data only for read operations; it is not relevant for writes.

Node Rebalancing You can use the Node Rebalancing dashboard to monitor the status of
data rebalancing operations when nodes are added to, or removed from, a
cluster. Node rebalancing is enabled by default at installation. Contact your
technical support representative to disable or reenable this feature.

Process Health - by Nodes You can use the Process Health - by Nodes dashboard to monitor,
for each node of the VDC, the use of network interface, CPU, and available
memory. The dashboard displays the latest values, and the history graphs
display values in the selected range.
Process Health - Overview You can use the Process Health - Overview dashboard to monitor the
VDC use of network interface, CPU, and available memory. The dashboard
displays the latest average values, and the history graphs display values in
the selected time range.
Process Health - Process List by Node You can use the Process Health - Process List by Node dashboard to
monitor each process's use of CPU and memory, average thread count, and last
restart time in the selected time range. The dashboard displays the latest
values in the selected time range.
Recovery Status You can use the Recovery Status dashboard to monitor the data
recovered by the system.
SSD Read Cache You can use the SSD Read Cache dashboard to monitor total SSD disk
capacity and disk space that is used by SSD read cache.
Tech Refresh: Data Migration You can use the Tech Refresh: Data Migration dashboard to monitor the
data migration off and on a node or cluster.
Top Buckets You can use the Top Buckets dashboard to monitor the number of buckets
with top utilization that is based on total object size and count.

Table 26. Advanced monitoring dashboard fields


Dashboard Field Description
● Data Access Performance - Related Allows you to switch to other dashboards in the data access performance
Overview dashboards group, keeping the selected time range.
● Data Access Performance - by
Namespaces
● Data Access Performance - by
Nodes
● Data Access Performance - by
Protocols
● Data Access Performance - Transaction Lists the total Successful requests, System Failures, User Failures,
Overview Summary and Failure % Rate for the selected VDCs, namespaces, nodes, or
● Data Access Performance - by protocols.
Namespaces
● Data Access Performance - by
Nodes
● Data Access Performance - by
Protocols
● Data Access Performance - Performance Lists the latest values of data access bandwidth and latency of read/
Overview Summary write requests for selected range.
● Data Access Performance - by
Nodes
● Data Access Performance - Successful The number of data requests that were successfully completed.
Overview requests
● Data Access Performance - by
Namespaces
● Data Access Performance - by
Nodes
● Data Access Performance - by
Protocols

● Data Access Performance - System Failures The number of data requests that failed due to hardware or service
Overview errors. System failures are failed requests that are associated with
● Data Access Performance - by hardware or service errors (typically an HTTP error code of 5xx).
Namespaces
● Data Access Performance - by
Nodes
● Data Access Performance - by
Protocols
● Data Access Performance - User Failures The number of data requests from all object heads that are classified as
Overview user failures. User failures are known error types originating from the
● Data Access Performance - by object heads (typically an HTTP error code of 4xx).
Namespaces
● Data Access Performance - by
Nodes
● Data Access Performance - by
Protocols
● Data Access Performance - Failure % Rate The percentage of failures for the VDC, namespace, nodes, or
Overview protocols.
● Data Access Performance - by
Namespaces
● Data Access Performance - by
Nodes
● Data Access Performance - by
Protocols
● Data Access Performance - TPS (success/ Rate of successful requests and failures per second.
Overview failure)
● Data Access Performance - by
Namespaces
● Data Access Performance - by
Nodes
● Data Access Performance - by
Protocols
● Data Access Performance - Bandwidth Data access bandwidth of successful requests per second.
Overview (read/write)
● Data Access Performance - by
Nodes
● Data Access Performance - by
Protocols
● Data Access Performance - Failed Rate of failed requests per second, split by error type (user/system).
Overview Requests/s by
● Data Access Performance - by error type
Namespaces (user/system)
● Data Access Performance - by
Nodes
● Data Access Performance - by
Protocols
● Data Access Performance - Latency Latency of read/write requests.
Overview
● Data Access Performance - by
Nodes
● Data Access Performance - by
Protocols
● SSD Read Cache

● Data Access Performance - Successful Displays the rate of successful requests per second, by method, node,
Overview request drill and protocol.
● Data Access Performance - by down
Nodes
● Data Access Performance - Successful Rate of successful requests per second, by method.
Overview Requests/s by
● Data Access Performance - by Method
Nodes
● Data Access Performance - by Successful Rate of successful requests per second, by node.
Namespaces Requests/s by
● Data Access Performance - by Node
Nodes
● Data Access Performance - by
Protocols
● Data Access Performance - Successful Rate of successful requests per second, by protocol.
Overview Requests/s by
● Data Access Performance - by Protocol
Nodes
● Data Access Performance - Failures drill Displays the rate of failed requests per second, by method, node, and
Overview down protocol.
● Data Access Performance - by
Nodes
● Data Access Performance - Failed Rate of failed requests per second, by method.
Overview Requests/s by
● Data Access Performance - by Method
Nodes
● Data Access Performance - by Failed Rate of failed requests per second, by node.
Namespaces Requests/s by
● Data Access Performance - by Node
Nodes
● Data Access Performance - by
Protocols
● Data Access Performance - Failed Rate of failed requests per second, by protocol.
Overview Requests/s by
● Data Access Performance - by Protocol
Nodes
● Data Access Performance - Failed Rate of failed requests per second, by error code.
Overview Requests/s by
● Data Access Performance - by error code
Nodes
● Data Access Performance - by Compare TPS of Select multiple nodes and compare rates of successful requests per
Nodes successful second.
● Data Access Performance - by requests
Namespaces
● Data Access Performance - by
Protocols
Data Access Performance - by Compare TPS of Select multiple nodes and compare rates of failed requests per second,
Namespaces failed requests by error type (user/system).
● Data Access Performance - by Compare read Select multiple nodes and compare data access bandwidth (read) of
Nodes bandwidth successful requests per second.
● Data Access Performance - by
Protocols

● Data Access Performance - by Compare write Select multiple nodes and compare data access bandwidth (write) of
Nodes bandwidth successful requests per second.
● Data Access Performance - by
Protocols
● Data Access Performance - by Compare read Select multiple nodes and compare latency of read requests.
Nodes latency
● Data Access Performance - by
Protocols
● Data Access Performance - by Compare write Select multiple nodes and compare latency of write requests.
Nodes latency
● Data Access Performance - by
Protocols
● Data Access Performance - by Compare rate of Select multiple nodes and compare rates of failed requests per second,
Nodes failed requests/s split by error type (user/system).
● Data Access Performance - by
Protocols
Data Access Performance - by Request drill Rate of requests per second, split by node.
Namespaces down by nodes
● Disk Bandwidth - by Nodes Read or Write Indicates whether the row describes read data or write data.
● Disk Bandwidth - Overview
● Disk Bandwidth - by Nodes Nodes The number of nodes in the VDC. You can click the nodes number
● Disk Bandwidth - Overview to see the disk bandwidth metrics for each node. There is no Nodes
column when you have drilled down into the Nodes display for a VDC.
● Disk Bandwidth - by Nodes Total Total disk bandwidth that is used for either read or write operations.
● Disk Bandwidth - Overview
● Disk Bandwidth - by Nodes Hardware Rate at which disk bandwidth is used to recover data after a hardware
● Disk Bandwidth - Overview Recovery failure.

● Disk Bandwidth - by Nodes Erasure Rate at which disk bandwidth is used in system erasure coding
● Disk Bandwidth - Overview Encoding operations.

● Disk Bandwidth - by Nodes XOR Rate at which disk bandwidth is used in the XOR data protection
● Disk Bandwidth - Overview operations of the system. XOR operations occur for systems with
three or more sites (VDCs).
● Disk Bandwidth - by Nodes Consistency Rate at which disk bandwidth is used to check for inconsistencies
● Disk Bandwidth - Overview Checker between protected data and its replicas.

● Disk Bandwidth - by Nodes Geo Rate at which disk bandwidth is used to support geo replication
● Disk Bandwidth - Overview operations.

● Disk Bandwidth - by Nodes User Traffic Rate at which disk bandwidth is used by object users.
● Disk Bandwidth - Overview
Node Rebalancing Data Rebalanced Amount of data that has been rebalanced.
Node Rebalancing Pending Amount of data that is in the rebalance queue but has not been
Rebalancing rebalanced yet.
Node Rebalancing Rate of The incremental amount of data that was rebalanced during a specific
Rebalance (per time period. The default time period is one day.
day)
Process Health - Process List by Process The last time the process restarted on the node in the selected time
Node Restarts range. The maximum time range could be 5 days because it is limited
by the retention policy.

Process Health - Overview Avg. NIC Average bandwidth of the network interface controller hardware that
Bandwidth is used by the selected VDC or node.
Process Health - Process List by NIC Bandwidth Bandwidth of the network interface controller hardware that is used
Node by the selected VDC or node.
Process Health - Overview Avg. CPU Usage Average percentage of the CPU hardware that is used by the selected
VDC or node.
Process Health - Overview Avg. Memory Average usage of the aggregate memory available to the VDC or node.
Usage
● Process Health - by Nodes Relative NIC Percentage of the available bandwidth of the network interface
● Process Health - Overview (%) controller hardware that is used by the selected VDC or node.

● Process Health - by Nodes Relative Memory Percentage of the memory used relative to the memory available to
● Process Health - Overview (%) the selected VDC or node.
● Process Health - Process List
by Node
● Process Health - by Nodes CPU Usage Percentage of the node's CPU used by the process. The list of
● Process Health - Process List processes that are tracked is not the complete list of processes
by Node running on the node. The sum of the CPU used by the processes is
not equal to the CPU usage shown for the node.
Process Health - by Nodes Memory Usage The memory used by the process.
● Process Health - by Nodes Relative Memory Percentage of the memory used relative to the memory available to
● Process Health - Overview (%) the process.
● Process Health - Process List
by Node
Process Health - Process List by Avg. # Thread Average number of threads used by the process.
Node
Process Health - Process List by Last Restart The last time the process restarted on the node.
Node
Process Health - by Nodes Host -
Process Health - Process List by Process -
Node
Recovery Status Amount of Data With the Current filter selected, this is the logical size of the data yet
to be Recovered to be recovered.
● When a historical period is selected as the filter, the meaning of
Total Amount Data to be Recovered is the average amount of
data pending recovery during the selected time.
● For example, if the first hourly snapshot of the data showed 400
GB of data to be recovered in a historical time period and every
other snapshot showed 0 GB waiting to be recovered, the value of
this field would be 400 GB divided by the total number of hourly
snapshots in the period.
SSD Read Cache Disk Usage Used SSD space by Read Cache
SSD Read Cache Disk Capacity Total SSD disk capacity
Tech Refresh: Data Migration Remaining This panel shows a graph of the remaining volume on source nodes.
Volume to
Migrate
Tech Refresh: Data Migration Migration Speed This panel shows a graph of the data migration speed on source nodes.

Tech Refresh: Data Migration Data Migration Detailed status of migration on source nodes. Migration speed and
Status predictions are calculated based on last 1 hour of currently selected
time interval.
Top buckets Top Buckets by Top used buckets by size.
Size
Top buckets Top Buckets by Top used buckets by object count.
Object Count
Top buckets Time of The time at which the displayed metrics of Top Buckets dashboard
Calculation were calculated.

View mode
Steps
1. To view a dashboard panel in view mode, click the title of the panel, for example TPS (success/failure) > View.
The panel opens in view mode or in full-screen mode.
2. Click the Back to dashboard icon to return to the dashboards view.

Export CSV
Steps
1. To export the dashboard data to .csv format, click the title of a dashboard panel, for example TPS (success/failure) > More >
Export CSV.
The Export CSV window appears.
You can customize the CSV output by modifying the Mode and Date Time Format attributes and by selecting or clearing the
Excel CSV Dialect option.
2. Click Export > Save to export the dashboard data to .csv format to your local storage.

Data Access Performance - Overview


The Data Access Performance - Overview dashboard is the default.
In the Data Access Performance - Overview dashboard, you can monitor for all nodes in the VDC:
● TPS (success/failure)
● Bandwidth (read/write)
● Failed Requests/s by error type (user/system)
● Latency
● Successful Requests/s by Method
● Successful Requests/s by Protocol
● Failed Requests/s by Method
● Failed Requests/s by Protocol
● Failed Requests/s by error code
To view the Data Access Performance - Overview dashboard in the ECS Portal, select Advanced Monitoring.
Click Successful requests drill down to see the successful requests by all the methods, nodes, and protocols.
Click Failures drill down to see the failed requests by all the methods, nodes, protocols, and error code.
Click Related dashboards to view the other dashboards, with the selected time.

Data Access Performance - by Namespaces

In the Data Access Performance - by Namespaces dashboard, you can monitor for namespaces:
● TPS (success/failure)
● Failed Requests/s by error type (user/system)
● Successful Requests/s by Node
● Failed Requests/s by Node
● Compare TPS of successful requests
● Compare TPS of failed requests
To view the Data Access Performance - by Namespaces dashboard in the ECS Portal, select Advanced Monitoring >
Related dashboards > Data Access Performance - by Namespaces.
All the namespace data are visible in the default view. To select a namespace, click the legend parameter for the namespace
below the graph.
Requests drill down by nodes shows the successful and failed requests by node.
Compare: select multiple namespaces to compare TPS of successful and failed requests.

Data Access Performance - by Nodes

In the Data Access Performance - by Nodes dashboard, you can monitor for nodes in a VDC:
● TPS (success/failure)
● Bandwidth (read/write)
● Failed Requests/s by error type (user/system)
● Latency
● Successful Requests/s by Method
● Successful Requests/s by Node
● Successful Requests/s by Protocol
● Failed Requests/s by Method
● Failed Requests/s by Node
● Failed Requests/s by Protocol
● Failed Requests/s by error code
● Compare TPS of successful requests
● Compare TPS of failed requests
● Compare read bandwidth
● Compare write bandwidth
● Compare read latency
● Compare write latency
To view the Data Access Performance - by Nodes dashboard in the ECS Portal, select Advanced Monitoring > Related
dashboards > Data Access Performance - by Nodes.
Data for all the nodes are visible in the default view. To select data for a node, click the legend parameter for the node below
the graph.
Successful requests drill down shows the successful requests by method, node, and protocol.
Failures drill down shows the failed requests by method, node, protocol, and error code.
Compare: select multiple nodes to compare TPS of successful and failed requests, read/write bandwidth,
and read/write latency.

Data Access Performance - by Protocols

In the Data Access Performance - by Protocols dashboard, based on the protocol, you can monitor:
● TPS (success/failure)
● Bandwidth (read/write)
● Failed Requests/s by error type (user/system)

● Latency
● Successful Requests/s by Node
● Failed Requests/s by Node
● Compare TPS of successful requests
● Compare TPS of failed requests
● Compare read bandwidth
● Compare write bandwidth
● Compare read latency
● Compare write latency
To view the Data Access Performance - by Protocols dashboard in the ECS Portal, select Advanced Monitoring > Related
dashboards > Data Access Performance - by Protocols.
Data for all the protocols are visible in the default view. To select data for a protocol, click the legend parameter for the protocol
below the graph.
Requests drill down by nodes shows the successful and failed requests by node.
Compare: select multiple protocols to compare TPS of successful and failed requests, read/write bandwidth,
and read/write latency.

Disk Bandwidth - by Nodes

You can use the Disk Bandwidth - by Nodes dashboard to monitor the disk usage metrics by read or write operations at the
node level. The dashboard displays the latest values.
To view the Disk Bandwidth - by Nodes dashboard, click Advanced Monitoring > expand Data Access Performance -
Overview > Disk Bandwidth - by Nodes

Disk Bandwidth - Overview

You can use the Disk Bandwidth - Overview dashboard to monitor the disk usage metrics by read or write operations at the
VDC level.
To view the Disk Bandwidth - Overview dashboard, click Advanced Monitoring > expand Data Access Performance -
Overview > Disk Bandwidth - Overview

Node Rebalancing

You can use the Node Rebalancing dashboard to monitor the status of data rebalancing operations when nodes are added to,
or removed from, a cluster. Node rebalancing is enabled by default at installation. Contact your customer support representative
to disable or re-enable this feature.
To view the Node Rebalancing dashboard, click Advanced Monitoring > expand Data Access Performance - Overview >
Node Rebalancing
A series of interactive graphs shows the amount of data rebalanced, the amount pending rebalancing, and the rate of
rebalancing in bytes over time.
Node rebalancing works only for new nodes that are added to the cluster.

Process Health - by Nodes

You can use the Process Health - by Nodes dashboard to monitor, for each node of the VDC, the use of network interface, CPU,
and available memory. The dashboard displays the latest values, and the history graphs display values in the selected range.
To view the Process Health - by Nodes dashboard, click Advanced Monitoring > expand Data Access Performance -
Overview > Process Health - by Nodes

Process Health - Overview

You can use the Process Health - Overview dashboard to monitor the VDC use of network interface, CPU, and available
memory. The dashboard displays the latest average values and the history graphs display values in the selected time range.
To view the Process Health - Overview dashboard, click Advanced Monitoring > expand Data Access Performance -
Overview > Process Health - Overview

Process Health - Process List by Node

You can use the Process Health - Process List by Node dashboard to monitor each process's use of CPU and memory, average
thread count, and last restart time in the selected time range. The dashboard displays the latest values in the selected time
range.
To view the Process Health - Process List by Node dashboard, click Advanced Monitoring > expand Data Access
Performance - Overview > Process Health - Process List by Node

Recovery Status

You can use the Recovery Status dashboard to see:


● The latest value of the logical size of the data yet to be recovered in the selected time range, and
● History of the amount of data that is pending recovery in the selected time range.
To view the Recovery Status dashboard, click Advanced Monitoring > expand Data Access Performance - Overview >
Recovery Status.

SSD Read Cache


ECS is upgraded to enable SSD caching. There is a single SSD read cache drive per node. The SSD read cache feature is
implemented on ECS Gen2 U-Series and Gen3 hardware.
If a VDC has a mixed hardware configuration where some nodes cannot support SSD read cache, then the SSD read cache
feature is not supported in that VDC.
You can use the SSD Read Cache dashboard to monitor the total SSD disk capacity and the disk space that is used by the SSD read cache.
NOTE: Nodes that do not have SSD disks are also listed in the node selection drop-down, but their values are 0 because
they have no SSD disks.
To view the SSD Read Cache dashboard, click Advanced Monitoring > expand Data Access Performance - Overview >
SSD Read Cache
See ECS Solve Online for details.

Tech Refresh: Data Migration


You can use the Tech Refresh: Data Migration dashboard to monitor the data migration off and on a node or cluster.
To view the Tech Refresh: Data Migration dashboard, click Advanced Monitoring > expand Data Access Performance -
Overview > Tech Refresh: Data Migration

Top Buckets
ECS is upgraded with a mechanism in metering to calculate the number of buckets with top utilization based on total
object size and count.
Statistics of buckets with top utilization for the system are displayed in the monitoring dashboards. The number of buckets that are
displayed on the monitoring dashboard is a configurable value.
To view the Top buckets dashboard, click Advanced Monitoring > expand Data Access Performance - Overview > Top
buckets.

Automatic Metering Reconstruction
Automatic metering reconstruction is a mechanism to reconstruct the metering statistics completely.
Metering is responsible for storing the statistics for utilization by namespace and bucket that is based on object size and
count. When an object is created in a bucket, then the statistics are reported to the metering service where the statistics
are aggregated and stored. Statistics are aggregated and mapped to a time which is the nearest multiple of five minutes. For
example, objects that are created at 10:04:59 pm are mapped to time at 10:00:00 pm. The metering statistics are stored in time
series format to provide historical view of the statistics and to serve billing sample queries. The statistics are displayed in a time
window.
As a result of logic errors in the implementation of metering and of blob service side operations, wrong statistics are reported to
metering. Incorrect metering information gets compounded and remains inaccurate from that point forward. Automatic metering
reconstruction is a mechanism to overcome the problem of erroneous statistics.
This feature is disabled in ECS 3.5.0.0. You have to enable it manually.
The automatic reconstruction is invoked in the following scenarios:
● During upgrade
● When the system recovers from a PSO

Share Advanced Monitoring Dashboards


The Share dashboard icon enables you to create a direct link to the dashboard or panel, share a snapshot of an interactive
dashboard publicly, and export the dashboard to a JSON file.
For procedures on sharing the dashboard link, dashboard snapshot, and dashboard as a JSON file, refer to the Grafana
documentation.

Flux API
The Flux API enables you to retrieve time series database data by sending REST queries using curl. You can get raw data from
the fluxd service in a way similar to using the Dashboard API. You must obtain a token and provide the token in the requests.

Prerequisites
Requires one of the following roles:
● SYSTEM_ADMIN
● SYSTEM_MONITOR
Request payload examples

application/json - JSON format:

{
"query": "from(bucket:\"monitoring_main\") |> range(start: -30m) |> filter(fn: (r) =>
r._measurement == \"statDataHead_performance_internal_transactions\")"
}

application/vnd.flux - CSV format:

query=from(bucket: "monitoring_main")
|> range(start: -30m)
|> filter(fn: (r) => r._measurement == "statDataHead_performance_internal_transactions")

Steps
1. Generate a token.

Token:

admin@ecs:> tok=$(curl -iks https://localhost:4443/login -u emcmonitor:#### | grep X-SDS-AUTH-TOKEN)
admin@ecs:/> echo $tok
X-SDS-AUTH-TOKEN:****

#### represents a password.
**** represents an X-SDS-AUTH-TOKEN value.
2. Run the query.
Curl arguments vary depending on the output format (JSON or CSV). See the examples for details.
Example
JSON example

admin@ecs:/> curl https://localhost:4443/flux/api/external/v2/query -XPOST -k -sS -H
"$tok" -H 'accept:application/json' -H 'content-type:application/json' -d '{
"query": "from(bucket:\"monitoring_main\") |> range(start: -30m) |> filter(fn: (r) =>
r._measurement == \"statDataHead_performance_internal_transactions\")" }'
{
"Series": [
{
"Datatypes": [
"long",
"dateTime:RFC3339",
"dateTime:RFC3339",
"dateTime:RFC3339",
"long",
"string",
"string",
"string",
"string",
"string",
"string"
],
"Columns": [
"table",
"_start",
"_stop",
"_time",
"_value",
"_field",
"_measurement",
"host",
"node_id",
"process",
"tag"
],
"Values": [
[
"0",
"2020-03-10T09:54:31.207799855Z",
"2020-03-10T10:24:31.207799855Z",
"2020-03-10T09:56:43Z",
"1",
"failed_request_counter",
"statDataHead_performance_internal_transactions",
"ecs.lss.emc.com",
"28cd473e-ca45-4623-b30d-0481c548a650",
"statDataHead",
"dashboard"
],
[
"0",
"2020-03-10T09:54:31.207799855Z",
"2020-03-10T10:24:31.207799855Z",
"2020-03-10T10:01:43Z",
"1",
"failed_request_counter",
"statDataHead_performance_internal_transactions",
"ecs.lss.emc.com",
"28cd473e-ca45-4623-b30d-0481c548a650",
"statDataHead",

"dashboard"
],
[
"0",
"2020-03-10T09:54:31.207799855Z",
"2020-03-10T10:24:31.207799855Z",
"2020-03-10T10:06:43Z",
"1",
"failed_request_counter",
"statDataHead_performance_internal_transactions",
"ecs.lss.emc.com",
"28cd473e-ca45-4623-b30d-0481c548a650",
"statDataHead",
"dashboard"
],

CSV example

admin@ecs:> curl https://localhost:4443/flux/api/external/v2/query -XPOST -k -sS -H
"$tok" -H 'accept:application/csv' -H 'content-type:application/vnd.flux' -d
'from(bucket:"monitoring_main") |> range(start:-30m) |> filter(fn: (r) => r._measurement
== "statDataHead_performance_internal_transactions")'
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,long,string,stri
ng,string,string,string,string
#group,false,false,true,true,false,false,true,true,true,true,true,true
#default,_result,,,,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,host,node_id,process,tag
,,0,2020-03-10T09:58:59.049910533Z,2020-03-10T10:28:59.049910533Z,2020-03-10T10:01:43Z,1,
failed_request_counter,statDataHead_performance_internal_transactions,ecs.lss.emc.com,28c
d473e-ca45-4623-b30d-0481c548a650,statDataHead,dashboard
,,0,2020-03-10T09:58:59.049910533Z,2020-03-10T10:28:59.049910533Z,2020-03-10T10:06:43Z,1,
failed_request_counter,statDataHead_performance_internal_transactions,ecs.lss.emc.com,28c
d473e-ca45-4623-b30d-0481c548a650,statDataHead,dashboard
,,0,2020-03-10T09:58:59.049910533Z,2020-03-10T10:28:59.049910533Z,2020-03-10T10:11:43Z,1,
failed_request_counter,statDataHead_performance_internal_transactions,ecs.lss.emc.com,28c
d473e-ca45-4623-b30d-0481c548a650,statDataHead,dashboard
,,0,2020-03-10T09:58:59.049910533Z,2020-03-10T10:28:59.049910533Z,2020-03-10T10:16:43Z,1,
failed_request_counter,statDataHead_performance_internal_transactions,ecs.lss.emc.com,28c
d473e-ca45-4623-b30d-0481c548a650,statDataHead,dashboard

Monitoring list of metrics


The following tags have common values across all measurements:
● host - name of the data node
● node_id - ID of the data node
● tag - internal, set to dashboard

Monitoring list of metrics: Non-Performance

Database monitoring_main
Metrics in this database are raw; each is split by data node, that is, all have host and node_id tags.

Data for ECS Service I/O Statistics

Information:
Measurements in this section have the following structure:

service_IO_Statistics_data_read - for read I/O counters

service_IO_Statistics_data_write - for write I/O counters

Service is the name of the ECS service that produces the measurement, for example, blob, cm, georcv,
statDataHead.

For example,

blob_IO_Statistics_data_read
cm_IO_Statistics_data_write

Measurement: blob_IO_Statistics_data_read
...
Tags: host, node_id, process, tag
Fields: read_CCTotal (float, bytes)
read_ECTotal (float, bytes)
read_GEOTotal (float, bytes)
read_RECOVERTotal (float, bytes)
read_USERTotal (float, bytes)
read_XORTotal (float, bytes)

Measurement: blob_IO_Statistics_data_write
...
Tags: host, node_id, process, tag
Fields: write_CCTotal (integer)
write_ECTotal (integer)
write_GEOTotal (integer)
write_RECOVERTotal (integer)
write_USERTotal (integer)
write_XORTotal (integer)
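
For example, a minimal curl sketch that reads one of these counters, using the same curl pattern as the query examples
earlier in this chapter (the $tok token from the Flux API steps is assumed; the one-hour range is chosen only for illustration):

# Illustrative sketch: per-node user-traffic read bytes from blob_IO_Statistics_data_read.
admin@ecs:/> curl https://localhost:4443/flux/api/external/v2/query -XPOST -k -sS \
-H "$tok" -H 'accept:application/csv' -H 'content-type:application/vnd.flux' -d \
'from(bucket: "monitoring_main")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "blob_IO_Statistics_data_read" and r._field == "read_USERTotal")
  |> keep(columns: ["_time", "_value", "host"])'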

Data for SSD Read cache

Measurement: blob_SSDReadCache_Stats
Tags: host, id, last, node_id, process
Fields: +Inf (integer)
0.0 (integer)
1000.0 (integer)
25000.0 (integer)
5000.0 (integer)
rocksdb_disk_capacity_failure_counter (integer)
rocksdb_disk_usage_counter_bytes (integer)
rocksdb_disk_usage_percentage_counter (integer)
ssd_capacity_counter_bytes (integer)

CM statistics
These statistics represent processes in the ECS service CM, such as BTree GC, chunk management, and erasure coding.

Measurement: cm_BTREE_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_candidate_garbage_btree_gc_level_0 (integer)
accumulated_candidate_garbage_btree_gc_level_1 (integer)
accumulated_detected_data_btree_level_0 (integer)
accumulated_detected_data_btree_level_1 (integer)
accumulated_reclaimed_data_btree_level_0 (integer)
accumulated_reclaimed_data_btree_level_1 (integer)
candidate_chunks_btree_gc_level_0 (integer)
candidate_chunks_btree_gc_level_1 (integer)
candidate_garbage_btree_gc_level_0 (integer)
candidate_garbage_btree_gc_level_1 (integer)
copy_candidate_chunks_btree_gc_level_0 (integer)
copy_candidate_chunks_btree_gc_level_1 (integer)
copy_completed_chunks_btree_gc_level_0 (integer)
copy_completed_chunks_btree_gc_level_1 (integer)
copy_waiting_chunks_btree_gc_level_0 (integer)
copy_waiting_chunks_btree_gc_level_1 (integer)
deleted_chunks_btree_level_0 (integer)
deleted_chunks_btree_level_1 (integer)
deleted_data_btree_level_0 (integer)

deleted_data_btree_level_1 (integer)
full_reclaimable_chunks_btree_gc_level_0 (integer)
full_reclaimable_chunks_btree_gc_level_1 (integer)
reclaimed_data_btree_level_0 (integer)
reclaimed_data_btree_level_1 (integer)
usage_between_0%_and_5%_chunks_btree_gc_level_0 (integer)
usage_between_0%_and_5%_chunks_btree_gc_level_1 (integer)
usage_between_10%_and_15%_chunks_btree_gc_level_0 (integer)
usage_between_10%_and_15%_chunks_btree_gc_level_1 (integer)
usage_between_5%_and_10%_chunks_btree_gc_level_0 (integer)
usage_between_5%_and_10%_chunks_btree_gc_level_1 (integer)
verification_waiting_chunks_btree_gc_level_0 (integer)
verification_waiting_chunks_btree_gc_level_1 (integer)

Measurement: cm_Chunk_Statistics
Tags: host, node_id, process, tag
Fields: chunks_copy (integer)
chunks_copy_active (integer)
chunks_copy_s0 (integer)
chunks_level_0_btree (integer)
chunks_level_0_btree_active (integer)
chunks_level_0_btree_active_index_page (integer)
chunks_level_0_btree_active_leaf_page (integer)
chunks_level_0_btree_index_page (integer)
chunks_level_0_btree_leaf_page (integer)
chunks_level_0_btree_s0 (integer)
chunks_level_0_btree_s0_index_page (integer)
chunks_level_0_btree_s0_leaf_page (integer)
chunks_level_0_journal (integer)
chunks_level_0_journal_active (integer)
chunks_level_0_journal_s0 (integer)
chunks_level_1_btree (integer)
chunks_level_1_btree_active (integer)
chunks_level_1_btree_active_index_page (integer)
chunks_level_1_btree_active_leaf_page (integer)
chunks_level_1_btree_index_page (integer)
chunks_level_1_btree_leaf_page (integer)
chunks_level_1_btree_s0 (integer)
chunks_level_1_btree_s0_index_page (integer)
chunks_level_1_btree_s0_leaf_page (integer)
chunks_level_1_journal (integer)
chunks_level_1_journal_active (integer)
chunks_level_1_journal_s0 (integer)
chunks_repo (integer)
chunks_repo_active (integer)
chunks_repo_s0 (integer)
chunks_typeII_ec_pending (integer)
chunks_typeI_ec_pending (integer)
chunks_undertransform_ec_pending (integer)
chunks_xor (integer)
data_copy (integer)
data_level_0_btree (integer)
data_level_0_btree_index_page (integer)
data_level_0_btree_leaf_page (integer)
data_level_0_journal (integer)
data_level_1_btree (integer)
data_level_1_btree_index_page (integer)
data_level_1_btree_leaf_page (integer)
data_level_1_journal (integer)
data_repo (integer)
data_repo_copy (integer)
data_xor (integer)
data_xor_shipped (integer)

Measurement: cm_EC_Statistics
Tags: host, node_id, process, tag
Fields: chunks_ec_encoded (integer)
chunks_ec_encoded_alive (integer)
data_ec_encoded (integer)
data_ec_encoded_alive (integer)

Measurement: cm_Geo_Replication_Statistics_Geo_Chunk_Cache
Tags: host, node_id, process, tag

Fields: Capacity_of_Cache (integer)
Number_of_Chunks (integer)

Measurement: cm_REPO_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_deleted_garbage_repo (integer)
accumulated_reclaimed_garbage_repo (integer)
deleted_chunks_repo (integer)
deleted_data_repo (integer)
ec_freed_slots (integer)
full_reclaimable_aligned_chunk (integer)
merge_copy_overhead_in_deleted_data_repo (integer)
merge_copy_overhead_in_reclaimed_data_repo (integer)
reclaimed_chunk_repo (integer)
reclaimed_data_repo (integer)
slots_waiting_shipping (integer)
slots_waiting_verification (integer)
total_ec_free_slots (integer)

Measurement: cm_Rebalance_Statistics
Tags: host, node_id, process, tag
Fields: bytes_rebalanced (integer)
bytes_rebalancing_failed (integer)
chunks_canceled (integer)
chunks_for_rebalancing (integer)
chunks_rebalanced (integer)
chunks_total (integer)
jobs_canceled (integer)
segments_for_rebalancing (integer)
segments_rebalanced (integer)
segments_rebalancing_failed (integer)
segments_total (integer)

Measurement: cm_Rebalance_Statistics_CoS
Tags: CoS, host, node_id, process, tag
Fields: bytes_rebalanced (integer)
bytes_rebalancing_failed (integer)
chunks_canceled (integer)
chunks_for_rebalancing (integer)
chunks_rebalanced (integer)
chunks_total (integer)
jobs_canceled (integer)
segments_for_rebalancing (integer)
segments_rebalanced (integer)
segments_rebalancing_failed (integer)
segments_total (integer)

Measurement: cm_Recover_Statistics
Tags: host, node_id, process, tag
Fields: chunks_to_recover (integer)
data_recovered (integer)
data_to_recover (integer)

Measurement: cm_Recover_Statistics_CoS
Tags: CoS, host, node_id, process, tag
Fields: chunks_to_recover (integer)
data_recovered (integer)
data_to_recover (integer)

SR statistics
These statistics represent processes in the ECS service SR, which is responsible for space reclamation.

Measurement: sr_REPO_GC_Statistics
Tags: host, node_id, process, tag
Fields: accumulated_merge_copy_overhead_in_full_garbage (integer)
accumulated_total_repo_garbage (integer)
full_reclaimable_repo_chunk (integer)
garbage_in_partial_sr_tasks (integer)
garbage_in_repo_usage (integer)
merge_copy_overhead_in_full_garbage (integer)

merge_way_gc_processed_chunks (integer)
merge_way_gc_src_chunks (integer)
merge_way_gc_targeted_chunks (integer)
merge_way_gc_tasks (integer)
total_repo_garbage (integer)
usage_between_0%_and_33.3%_repo_chunk (integer)
usage_between_33.3%_and_50%_repo_chunk (integer)
usage_between_50%_and_66.7%_repo_chunk (integer)

SSM statistics
These statistics represent processes in ECS storage manager service SSM.

Measurement: ssm_sstable_SSTable_SS
Tags: SS, SSTable, last, process, tag
Fields: allocatedSpace (integer)
availableFreeSpace (integer)
downDurationTotal (integer)
freeSpace (integer)
largeBlockAllocated (integer)
largeBlockAllocatedSize (integer)
largeBlockFreed (integer)
largeBlockFreedSize (integer)
pendingDurationTotal (integer)
pingerDurationTotal (integer)
smallBlockAllocated (integer)
smallBlockFreed (integer)
smallBlockFreedSize (integer)
smallBlockSize (integer)
state (string)
timeInStateTotal (integer)
totalSpace (integer)
upDurationTotal (integer)

Measurement: ssm_sstable_SSTable_SS_datamigration
Tags: SS, SSTable, last, process
Fields: status (integer)
totalCapacityToMigrate (integer)

Database monitoring_last

Service status, memory, and cache statistics

Measurement: blob_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: blob_Total_memory_and_disk_cache_size
Tags: Total_memory_and_disk_cache_size, host, last, node_id, process
Fields: Disk_cache_size (integer)
Memory_cache_size (integer)

Measurement: cm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: eventsvc_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: mm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)

NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: resource_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: rm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: sr_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Measurement: sr_Total_memory_and_disk_cache_size
Tags: Total_memory_and_disk_cache_size, host, last, node_id, process
Fields: Disk_cache_size (integer)
Memory_cache_size (integer)

Measurement: ssm_Process_status
Tags: Process_status, host, node_id, process
Fields: MemoryTableFreeSpacePercentagePerMinute (integer)
NumberofWritePageAllocationOutsideWriteCache (integer)

Export of configuration framework values

Measurement: dtquery_cmf
Tags: last, process
Fields: com.emc.ecs.chunk.gc.btree.enabled (integer)
com.emc.ecs.chunk.gc.btree.scanner.verification.enabled (integer)
com.emc.ecs.chunk.gc.repo.enabled (integer)
com.emc.ecs.chunk.gc.repo.verification.enabled (integer)
com.emc.ecs.chunk.rebalance.is_enabled (integer)
com.emc.ecs.objectgc.cas.enabled (integer)
com.emc.ecs.sensor.btree_sr_pending_mininum (integer)
com.emc.ecs.sensor.repo_sr_pending_mininum (integer)
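
As an illustrative sketch only (assuming these configuration values are queried through the Flux API in the same way as the
other monitoring_last measurements, with the $tok token from the Flux API steps), the latest value of the node rebalancing
flag could be read as follows:

# Illustrative sketch: latest value of the node rebalancing flag exported to monitoring_last.
admin@ecs:/> curl https://localhost:4443/flux/api/external/v2/query -XPOST -k -sS \
-H "$tok" -H 'accept:application/csv' -H 'content-type:application/vnd.flux' -d \
'from(bucket: "monitoring_last")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "dtquery_cmf" and r._field == "com.emc.ecs.chunk.rebalance.is_enabled")
  |> last()'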

Top bucket statistics

Measurement: mm_topn_bucket_by_obj_count_place
Tags: last, place, process, tag
Fields: bucketName (string)
namespace (string)
value (integer)

Measurement: mm_topn_bucket_by_obj_size_place
Tags: last, place, process, tag
Fields: bucketName (string)
namespace (string)
value (integer)
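
A minimal sketch of reading the latest top-buckets-by-size entries (these measurements back the Top Buckets dashboard;
the query assumes the same curl pattern and $tok token as the Flux API examples earlier in this chapter):

# Illustrative sketch: latest top buckets by object size, one series per "place" tag.
admin@ecs:/> curl https://localhost:4443/flux/api/external/v2/query -XPOST -k -sS \
-H "$tok" -H 'accept:application/csv' -H 'content-type:application/vnd.flux' -d \
'from(bucket: "monitoring_last")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "mm_topn_bucket_by_obj_size_place")
  |> last()
  |> keep(columns: ["_time", "_field", "_value", "place"])'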

Vnest membership and performance statistics

Measurement: vnestStat_membership_ismember
Tags: host, ismember, last, node_id, process
Fields: is_leader (string)

Measurement: vnestStat_performance_latency_type
Tags: host, id, last, node_id, process, type
Fields: +Inf (integer)
0.0 (integer)

1.0 (integer)
7999999.99999999 (integer)
825912.9477680004 (integer)
85266.52466135359 (integer)
8802.840841123942 (integer)
9.686250859269972 (integer)
908.7975284781536 (integer)
93.82345570870827 (integer)

Measurement: vnestStat_performance_transactions_from_type
Tags: from, host, last, node_id, process, type
Fields: failed_request_counter (integer)
succeed_request_counter (integer)

Database monitoring_op

Node system level statistics

Information:
Measurements listed in this section are from default Telegraf plugins. Here, the measurement
name equals the plugin name. Refer to the plugin documentation for more information.

For example, documentation for the Telegraf plugin "cpu" is available at
https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu.

Measurement: cpu
Tags: cpu, host, node_id, tag
Fields: usage_guest (float)
usage_guest_nice (float)
usage_idle (float)
usage_iowait (float)
usage_irq (float)
usage_nice (float)
usage_softirq (float)
usage_steal (float)
usage_system (float)
usage_user (float)

Measurement: disk
Tags: device, fstype, host, mode, node_id, path, tag
Fields: free (integer)
inodes_free (integer)
inodes_total (integer)
inodes_used (integer)
total (integer)
used (integer)
used_percent (float)

Measurement: diskio
Tags: ID_PART_ENTRY_UUID, SCSI_IDENT_SERIAL, SCSI_MODEL, SCSI_REVISION, SCSI_VENDOR,
host, name, node_id, tag
Fields: io_time (integer)
iops_in_progress (integer)
read_bytes (integer)
read_time (integer)
reads (integer)
weighted_io_time (integer)
write_bytes (integer)
write_time (integer)
writes (integer)

Measurement: linux_sysctl_fs
Tags: host, node_id, tag
Fields: aio-max-nr (integer)
aio-nr (integer)
dentry-age-limit (integer)
dentry-nr (integer)
dentry-unused-nr (integer)
dentry-want-pages (integer)

file-max (integer)
file-nr (integer)
inode-free-nr (integer)
inode-nr (integer)
inode-preshrink-nr (integer)

Measurement: mem
Tags: host, node_id, tag
Fields: active (integer)
available (integer)
available_percent (float)
buffered (integer)
cached (integer)
commit_limit (integer)
committed_as (integer)
dirty (integer)
free (integer)
high_free (integer)
high_total (integer)
huge_page_size (integer)
huge_pages_free (integer)
huge_pages_total (integer)
inactive (integer)
low_free (integer)
low_total (integer)
mapped (integer)
page_tables (integer)
shared (integer)
slab (integer)
swap_cached (integer)
swap_free (integer)
swap_total (integer)
total (integer)
used (integer)
used_percent (float)
vmalloc_chunk (integer)
vmalloc_total (integer)
vmalloc_used (integer)
wired (integer)
write_back (integer)
write_back_tmp (integer)

Measurement: net
Tags: host, interface, node_id, tag
Fields: bytes_recv (integer)
bytes_sent (integer)
bytes_sum (integer)
drop_in (integer)
drop_out (integer)
err_in (integer)
err_out (integer)
packets_recv (integer)
packets_sent (integer)
packets_sum (integer)
speed (integer)
utilization (integer)

Measurement: nstat
Tags: host, name, node_id, tag
Fields: IpExtInOctets (integer)
IpExtOutOctets (integer)
TcpInErrs (integer)
UdpInErrors (integer)

Measurement: processes
Tags: host, node_id, tag
Fields: blocked (integer)
dead (integer)
idle (integer)
paging (integer)
running (integer)
sleeping (integer)

stopped (integer)
total (integer)
total_threads (integer)
unknown (integer)
zombies (integer)

Measurement: procstat
Tags: host, node_id, process_name, tag, user
Fields: cpu_time (integer)
cpu_time_guest (float)
cpu_time_guest_nice (float)
cpu_time_idle (float)
cpu_time_iowait (float)
cpu_time_irq (float)
cpu_time_nice (float)
cpu_time_soft_irq (float)
cpu_time_steal (float)
cpu_time_stolen (float)
cpu_time_system (float)
cpu_time_user (float)
cpu_usage (float)
create_time (integer)
involuntary_context_switches (integer)
memory_data (integer)
memory_locked (integer)
memory_rss (integer)
memory_stack (integer)
memory_swap (integer)
memory_vms (integer)
nice_priority (integer)
num_fds (integer)
num_threads (integer)
pid (integer)
read_bytes (integer)
read_count (integer)
realtime_priority (integer)
rlimit_cpu_time_hard (integer)
rlimit_cpu_time_soft (integer)
rlimit_file_locks_hard (integer)
rlimit_file_locks_soft (integer)
rlimit_memory_data_hard (integer)
rlimit_memory_data_soft (integer)
rlimit_memory_locked_hard (integer)
rlimit_memory_locked_soft (integer)
rlimit_memory_rss_hard (integer)
rlimit_memory_rss_soft (integer)
rlimit_memory_stack_hard (integer)
rlimit_memory_stack_soft (integer)
rlimit_memory_vms_hard (integer)
rlimit_memory_vms_soft (integer)
rlimit_nice_priority_hard (integer)
rlimit_nice_priority_soft (integer)
rlimit_num_fds_hard (integer)
rlimit_num_fds_soft (integer)
rlimit_realtime_priority_hard (integer)
rlimit_realtime_priority_soft (integer)
rlimit_signals_pending_hard (integer)
rlimit_signals_pending_soft (integer)
signals_pending (integer)
voluntary_context_switches (integer)
write_bytes (integer)
write_count (integer)

Measurement: swap
Tags: host, node_id, tag
Fields: free (integer)
in (integer)
out (integer)
total (integer)
used (integer)
used_percent (float)

Measurement: system

Tags: host, node_id, tag
Fields: load1 (float)
load15 (float)
load5 (float)
n_cpus (integer)
n_users (integer)
uptime (integer)
uptime_format (string)

DT statistics

Measurement: dtquery_dt_dist_dt_node_id_type
Tags: dt_node_id, process, tag, type
Fields: count_i (integer)

Measurement: dtquery_dt_dist_host_dt_node_id
Tags: dt_node_id, process, tag
Fields: count_i (integer)

Measurement: dtquery_dt_dist_type_type
Tags: process, tag, type
Fields: count_i (integer)

Measurement: dtquery_dt_status
Tags: process, tag
Fields: total (integer)
unknown (integer)
unready (integer)

Measurement: dtquery_dt_status_detailed_type
Tags: process, tag, type
Fields: total (integer)
unknown (integer)
unready (integer)

Fabric agent statistics

Measurement: ecs_fabric_agent_dirstat_size_bytes
Tags: host, node_id, path, tag, url
Fields: gauge (float)

SR journal statistics

Measurement: sr_JournalParser_GC_RG_DT
Tags: DT, RG, last, process
Fields: majorMinorOfJournalRegion (string)
pendingChunks (integer)
timestampOfChunkRegion (string)
timestampOfJournalParserLastRun (string)

Measurement: sr_ObjectGC_CAS_RG
Tags: RG, last, process
Fields: STATUS (string)

Vnest Btree statistics

Measurement: vnestStat_btree
Tags: cumulative_stats, host, level, node_id, tag
Fields: level_count (float)

page_count (float)
size_bytes (float)

Database monitoring_vdc
Metrics in this database are calculated values over the whole VDC without reference to a particular data node.

Information:

Metrics below are aggregated over data nodes for raw measurements used in Grafana ECS UI.

Measurement: cq_disk_bandwidth
Tags: type_op ('read', 'write')
Fields: consistency_checker (float)
erasure_encoding (float)
geo (float)
hardware_recovery (float)
total (float)
user_traffic (float)
xor (float)

Measurement: cq_node_rebalancing_summary
Tags: none
Fields: data_rebalanced (integer)
pending_rebalance (integer)

Measurement: cq_process_health
Tags: none
Fields: cpu_used (float)
mem_used (float)
mem_used_percent (float)
nic_bytes (float)
nic_utilization (float)

Measurement: cq_recover_status_summary
Tags: none
Fields: data_recovered (integer)
data_to_recover (integer)

Monitoring list of metrics: Performance

Information about generic tag values


The following tags have common values across all measurements:
● process - internal, set to statDataHead
● head - type of protocol, for example, S3
● namespace - name of the namespace
● method - protocol-specific request method, for example, GET, POST, READ, WRITE

Database monitoring_main
Performance metrics in this database are raw; each is split by data node, that is, all have host and node_id tags.
Most of the integer fields are increasing counters, that is, values that increase over time. Increasing counters restart from zero after
a datahead service restart.
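
Because of this, a per-second rate is usually derived from these counters rather than plotting them directly. A minimal Flux
sketch (the 30-minute range and 1-second unit are illustrative; nonNegative: true suppresses the negative jumps caused by
counter resets):

// Illustrative sketch: per-second rate of successful requests derived from the increasing counter.
from(bucket: "monitoring_main")
  |> range(start: -30m)
  |> filter(fn: (r) => r._measurement == "statDataHead_performance_internal_transactions" and
                       r._field == "succeed_request_counter")
  |> derivative(unit: 1s, nonNegative: true)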

Measurement: statDataHead_performance_internal_error
Tags: host, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)

Measurement: statDataHead_performance_internal_error_code
Tags: code, host, node_id, process, tag
Fields: error_counter (integer)

Measurement: statDataHead_performance_internal_error_head
Tags: head, host, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)

Measurement: statDataHead_performance_internal_error_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: system_errors (integer)
user_errors (integer)

Measurement: statDataHead_performance_internal_latency
Tags: host, id, node_id, process, tag
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
111.6295328521717 (integer)
12461.15260479408 (integer)
23.183877401213103 (integer)
2588.0054039994393 (integer)
4.814963904455889 (integer)
537.4921713544796 (integer)
59999.999999999985 (integer)

Measurement: statDataHead_performance_internal_latency_head
Tags: head, host, id, node_id, process, tag
Fields: +Inf (integer)
0.0 (integer)
1.0 (integer)
111.6295328521717 (integer)
12461.15260479408 (integer)
23.183877401213103 (integer)
2588.0054039994393 (integer)
4.814963904455889 (integer)
537.4921713544796 (integer)
59999.999999999985 (integer)

Measurement: statDataHead_performance_internal_throughput
Tags: host, node_id, process, tag
Fields: total_read_requests_size (integer)
total_write_requests_size (integer)

Measurement: statDataHead_performance_internal_throughput_head
Tags: head, host, node_id, process, tag
Fields: total_read_requests_size (integer)
total_write_requests_size (integer)

Measurement: statDataHead_performance_internal_transactions
Tags: host, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_head
Tags: head, host, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_head_namespace
Tags: head, host, namespace, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)

Measurement: statDataHead_performance_internal_transactions_method
Tags: host, method, node_id, process, tag
Fields: failed_request_counter (integer)
succeed_request_counter (integer)

Database monitoring_vdc
Performance metrics in this database are calculated values over the whole VDC without reference to a particular data node.
Most of the values are:
● Rates (number of requests per second) - for all measurements not ending in "_delta"
● Delta values (the increase of a counter from the previous time stamp) - for all measurements ending in "_delta"
● Downsampled values (aggregated to one point per day) - for all measurements ending in "_downsampled"
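
For long time ranges the downsampled series is usually the practical choice. A minimal Flux sketch (the 30-day range is
chosen only for illustration):

// Illustrative sketch: one aggregated point per day of VDC-wide transaction counters.
from(bucket: "monitoring_vdc")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "cq_performance_transaction_downsampled")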

Measurement: cq_performance_error
Tags: none
Fields: system_errors (float)
user_errors (float)

Measurement: cq_performance_error_downsampled
Tags: none
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_code
Tags: code
Fields: error_counter (float)

Measurement: cq_performance_error_code_downsampled
Tags: code
Fields: error_counter (float)
Measurement: cq_performance_error_delta
Tags: none
Fields: system_errors_i (integer)
user_errors_i (integer)

Measurement: cq_performance_error_delta_downsampled
Tags: none
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_head
Tags: head
Fields: system_errors (float)
user_errors (float)

Measurement: cq_performance_error_head_downsampled
Tags: head
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_head_delta
Tags: head
Fields: system_errors_i (integer)
user_errors_i (integer)

Measurement: cq_performance_error_head_delta_downsampled
Tags: head
Fields: system_errors_i (integer)
user_errors_i (integer)
Measurement: cq_performance_error_ns
Tags: namespace
Fields: system_errors (float)
user_errors (float)

Measurement: cq_performance_error_ns_downsampled
Tags: namespace
Fields: system_errors (float)
user_errors (float)
Measurement: cq_performance_error_ns_delta
Tags: namespace
Fields: system_errors_i (integer)
user_errors_i (integer)

Measurement: cq_performance_error_ns_delta_downsampled
Tags: namespace
Fields: system_errors_i (integer)
user_errors_i (integer)

Measurement: cq_performance_latency
Tags: id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_latency_downsampled
Tags: id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_latency_head
Tags: head, id
Fields: p50 (float)
p99 (float)

Measurement: cq_performance_latency_head_downsampled
Tags: head, id
Fields: p50 (float)
p99 (float)
Measurement: cq_performance_throughput
Tags: none
Fields: total_read_requests_size (float)
total_write_requests_size (float)

Measurement: cq_performance_throughput_downsampled
Tags: none
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_throughput_head
Tags: head
Fields: total_read_requests_size (float)
total_write_requests_size (float)

Measurement: cq_performance_throughput_head_downsampled
Tags: head
Fields: total_read_requests_size (float)
total_write_requests_size (float)
Measurement: cq_performance_transaction
Tags: none
Fields: failed_request_counter (float)
succeed_request_counter (float)

Measurement: cq_performance_transaction_downsampled
Tags: none
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_delta
Tags: none
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)

Measurement: cq_performance_transaction_delta_downsampled
Tags: none
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_head
Tags: head
Fields: failed_request_counter (float)
succeed_request_counter (float)

Measurement: cq_performance_transaction_head_downsampled
Tags: head
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_head_delta
Tags: head
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)

Measurement: cq_performance_transaction_head_delta_downsampled
Tags: head
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)
Measurement: cq_performance_transaction_method
Tags: method

Fields: failed_request_counter (float)
succeed_request_counter (float)

Measurement: cq_performance_transaction_method_downsampled
Tags: method
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_ns
Tags: namespace
Fields: failed_request_counter (float)
succeed_request_counter (float)

Measurement: cq_performance_transaction_ns_downsampled
Tags: namespace
Fields: failed_request_counter (float)
succeed_request_counter (float)
Measurement: cq_performance_transaction_ns_delta
Tags: namespace
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)

Measurement: cq_performance_transaction_ns_delta_downsampled
Tags: namespace
Fields: failed_request_counter_i (integer)
succeed_request_counter_i (integer)

Flux API replacements for deprecated dashboard API

Processes statistics

Dashboard API

GET /dashboard/nodes/{id}/processes

GET /dashboard/processes/{id}

Flux API
Database:
● monitoring_op
Measurement:
● procstat (detailed info on available fields and tags: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/
procstat)
Fields:
● memory_rss - resident memory of a process (bytes)
● cpu_usage - CPU usage percentage for a process (percent used of a single CPU)
● num_threads - number of threads used by the process (int)
Tags:
● process_name- valid process names:
○ nvmeengine
○ nvmetargetviewer
○ dtsm
○ rack-service-manager
○ rpcbind
○ blobsvc
○ cm

○ coordinatorsvc
○ dataheadsvc
○ dtquery
○ ecsportalsvc
○ eventsvc
○ georeceiver
○ metering
○ objcontrolsvc
○ resourcesvc
○ transformsvc
○ vnest
○ fluxd
○ influxd
○ throttler
○ grafana-server
○ dockerd
○ fabric-agent
○ fabric-lifecycle
○ fabric-registry
○ fabric-zookeeper
● host - hostname (FQDN)
● node_id - host ID
NOTE:

To replace /dashboard/processes/{id}, filter on the r.process_name and r.node_id
fields that correspond to the "{id}" value.

For example, the id "330e4b8f-4491-4ec7-b816-7b10ac9c6abf-cm" corresponds to:

r.node_id == "330e4b8f-4491-4ec7-b816-7b10ac9c6abf"
r.process_name == "cm"

Example query:

from(bucket: "monitoring_op")
|> filter(fn: (r) => r._measurement == "procstat" and r._field == "memory_rss" and
r.process_name == "vnest" and r.host == "host_name")
|> range(start: -24h)
|> keep(columns: ["_time", "_value", "process_name"])

Example output:

#datatype,string,long,dateTime:RFC3339,long,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,process_name
,,0,2019-08-15T13:05:00Z,2505809920,vnest
,,0,2019-08-15T13:10:00Z,2505887744,vnest
,,0,2019-08-15T13:15:00Z,2506014720,vnest
,,0,2019-08-15T13:20:01Z,2506010624,vnest

Nodes statistics

Dashboard API

GET /dashboard/nodes/{id}

Database:
● monitoring_op
Measurement:
● cpu (detailed information on available fields and tags: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu)
Fields:
● usage_idle - idle CPU usage (percent)
Tags:
● host - hostname (FQDN)
● node_id - host ID
Example query:

from(bucket: "monitoring_op")
|> filter(fn: (r) => r._measurement == "cpu" and r.cpu == "cpu-total" and r._field ==
"usage_idle" and r.host == "host_name")
|> range(start: -24h)
|> keep(columns: ["_time", "_value", "host"])

Example output:

#datatype,string,long,dateTime:RFC3339,double,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,host
,,0,2019-08-15T13:20:00Z,19.549454477395525,host_name
,,0,2019-08-15T13:25:00Z,17.920104933062728,host_name
,,0,2019-08-15T13:30:00Z,18.050788903551002,host_name
,,0,2019-08-15T13:35:00Z,19.801364027505095,host_name
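
CPU utilization is not exported as a separate field. As a minimal sketch, it can be derived from usage_idle by
subtracting the idle percentage from 100; the map() step below is an illustrative addition, not part of the
dashboard API output.

from(bucket: "monitoring_op")
|> filter(fn: (r) => r._measurement == "cpu" and r.cpu == "cpu-total" and r._field ==
"usage_idle" and r.host == "host_name")
|> range(start: -24h)
|> map(fn: (r) => ({r with _value: 100.0 - r._value}))
|> keep(columns: ["_time", "_value", "host"])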

Measurement:
● mem (detailed information on available fields and tags: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem)
Fields:
● free - free memory on host (bytes)
Tags:
● host - hostname (FQDN)
● node_id - host ID
Example query:

from(bucket: "monitoring_op")
|> filter(fn: (r) => r._measurement == "mem" and r._field == "free" and r.host ==
"host_name")
|> range(start: -24h)
|> keep(columns: ["_time", "_value", "host"])

Example output:

#datatype,string,long,dateTime:RFC3339,long,string
#group,false,false,false,false,true
#default,_result,,,,
,result,table,_time,_value,host
,,0,2019-08-15T14:10:00Z,3181088768,host_name
,,0,2019-08-15T14:15:00Z,2988388352,host_name
,,0,2019-08-15T14:20:00Z,3002994688,host_name
,,0,2019-08-15T14:25:00Z,3115741184,host_name

Performance statistics

Dashboard API

GET /dashboard/nodes/{id}

GET /dashboard/zones/localzone

GET /dashboard/zones/localzone/nodes

Dashboard APIs
Lists the APIs that are changed or deprecated.

APIs changed in ECS 3.6.0.0


The following APIs are changed in ECS 3.6.0.0:
● /dashboard/zones/localzone
● /dashboard/zones/localzone/nodes
● /dashboard/nodes/{id}
● /dashboard/storagepools/{id}/nodes
From the above APIs, the following data are removed:

● nodeCpuUtilization*, nodeMemoryUtilizationBytes*, nodeMemoryUtilization*,
  nodeNicBandwidth*, nodeNicReceivedBandwidth*, nodeNicTransmittedBandwidth*,
  nodeNicUtilization*, nodeNicReceivedUtilization*, nodeNicTransmittedUtilization*
● capacityRebalanceEnabled, capacityRebalanced, capacityPendingRebalancing,
  capacityRebalancedAvg, capacityRebalanceRate, capacityPendingRebalancingAvg
● transactionReadLatency, transactionWriteLatency, transactionReadBandwidth,
  transactionWriteBandwidth, transactionReadTransactionsPerSec,
  transactionWriteTransactionsPerSec, transactionErrors.*
● diskReadBandwidthTotal, diskWriteBandwidthTotal, diskReadBandwidthEc,
  diskWriteBandwidthEc, diskReadBandwidthCc, diskWriteBandwidthCc,
  diskReadBandwidthRecovery, diskWriteBandwidthRecovery, diskReadBandwidthGeo,
  diskWriteBandwidthGeo, diskReadBandwidthUser, diskWriteBandwidthUser,
  diskReadBandwidthXor, diskWriteBandwidthXor

Alternative places to find removed data


The following table describes where to find replacements for the removed data. All data is accessible through the Flux API.
NOTE: Not all of the removed data have direct alternatives. Some of the removed metrics must be calculated from other
metrics.

Table 27. Alternative places to find removed data

1. Node system level data
Data removed: nodeCpuUtilization*, nodeMemoryUtilizationBytes*, nodeMemoryUtilization*,
nodeNicBandwidth*, nodeNicReceivedBandwidth*, nodeNicTransmittedBandwidth*,
nodeNicUtilization*, nodeNicReceivedUtilization*, nodeNicTransmittedUtilization*
Where replacement can be found: See Monitoring list of metrics: Non-Performance > Database monitoring_op >
Node system level statistics.
Measurements: cpu, mem, net

2. Rebalance related data

2.1 Data removed: capacityRebalanced, capacityPendingRebalancing, capacityRebalancedAvg,
capacityRebalanceRate, capacityPendingRebalancingAvg
Where replacement can be found: See Monitoring list of metrics: Non-Performance > Database monitoring_vdc.
Measurement: cq_node_rebalancing_summary

2.2 Data removed: capacityRebalanceEnabled
Where replacement can be found: See Monitoring list of metrics: Non-Performance > Database monitoring_last
> Export of configuration framework values.
Measurement: dtquery_cmf
Field: com.emc.ecs.chunk.rebalance.is_enabled (integer)

3. Transaction related data
Data removed: transactionReadLatency, transactionWriteLatency, transactionReadBandwidth,
transactionWriteBandwidth, transactionReadTransactionsPerSec,
transactionWriteTransactionsPerSec, transactionErrors*
Where replacement can be found: For VDC metrics, see Monitoring list of metrics: Performance > Database
monitoring_vdc. For Node metrics, see Monitoring list of metrics: Performance > Database monitoring_main.

4. Disk related data
Data removed: diskReadBandwidthTotal, diskWriteBandwidthTotal, diskReadBandwidthEc,
diskWriteBandwidthEc, diskReadBandwidthCc, diskWriteBandwidthCc,
diskReadBandwidthRecovery, diskWriteBandwidthRecovery, diskReadBandwidthGeo,
diskWriteBandwidthGeo, diskReadBandwidthUser, diskWriteBandwidthUser,
diskReadBandwidthXor, diskWriteBandwidthXor
Where replacement can be found: For VDC metrics, see Monitoring list of metrics: Non-Performance > Database
monitoring_vdc. For Node metrics, see Monitoring list of metrics: Non-Performance > Data for ECS
Service I/O Statistics.
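
As a sketch of how one of these replacements can be read with the Flux API, the capacityRebalanceEnabled
replacement can be queried from the dtquery_cmf measurement (the measurement, field, and bucket names are taken
from the table above; the range and keep() columns are illustrative):

from(bucket: "monitoring_last")
|> filter(fn: (r) => r._measurement == "dtquery_cmf" and r._field ==
"com.emc.ecs.chunk.rebalance.is_enabled")
|> range(start: -24h)
|> keep(columns: ["_time", "_value"])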

APIs removed in ECS 3.5.0.0


The following table lists the APIs that are removed in ECS 3.5.0.0:

Table 28. APIs removed in ECS 3.5.0.0

API Name             Syntax                                   Description
Get Process          GET /dashboard/processes/{id}            Gets the process instance details.
Get Node Processes   GET /dashboard/nodes/{id}/processes      Gets the details of processes in the node.

5
Examining Service Logs
Describes the location and content of ECS service logs.
Topics:
• ECS service logs

ECS service logs


Describes the location and content of ECS service logs.
You can access ECS service logs directly through an SSH session on a node. Change to the following directory:
/opt/emc/caspian/fabric/agent/services/object/main/log. You can also access the logs from the Service
Console. The following logs are provided:
NOTE:

The emcservice user cannot access service logs. When the node is locked using the platform lockdown feature, a user
cannot access service logs. Only an administrator who has permission to access the node can access the logs.
● authsvc.log: Records information from the authentication service
● blobsvc*.log: Records aspects of the binary large object (BLOB) service
● cassvc*.log: Records aspects of the CAS service
● coordinatorsvc.log: Records information from the coordinator service
● ecsportalsvc.log: Records information from the ECS Portal service
● eventsvc*.log: Records aspects of the event service. This information is available in the ECS Portal at Monitor >
Events
● hdfssvc*.log: Records aspects of the HDFS service
● objcontrolsvc.log: Records information from the object service
● objheadsvc*.log: Records aspects of the various object heads that are supported by the object service.
● provisionsvc*.log: Records aspects of the ECS provisioning service
● resourcesvc*.log: Records information that is related to global resources such as namespaces, buckets, and object users
● dataheadsvc-access.log: Records the aspects of the object heads supported by the object service, the file service
supported by HDFS, and the CAS service.



I
Document feedback
If you have any feedback or suggestions regarding this document, email [email protected].
