Student Guide - PowerScale Advanced Administration_Course Guide
ADVANCED
ADMINISTRATION
COURSE GUIDE
PARTICIPANT GUIDE
PowerScale Advanced Administration
Concepts
Module Objectives
A Day in the Life of a Storage Administrator
Disaster Recovery Introduction
PowerScale Disaster Resilience
OneFS Domains
Troubleshooting PowerScale Cluster
Prerequisite Skills
To understand the content and successfully complete this course, a student must
have a suitable knowledge base or skill set. The student must have an
understanding of:
• Networking fundamentals such as TCP/IP, DNS and routing
• PowerScale hardware troubleshooting and maintenance procedures
• PowerScale Concepts Course
• PowerScale Administration Course
The graphic shows the PowerScale Solutions Expert certification track. You can
leverage the Dell Technologies Proven Professional program to realize your full
potential: a combination of technology-focused and role-based training and exams
covering concepts and principles as well as the full range of Dell Technologies
hardware, software, and solutions. You can accelerate your career and your
organization’s capabilities.
The course consists of eight modules. The total time required to complete this
course content and lab exercises is approximately 5 days.
1. A Day in the Life of a Storage Administrator; Troubleshooting PowerScale Clusters
2. Networking Architecture
3. Multiprotocol Permissions; Protocols
4. Events & Alerts; SNMP
5. OneFS Job Engine; Job Engine Architecture
6. OneFS Services; SFSE
7. Disaster Recovery; Data Protection and Disaster Recovery; SyncIQ Advanced Concepts
8. Performance and Monitoring; DataIQ Deep Dive
Module Objectives
There is no "one size fits all" when defining the daily tasks of a storage
administrator. Some of the general administrative tasks may include monitoring the
cluster, addressing notifications, analyzing problems, formulating courses of action,
escalating issues, contacting technical support, and making configuration changes.
4: Storage administrators plan and analyze for future implications such as storage
growth, incorporation of new applications, revamping of old applications and
workflows, which may impact the ecosystem of the storage solution.
The first priority for an administrator may be to verify processes that run overnight
have completed, such as backups or remote synchronization tasks. The first view
may be the dashboard of the PowerScale cluster or clusters to get the status of the
system. The dashboard provides a glance at areas such as performance, capacity,
and alerts. If any area appears to have or indicates an issue, the administrator goes
into action.
A few questions to ponder: How are your storage administration tasks done? Are
there other tasks you do daily? Do you provide daily status reports? Other tasks for
discussion are postmortems, help desk tickets and requests, patch planning,
upgrade planning, hardware refresh rollout, proof of concepts, and so on.
Div-Gen provides discrete clinical data for university hospitals for specific patient
demographic information. Within their organization, they have about 6000
individuals accessing data on the cluster. Some of the users are from academic
institutions, while most are internal employees. Internally, the cluster is used for
home directories and file sharing.
1An example of reactive management is when a user quota is reached and the
user is unable to write further data to the cluster. The administrator must now
immediately increase the quota limit or notify the user to delete unneeded data.
The emergency could have been avoided if the impending problem had been
investigated and acted upon in its early stages.
• Snapshots3
• Backup and Recover4
• Replication5
• Data Tiering6
Notification Example
Let us take the example of a hard limit exceeded on a user quota to demonstrate
the combination of notifications and reactive management.
Monitoring is key to proactive management. Listed below are the most common
interfaces that a storage administrator may use on a daily basis. Click each tab to
learn more about the interface.
WebUI
The WebUI dashboard is a graphical tool that can be used to monitor different
cluster variables such as cluster size, storage efficiency, throughput and CPU,
active client connections, node status and so on. Administrators can also monitor
cluster alerts, job reports and status of different configurations such as Quotas
using the WebUI.
7 If capacity is available, a simple and temporary fix is increasing the quota limit.
9Data matching a certain pattern such as least accessed files or files older than
100 days can be tiered off to a different physical layer of storage to free up space.
10Administrators may request users to delete unneeded files and personal files in
order to free up space.
CLI
You can monitor the PowerScale cluster using different OneFS commands. The
CLI command outputs provide information on a particular configuration at a more
granular level. For example, to view information about a directory quota, the isi
quota command is used.
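For example, to list all quotas and then view a directory quota (the path shown is illustrative):
# isi quota quotas list
# isi quota quotas view /ifs/eng/sw directory
The view output includes the advisory, soft, and hard thresholds and the current usage for the quota.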
PAPI
Another advanced area to consider is taking advantage of the API calls. A chief
benefit of PAPI is its scripting simplicity, enabling administrators to automate their
storage administration. Hayden can develop and tailor specific areas to monitor
using APIs.
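As a sketch, a monitoring script can call PAPI over HTTPS on port 8080. The endpoint shown here (the SmartQuotas resource under API version 1) and the use of -k for a self-signed certificate are assumptions to verify against the OneFS API reference for your release:
# curl -k -u admin https://cluster.example.com:8080/platform/1/quota/quotas
The call returns JSON that can be parsed and fed into custom monitoring or reporting tools.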
DataIQ
InsightIQ
The Dell EMC PowerScale Information Hub connects users to a central hub of
information and experts to help maximize their current storage solution.
See the Dell support portal for all the technical white papers related to PowerScale.
Challenge
What Is a Disaster?
Data is the most valuable asset for an organization. Any event that can cause data
unavailability or data loss constitutes a disaster. A disaster can range from a
single file loss to complete data center loss. Based on the cause, disasters can be
classified into three broad categories: natural, technological, and man-made.
1: The data center may get physically damaged due to natural calamities such as
earthquakes, hurricanes, floods, and tornadoes occurring at the site. These events
are usually rare. Organizations plan and locate their data center in a place where
the probability of such events happening is minimal.
3: Human negligence is the most likely cause of a disaster. Examples include a
user accidentally overwriting or deleting a file, ignoring cluster warnings such as
quota advisory limits, or locking shared resources. Man-made causes also include
intentional attacks such as cyber attacks, acts of terrorism, data theft,
compromising data integrity, and so on.
11Source data can also be described as production data, primary data, active data,
read/write data, and others.
12 Based on the RPO, organizations plan for the frequency with which a backup or
a replica must be made. For example, if the RPO of a particular business
application is 24 hours, then backups are created every midnight.
13Based on the criticality of data, the RTO for an organization may vary. The
organization must be well prepared or equipped to overcome a disaster within the
RTO to avoid any business impact. For example, if an organization has an RTO of
two hours, data access must be restored within two hours. Both RPO and RTO are
measured in minutes, hours, or days.
Business Continuity (BC) and Disaster Recovery (DR) are often used
interchangeably, but they are two entirely different strategies, each of which plays a
significant role in safeguarding business operations.
Fact: According to the Data Breach 2019 Report by IBM®, the global
average cost of a data breach in the 2019 study was $3.92 million. The
study included 507 organizations in 16 countries and regions and
across 17 industry sectors.
14Business continuity accounts for non-data centric factors. Factors include human
resources, staffing, transportation, skill needs, infrastructure, hardware, software,
and other related CAPEX and OPEX items.
When a disaster occurs at the primary site, data and operations are shifted to the
secondary site until the primary site is restored. Based on the RTO, an organization
can implement the secondary site in 3 different ways:
• Hot Site16 - Fully equipped and configured to enable failover in a relatively short
RTO.
• Warm Site17 - Equipped and set up, but not configured to resume operations
immediately.
• Cold Site18 - Unprepared and needs to be set up and configured to resume
operations.
Organizations can fail after a disaster when they have no recovery plan, no
communication plan, unrealistic recovery goals, an inaccurate disaster recovery
plan, or unclear roles.
From the conceptualization to the realization of the BC plan, a life cycle of activities
can be defined for the BC process.
17 A warm site has the infrastructure unboxed and installed but it is not configured
as a disaster recovery solution. The RTO for a warm site is the time it takes to
restore data and configure the system for access. Because your data is not being
consistently replicated between production and target, there is greater latency for
failover, ranging from seconds to hours.
18 A cold site is the cheapest recovery option, but also the least effective one. Cold
site recovery ranges from powering on dormant systems to a full deployment of
hardware and software. The RTO drives the level of preparedness of the cold site.
Plan for short, medium, and long-term emergencies. Define what a short, medium,
and long-term emergency is. The definition can be in terms of dollars, business
lost, time, or fatigue. The plan should address the long-term effects on the business
if any of the services or functions cannot be restored.
A subset of the business continuity plan is the disaster recovery plan. The plan
should define the short, medium, and long-term contingencies.
1: Determine the disaster recovery team, their roles and responsibilities, and their
contact information. The team may include the DR lead, management team, facility
team, and the network, server, application, and storage teams.
2: The plan has a description and location of the recovery facilities, transportation
and accommodations details, and the location of data and backups.
4: The process for activating the plan needs to be clearly defined. The minimum
information is the who, what, when, where, and how. Who decides and who needs
to be contacted? What kind of disaster and what is the scope? What is the timeline
and what needs to happen when? Where is the data? How is access going to be
cutover?
5: Perform trial runs to identify weaknesses or failures and then revise and test the
plan again.
To demonstrate RTO and RPO in a disaster recovery plan, consider the following
scenario for the Div-Gen organization:
19If a catastrophic event causes data unavailability, then the eight-hour RTO
requires secondary site data access within eight hours.
20A hot or warm site may simply require failover steps. A cold site may need
powering up systems and restoring data from tapes, probably not realistic given an
eight-hour RTO.
• Modern building structures for the data center to withstand natural calamities
such as earthquakes and fires.
• Security measures such as CCTV, entry badges, and guards.
• Employing redundant power supplies for the data center.
• Ensuring high availability of software and hardware. For example, taking data
backups and replicas, stocking hot-swap components, and so on.
• Warm and hot sites to redirect operations when a disaster occurs at the primary
site.
• The ability to perform maintenance tasks while keeping the data available.
• Identifying gaps, risks, and vulnerabilities by creating and performing resilience tests.
3: Recovery addresses the question "The disaster has occurred, now what?".
Recovery typically is not immediate and access to the data may be delayed until it
is restored. Organizations with no plan or a poor plan are likely to fail after a
disaster. Some examples of the disaster recovery measures include:
Resistance
Resistance is having inherent protection that is built into the storage array to prevent damage.
Some examples of PowerScale resistance include:
* Dry-bulb temperature, 5.5⁰ C dew point to 60% relative humidity
** Dry-bulb temperature, -12⁰ C dew point and 8% to 85% relative humidity
21Gen 6 uses an M.2 drive as the journal, whereas Gen 6.5 hardware uses an
NVDIMM. In both cases, a dedicated battery backup of the vault drive helps
prevent data loss. When the data center loses power, writes to the PowerScale
cluster are preserved.
Resilience
PowerScale has inherent high availability features at both the hardware and
software level to reduce the impact of disaster. Some of the PowerScale Resilience
measures include:
23Each Gen 6.5 node has dual redundant power supply units. Gen 6 nodes are
added as node pairs, which provides power redundancy.
− SyncIQ24
− SnapshotIQ25
− SmartConnect26
− FlexProtect27
− CloudPools28
27 The FlexProtect job copies data to other drives across multiple nodes.
28Configures data to be stored outside the cluster on a target cloud solution such
as ECS.
Recovery
With the many OneFS services available and configured, it is possible to recover
from a disaster using snapshots, replicas, backups, or cloud tiered data. For
example, once a disaster occurs, SyncIQ can be used to resume operations at a
target site.
29The value of Eyeglass is its ability to replicate configuration data and orchestrate
failover and failbacks. It acts as a witness between the source and target clusters.
• When a disaster occurs and the primary site loses functionality, the data and
operations are failed over to the target site.
• Once the primary site is functional again, data and operations are failed back to
it.
Disaster Considerations
30Ideally, the facility has two power sources. Systems should have different and
redundant sources of power; should those fail, generators, and then battery
power. When the data center fails over to generator power, refueling should be
planned and facilitated.
31Mission-critical systems can be given higher protection by segregating them from
the other enterprise systems. The segregation can be both physical and logical.
SmartConnect
SmartFail
Power Redundancy
Listed here are the OneFS fault-tolerant functions with a brief definition.
Dynamic sector repair - The file system forces bad disk sectors to be rewritten elsewhere.
33Gen 6 uses an M.2 drive as the journal, whereas Gen 6.5 hardware uses an
NVDIMM. In both cases, a dedicated battery backup of the vault drive helps
prevent data loss. When the data center loses power, writes to the PowerScale
cluster are preserved. Large journals offer flexibility in determining when data should
be moved to the disk. Each node mirrors its journal to its peer node. A backup battery
helps maintain power while data is stored in the vault.
Automatic drive firmware updates - Automatic drive firmware updates for new and replacement drives.
OneFS fault tolerance is not within the scope of this training, but it is part of
PowerScale disaster resiliency. For more in-depth details, go to
https://www.emc.com/collateral/hardware/white-papers/h10588-isilon-data-
availability-protection-wp.pdf and review the white paper.
Protection Review
The following table provides an easy reference for all the protection levels.
Shown in the graphic is a representation of drive sleds with three drives, typical in
the F800, H500, and H400 nodes.
Example
• File size: 768 KB
• Protection: N+2d:1n
• Each stripe unit is 128 KB
• A 768 KB file written equals six stripe units
• The stripe can start on any disk in the failure domain
Neighborhood Review
In the illustration, the cluster goes from three disk pools to six; each color
represents a disk pool.
Before the split, there is a single neighborhood with three disk pools; after the split,
each neighborhood has three disk pools.
Shown in the graphic is a 40-node cluster to illustrate a chassis failure. The addition
of the 40th node splits the cluster into four neighborhoods, labeled NH 1 through NH 4.
Quorum
The graphic shows a 6-node cluster where 50% or more of the equivalent nodes
are down or removed.
Protection Level Review: The cluster provides data availability even when one or
multiple devices fail. Reed-Solomon erasure coding is used for data protection.
Protection is applied at the file level, enabling quick and efficient data recovery.
OneFS has +1n through +4n protection levels, providing protection for up to four
simultaneous component failures. A single failure can be an individual disk, a node,
or an entire Gen 6 chassis. A higher protection level than the minimum can be set
and can be controlled per cluster or directory level. Remember that Gen 6 requires
a minimum of 4 nodes of the same type. With mirroring, the cluster can recover
from N - 1 drive or node failures without sustaining data loss. For example, 4x
protection means that the cluster can recover from a three drive or three node
failures. Consider the trade-offs when making choices. Capacity requirements
increase as the protection level is increased. Data can be split into tiers and node
pools with different protection levels. Do the size of your node pools meet the
minimum required size for the protection level? Will the small performance penalty
impact your workflow? These are the considerations when determining how much
to increase the requested protection level to meet your objectives.
Data Layout Review: Drive failures represent the largest risk of data loss
especially as node pool and drive sizes increase. To illustrate the data protection
resiliency, let us begin with a refresher on the OneFS data layout using Gen 6
nodes. Gen 6 nodes have drive sleds with three, four, or six drives. As shown in the
graphic, each of the 4 nodes has 5 drive sleds and the first disk in each sled
belongs to the same disk pool. Likewise, the second disk in each sled belongs to a
different disk pool, as does the third disk in each sled. The different colors
represent the 3 different disk pools. All data, metadata, and FEC blocks are striped
across multiple nodes in the disk pool. The striping protects against single points of
failure and bottlenecks. All parts of a file are contained in a single disk pool. The
graphic shows an example file of 768 KB written to the disk pool noted in blue.
Containing files into a disk pool creates isolated drive failure zones. Disk pool
configuration is automatically done as part of the auto provisioning process and
cannot be configured manually. File data is broken into 128 KB data stripe units
consisting of up to 16 x 8 KB blocks per data stripe unit. A single file stripe width
can contain up to 16 x 128 KB data stripe units for a maximum size of 2 MB. A
large file has thousands of file stripes per file distributed across the disk pool. The
protection is calculated based on the requested protection level for each file stripe
using the data stripe units assigned to that file stripe. The higher the desired
protection level, the more FEC stripe units are calculated. This example
showcases the +2d:1n protection level on a 4 node cluster. The data can be rebuilt
if two disks fail, or if a node fails. Typically, higher value data is configured with a
higher protection level.
Neighborhood Review: Node pools are made up from groups of like-type nodes.
Gen 6 node pools are divided into neighborhoods for smaller, more resilient fault
domains. A Gen 6 node pool splits into two neighborhoods when adding the 20th
node. One node from each node pair moves into a separate neighborhood. In this
example, the cluster goes from three disk pools to six. In the figure, each color
represents a disk pool. From the 20th node through the 39th node, no two disks in
a given drive sled slot of a node pair share a neighborhood. The neighborhoods
split again when the node pool reaches 40 nodes. At 40 nodes, each node within a
chassis belongs to a separate neighborhood. The next neighborhood division
happens when the 80th node is added, and then again when the 120th node is
added. Given a protection of +2d:1n, the loss of a single chassis does not result in
a data unavailable or a data loss scenario. In a Gen 6.5 node pool, neighborhood
splitting and suggested protection policies are identical to Gen5.
Gen 6 Chassis Failure: A loss of a node does not automatically start reprotecting
data. Many times a node loss is temporary, such as a reboot. If N+1 data protection
is configured on a cluster, and one node fails, the data is accessible from every
other node in the cluster. If the node comes back online, the node rejoins the
cluster automatically without requiring a rebuild. If the node is physically removed, it
must also be SmartFailed. Only SmartFail a node when it must be removed from the
cluster permanently. Once the 40th node is added, the cluster splits into four
neighborhoods, labeled NH 1 through NH 4. The splits place each node in a
chassis into a failure domain different from the other three nodes in the chassis.
Also, every disk in a node is in a separate disk pool from the other node disks.
Having each node in a distinct neighborhood from other nodes in the chassis
allows for chassis failure. SmartFailing a failed node rebuilds the data on the free
space of the cluster. Adding a node distributes the data to the new node.
Quorum: For a quorum, more than half the nodes must be available over the
internal network. A 6 node Gen 6 cluster requires a 4-node quorum. Imagine a
cluster as a voting parliament where the simple majority wins all votes. If 50% or
more of the members are missing, there can be no vote. Without quorum, reads
can occur, but no new information is written to the cluster. The OneFS file system
becomes read-only when quorum is lost. The quorum also dictates the minimum
number of nodes required to support a given data protection level. As seen in the
earlier table, each protection level requires a minimum number of nodes. For
example, +2n needs a minimum of six Gen 6 nodes. Why? You can lose two
nodes and still have four Gen 6 nodes up and running; greater than 50%. Avoiding
under-protection also applies to maintaining redundancy of power sources and networks. In a
SyncIQ solution, if snapshots are used on a target cluster, consider the snapshot
schedule and maintenance over time. Define a snapshot expiration period on the
target, otherwise the replicated dataset can consume more capacity than intended.
The graphic shows the internal networks: int-a is the active network, and int-b is
the passive failover network.
Shown in the graphic are the areas to size a cluster for maximum resilience,
including forward error correction.
Example: +3d:1n1d protection, 3 FEC units.
Left image shows rear view of Gen 6 node and right image shows front view of Gen 6 node.
Decision point: How many nodes will give the best storage
efficiency?
The protection level is tunable and variable down to the file level. With Gen 6x, for
better resiliency and efficiency, using +2d:1n, +3d:1n1d, or +4d:2n is
recommended. Let us illustrate resiliency and efficiency with an example. Suppose the
workflow requires the +3d:1n1d protection that is recommended for large-capacity
drives. What is the most efficient cluster size that meets the requirement? The
cluster must be large enough to tolerate the loss of three disks and one node. The
graphic shows 15 stripe units (SU) and 3 FEC units for the maximum stripe width
for +3d:1n1d protection. A smaller cluster size does not use a stripe width of 15
data stripe units. 9 Gen 5 nodes or 10 Gen 6 nodes can meet the protection
requirements of the workflow at the maximum efficiency. In the example, if three
disks fail, the data can be rebuilt. If one node fails and one disk fails, the data can
be rebuilt.
High Availability
High availability and resilience are integral to OneFS from the lowest level on up.
For example, OneFS provides mirrored volumes for the root and /var file systems
using the Mirrored Device Driver (IMDD), stored on flash drives. OneFS also
automatically saves last known good boot partitions for further resilience.
SmartConnect software contributes to data availability by supporting dynamic NFS
failover and failback for Linux and UNIX clients and SMB3 continuous availability
for Windows clients. This ensures that when a node failure occurs, or preventative
maintenance is performed, all in-flight reads and writes are handed off to another
node in the cluster to finish its operation without any user or application
interruption.
Safety Margins
Safety margins are not wasted capacity, but a good use of capacity, and they are
important.
35Consider growth rate, protection levels, and how long it takes to order and install
nodes.
Resiliency is ensuring the cluster can continue working even if nodes and disks
malfunction. The key is leaving headroom in the cluster. Data protection can manage
the loss of a node, but reprotecting data requires the space to rebuild. The loss of a
disk or node can push the cluster capacity below the demands placed
on it. Examine data delivery variance at 80% and again at 85% capacity utilization.
High CPU levels, pushing the maximum connection counts per node, maximizing
drive IOPS, or creating snapshots as fast as they are deleted, leaves no headroom.
When the space availability of disk pools falls below the low-space threshold, the
job engine engages low-space mode. New jobs are not started in low-space
mode, and jobs that are not space-saving are paused. Once free space returns
above the low-space threshold, jobs that were paused for space are resumed.
Assess Cluster
Before assessing the cluster for a failure, ensure that the failure is within the
cluster.
1: PowerScale accounts for 1/3rd of the solution. Knowing where the breakdown
occurs helps to isolate the failure. If the issue is unique to a single user, then it is
likely a client problem. The customer owns and maintains the client and network
points of the topology, making 2/3rds of the workflow the customer responsibility.
2: If the clients can only access resources outside of the data center, then the
problem may be the network. If a user is denied access to a file share on the cluster,
the problem is likely in the authentication process. If the user cannot open a file,
then the issue could be permissions or ownership on the file.
3: If the user cannot save data to the file share, enforced quota limits may be
reached. Because of PowerScale’s resiliency, disk and node failures may have little
impact on client access. Conversely, the impact can become severe if the cluster
runs near full storage capacity, or other resources such as CPU and RAM are near
peak levels. In severe cases such as storage capacity near maximum and
encountering multiple disk failures, the failure can be near catastrophic.
Check Jobs
Check jobs from the web administration interface on the Cluster Management, Job
Operations page shown.
1: The Job Summary page shows FlexProtect running. The elapsed time and the
phase can be used as a rough indication of the total time the job may take. A job
has up to 6 phases depending on the job. Some phases are skipped and go
straight to phase 6 without going to 4 and 5.
2: The Job Types tab lists all the jobs and allows the administrator to manually start a
job.
5: The Impact Policies tab allows administrators to add or modify the impact
policies.
The job information is also provided using the isi job command.
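For example, from the CLI (the job ID shown is illustrative):
# isi job status
# isi job jobs list
# isi job jobs view 273
isi job status summarizes running, paused, and recently completed jobs, while isi job jobs view shows details such as the phase and elapsed time for a single job.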
DataIQ monitors cluster health independent of the cluster status. DataIQ monitors
multiple clusters with massive node counts. It also configures and receives alerts
that are based on limits and issues.
The graphic shows an excerpt of the WebUI page with error logs generated for
plug-ins and the DataIQ system.
The DataIQ server scans the managed storage, saves the results in an index, and
provides access to the index. DataIQ is an optimized data storage scan, index, and
in-memory search database platform that provides visibility to data spanning
multiple platforms.
Use InsightIQ to generate periodic reports. Detailing cluster events, plotting job
execution, and breaking out job workers are some of the many functions of
InsightIQ.
Under-Protected
Operating under-protected or disabling virtual hot spare (VHS) can gain disk
space for users, but risks filling the cluster to 100%.
The graphic shows the VHS reserve (0% to 20%) on top of the used capacity.
By default, all available free space on a cluster is used to rebuild data. Without
enough free space, jobs can stop, and a failed disk cannot SmartFail out of the
cluster. VHS allocation reserves space to rebuild data when a drive fails. VHS helps
assure that space is always available and protects data integrity if overuse of the
cluster space occurs. With VHS enabled, writes may stop even when space appears
to be available, because the VHS reserve is not deducted when viewing capacity,
such as in the isi status output. For example, setting two virtual drives or 3% of total
storage causes each node pool to reserve virtual drive space. The space is
equivalent to two drives or 3% of the total capacity for virtual hot spare, whichever
is larger. Free-space calculations exclude the space that is reserved for the virtual
hot spare. The reserved virtual hot spare free space is used for write operations
unless you select the option to deny new data writes.
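As a sketch, the VHS reserve can be reviewed and adjusted from the CLI; the option names below are those found in recent OneFS releases, and the values are examples only (verify with isi storagepool settings modify --help):
# isi storagepool settings view
# isi storagepool settings modify --virtual-hot-spare-limit-drives 2 --virtual-hot-spare-limit-percent 3 --virtual-hot-spare-deny-writes true
The deny-writes option controls whether the reserved space can be consumed by new data writes.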
Data Recovery
Decision Point: How will I recover lost data? Snaps? SyncIQ? Tape?
If the assessment of damage includes data loss, how is the lost data recovered?
Snapshots are a great way to recover files and directories, but if the loss includes
losing the needed snapshots, restoring from tape might be the solution. If the loss
is extensive or the restore time from tape fails the RTO requirement, failing back
from a target cluster may be a good option. Depending on the amount of data to
recover, restoring data over the WAN can be time intensive and exceed the RTO.
Challenge
Lab Assignment:
1) Explore the data protection levels and settings applied at different
levels.
2) Verify PowerScale data resilience when a node is down or SmartFailed.
OneFS Domains
OneFS Domains
A OneFS domain is a root directory to which certain operations apply; when an
action is performed against the domain, it applies only to that root directory and its
contents. SnapRevert, SmartLock, SyncIQ, and Snapshot are the four OneFS domain types.
View the information in each of the tabs for an overview of the OneFS domains. For
example, a SyncIQ domain is the root directory of a replication policy, also called the
SyncIQ root directory for the SyncIQ policy. When an action
such as a domain mark runs, OneFS applies the action only to that SyncIQ domain.
SyncIQ Domain
OneFS assigns SyncIQ domains to the source and target directories when creating
the replication policy.
OneFS functions only act on data within the scope of the SyncIQ domain. For
example, when a domain mark is initiated, the domain mark only tags LINs in the
SyncIQ domain.
40 The graphic shows that a SyncIQ policy replicates the Boston cluster homedirs
directory to the Seattle cluster homedirs directory. OneFS automatically creates the
Seattle cluster homedirs SyncIQ domain when the SyncIQ policy first runs.
SmartLock Domain
SnapshotIQ Domain
The graphic shows two snapshot domains, with domain IDs 5 and 6.
SnapRevert Domain
The SnapRevert job restores a snapshot in full to its top level directory. OneFS
assigns SnapRevert domains to directories within snapshots to prevent
modification of files and directories when reverting a snapshot. OneFS does not
automatically create SnapRevert domains. Click on each item for more information:
• Prevents writes.43
43A SnapRevert domain uses a piece of file system metadata and associated
locking to prevent writes to files in the domain while restoring to the last known
good snapshot.
44 You cannot revert a snapshot until creating a SnapRevert domain on its top level
directory.
You can delete SyncIQ domains or SnapRevert domains if you want to move
directories out of the domain. You cannot delete a SmartLock domain. OneFS
automatically deletes a SmartLock domain when you delete a SmartLock directory.
isi_pdm command
Use the isi_pdm utility to manage OneFS domains. The isi_pdm command has
the options to create, delete, exclude, include, show_exclusions,
showall, list, read, and write domains.
• list example:
# isi_pdm list /ifs/eng/sw All
[ 5.0100, 6.0100, 6.0300 ]
• showall example:
# isi_pdm showall
...
{
Base DomID = 6.0000
Owner LIN = 1:0033:00be
Ref count = 3
Ready = true
Nested tag count = 0
Rename tag count = 0
Modification time = 2018-11-13T06:34:31-0500
DomIDs = [ 6.0100, 6.0300 ]
}
Administrators can view the IFS domain IDs by using the isi get command. The
output is verbose so the example uses grep to show only the IFS Domain IDs
field. The first line of output is for the ./ directory (/ifs/eng/sw). The second line is
for the ../ directory (/ifs/eng), and the third is for a file in the /ifs/eng/sw directory.
• Example:
# isi get -g /ifs/eng/sw | grep "IFS Domain"
* IFS Domain IDs: {5.0100(Snapshot), 6.0100(Snapshot),
6.0300(WORM) }
* IFS Domain IDs: {5.0100(Snapshot) }
* IFS Domain IDs: {5.0100(Snapshot), 6.0100(Snapshot),
6.0300(WORM) }
• The output shows a nested snapshot domain and a SmartLock domain.
The list shows areas to consider when working with OneFS domains. Click each list
item for details.
• Copying many files into a OneFS domain can be a lengthy process. OneFS
must mark each file individually as belonging to the protection domain.
• The best practice is to create protection domains for directories while the
directories are empty, and then add files to the directory.
• The isi sync policies create command contains an --accelerated-failback
true option, which automatically marks the domain. The option can
decrease failback times (see the sketch after this list).
• If using SyncIQ to create a replication policy for a SmartLock compliance
directory, you must configure the SyncIQ and SmartLock compliance domains
at the same root directory level. A SmartLock compliance domain cannot be
nested inside a SyncIQ domain.
• If a domain prevents the modification or deletion of a file, you cannot create a
OneFS domain on a directory that contains that file. For example, if
/ifs/data/smartlock/file.txt is set to a WORM state by a SmartLock
domain, you cannot create a SnapRevert domain for /ifs/data/.
• You cannot move directories in or out of protection domains. However, you can
move a directory to another location within the same protection domain.
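A minimal sketch of creating a replication policy with accelerated failback; the policy name, directories, and target host are hypothetical, and the argument order should be verified with isi sync policies create --help:
# isi sync policies create homedirs-dr sync /ifs/eng/homedirs seattle-cluster.example.com /ifs/eng/homedirs --accelerated-failback true
Marking the domain when the policy is created avoids running a lengthy DomainMark job later during failback preparation.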
Troubleshooting Guides
Locating Guides
Using Guides
46 Troubleshooting specific issues utilizes the troubleshooting guides. Here you can
find step-by-step instructions to help you troubleshoot some common issues that
affect OneFS clusters. Some of these troubleshooting guides reference other
troubleshooting guides or refer to other Dell EMC PowerScale documents, such as
knowledge base articles or white papers.
47More troubleshooting guides are available for Technical Support only. These
guides involve specific process issues, or issues of a less common occurrence, or
may involve risky operations only to be performed under support supervision.
Guides Availability
New guides are added as they become available or new topics are recognized with
new versions of OneFS and new PowerScale hardware. Visit the Info Hub page
and the support page periodically to check for new guides, or existing guide
updates. The guides on the Info Hub page are grouped into related topic areas.
Under each topic area, specific links are listed for each available troubleshooting
guide.
Each guide is available for download as a PDF file. Guides vary in length and topic
focus. A single guide can contain any number of investigation topic areas. Take a
few moments to review the guides available for each topic area.
Troubleshooting Process
1: Client
When reaching the top layer of clients, most likely the issue is external to the
cluster, or a client configuration issue. To troubleshoot client connectivity issues, use
the Clients Cannot Connect To A Node guide.
2: Protocols
The protocols layer is next: protocols such as SMB, NFS, HDFS, and S3, and their
associated processes.
3: File System
The file system is the third layer. This includes many internal processes and
routines. To troubleshoot file system issues, use the Troubleshoot An
Unresponsive Node guide.
4: Hardware
The hardware is the next layer. Hardware issues are often masked in other
symptoms. To troubleshoot hardware issues, use the PowerScale Troubleshooting
Guide: Hardware - Top Level guide.
5: Network
Start with the network layer as the foundation. Many PowerScale issues are a
result of network issues, both the internal back-end network, and the client-side
front-end. This includes reaching external resources such as DNS servers48, and
internal issues with SmartConnect. The Troubleshooting Your SmartConnect
Configuration guide can be used to perform SmartConnect troubleshooting.
Troubleshooting Steps
Within each guide, there is a Content and Overview page or pages depending on
the length of the guide. It acts as a high-level link to separate flowcharts, and quick
access to associated appendixes.
The page numbers and the name of the guides are positioned the same on every
page for quick reference.
Using the guide flowcharts is straightforward if you are familiar with flowcharts.
1: A caution box warns that a particular step must be performed with great care to
prevent serious consequences. Caution notes must be followed. The symbol appears
on the explanation note and on the step it is associated with.
2: Each page begins with the page number or a start location. From there, follow
the arrows. Any page-related notes should be read and followed. A directional
arrow indicates the path through the process flow.
3: Follow the decision trees. These may be yes or no decisions, or may contain
multiple-choice questions and answers to find the appropriate direction to follow.
4: A document shape calls out supporting documentation for a process step. When
possible, these shapes contain links to the reference document. Sometimes the shape
is linked to a process step with a colored dot.
5: Every guide applies the troubleshooting guide process to the start page. Each
flowchart starts with a note respective to the troubleshooting guide. The process
begins at the Start symbol and requires a step to log in to the cluster. To continue
the process, select an appropriate Go to Page #.
Hardware issues can be the root cause of many issues that appear to be
something else.
• Performance Degradation49
• Intermittent Issues
• Unplanned Reboots50
• Job Pauses51
50Unplanned node reboots can often be attributed to boot drive partition issues.
51 Job pauses are often experienced while a failing drive is SmartFailed from the
cluster.
52CELOG collects and logs the events that are detected. The log files are
gathered using the isi diagnostics gather command from the CLI or through the
web administration interface. Many hardware-specific troubleshooting guides and
knowledge base (KB) articles are available to assist with troubleshooting issues.
Hardware status indicators take several different forms, such as cluster status,
individual node status, drive status, the percentage full on boot drive partitions, and
firmware mismatches.
Two primary methods are available to check the hardware status: CLI commands
and WebUI pages. The CLI command options that are displayed are also
represented using the web administration interface.
isi status
The isi status command with the --node or -n option allows greater detail for
individual node status to be displayed.
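For example (the logical node number is illustrative):
# isi status
# isi status -n 1
The first command summarizes the cluster; adding -n limits the output to a single node and provides greater detail for that node.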
Like the isi status command, the Cluster Status provides similar information.
• Manage Nodes54
• Health Status55
• View Details
• SmartFail State56
54The LNN, health status and OneFS version are displayed along with the node
configuration and chassis model number.
56 The status for the node being Smartfailed, if the node is down, and if the node is
in the cluster are displayed for quick issue identification.
isi_hw_status
isi status: You get a high-level view of the cluster health, cluster storage capacity
and usage, individual node health, node storage capacity and usage, and node
throughput. In addition, critical events are listed, the status of any running, paused,
or failed jobs, and a list of most recently run jobs and the status.
Status Check Using WebUI: You can click an individual node to display
information for a specific node, node model, serial number, node size and capacity
usage, throughput, CPU utilization, and network connection information. The root
cause of a node with an attention status can be found using the basic commands.
Often a log file analysis is required to identify the underlying issue.
57 The command only views information from the node you are logged into.
associated with. You can also specify a specific node or all nodes using the --
node-lnn <LNN | all> option.
Drive Firmware 58
58Beginning with OneFS 8.0, drive firmware can be updated without cluster or
node-wide disruption. Use the isi devices drive firmware list command to view
current drive firmware versions and drive models by node, by drive sled for Gen 6
hardware, or for all nodes in the cluster.
The drive information is available using the web administration interface. The
information is similar to the isi devices drive CLI command.
You can SmartFail a drive or easily update the drive firmware for a single drive.
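For example, to review drive state and firmware across the cluster:
# isi devices drive list --node-lnn all
# isi devices drive firmware list --node-lnn all
The first command lists each drive with its bay and state; the second lists the firmware version and model for each drive.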
Out of date or mismatched node firmware can often be the root of intermittent
issues. You can specify a specific node or all nodes in the cluster.
59To display the node firmware from the CLI, use this command for clusters running
OneFS 8.0 or later.
Mismatched firmware61
60 Use this command for clusters with OneFS versions prior to 8.0.
The same node firmware information available using the CLI is also available using
the web administration interface. All firmware for the node hardware is displayed. A
column indicating mismatched firmware is added for quick identification.
You can monitor the status of NVRAM batteries and charging systems. This task
can only be performed using the OneFS CLI and on node hardware that supports
the command.
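For example, on node hardware that supports the command:
# isi batterystatus list
The output shows the status of the NVRAM batteries and charging systems for each supported node.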
The table describes the drive states, the interface where each state is displayed (CLI, WebUI, or both), whether it is an error state, and a description:
• The drive is new and blank. This is the state that a drive is in when you run the isi dev command with the -a add option. (CLI and WebUI)
• The drive was added and contained a PowerScale GUID, but the drive is not from this node. This drive likely will be formatted into the cluster. (CLI and WebUI)
• The drive is undergoing a format operation. The drive state changes to HEALTHY when the format is successful. (CLI only)
• No drive is in this bay. (CLI only)
• Error state: The drive type is wrong for this node. For example, a non-SED drive in a SED node, or SAS instead of the expected SATA drive type. (CLI only)
• Unique to the A100 drive, which has boot drives in its bays. NOTE: In the web administration interface, this state is included in Not available. (CLI only)
• Error state: The drive cannot be acknowledged by the OneFS system. (CLI and WebUI)
• Error state: The drive is ready for removal but needs your attention because the data has not been erased. You can erase the drive manually to guarantee that data is removed. (CLI only)
Running out of available storage capacity can cause severe issues with cluster and
individual node pool behavior and performance.
• Best Practice62
62The best practice for the amount of free space to maintain can vary based on the
size of the cluster or node pool. The best practice is to pay attention when a node
pool reaches 80 percent full. Doing so allows adequate time for additional space
provisioning and installation, and helps ensure that adequate space remains available.
• Small cluster63
• Larger node pools64
• Nodes are in a provisioned state65
• Join fails66
• Monitoring available capacity on a regular basis67
The use of SmartQuotas can help prevent out-of-control users or applications from
consuming large amounts of space without intervention. You should run the
FSAnalyze job on a schedule, which populates InsightIQ for analysis. The ingest
rates and predictive analysis can assist in capacity consumption estimations.
65Whenever nodes are added to the cluster, you should verify that all nodes are in
a provisioned state, and all nodes are added successfully to a node pool.
66It is possible that when attempting to add nodes, the join fails and the node
remains in an unprovisioned state.
The table shows the error messages that you may get when attempting to write to a
full or nearly full cluster or node pool.
To resolve capacity full issues, please follow the steps that are outlined in the
PowerScale customer troubleshooting guide, Troubleshoot a Full Pool or Cluster.
Several steps are available before impacting existing data on the cluster.
Other corrective actions have impact on the cluster data. Some corrective actions
may not be appropriate based on the severity of the issues.
• Move data from the full node pool to a node pool with available capacity. 71
• Attempt to deduplicate data72
69Enable spillover on a full node pool if it has not been enabled to allow writes to
continue on the spillover target node pool.
70Temporarily disable VHS to allow the reserved space to be used by the file
system, while other corrective actions can be taken. Reenable VHS when the
current issue is resolved. VHS provides important required data space safety
margin for the required space when rebuilding data from failed drives, and data
restriping.
71Create and run a SmartPools file pool policy to move data, and manually run the
SmartPools job. This requires a SmartPools license. If necessary, contact
Technical Support for a temporary SmartPools license key.
Snapshot deletion can be the most significant action to clear up space on a full
cluster.
72 The SmartDedupe job requires a long run time to sample and deduplicate data.
This may be impractical based on the urgency required. The job may also have
limited benefit based on the type of data stored.
73Many shadow stores are stale, containing data that has been modified or
deleted.
74 Snapshots retain the data blocks they protect. Depending upon the age and
quantity of the snapshots, a considerable amount of data may be retained on the
cluster.
76 Any data blocks protected by a snapshot remain on the cluster until the snapshot
is removed either by expiration or manually deleted.
77 The data blocks are freed when the SnapshotDelete job runs.
To troubleshoot capacity full issue, use the Troubleshoot A Full Pool Or Cluster
guide.
Tip: Download the Best Practices Guide for Maintaining Enough Free
Space on PowerScale Clusters and Pools. The guide provides
guidance to avoid cluster or node pool full situations, enable safety
measures to mitigate risk (though the white paper is old, it is still
valid).
78Only FlexProtect and FlexProtectLin jobs can run while a cluster is in a degraded
protection state.
79All other jobs are paused and queued until the cluster is returned to a fully
operational protection state.
Challenge
Advanced Access
Module Objectives
Networking Architecture
If required, the CLI command to change or modify the int-b internal network
netmask is: netmask int-b 255.255.255.0. To add an IP range to the int-b
network, run the iprange int-b 192.168.206.21-
192.168.206.30 command. Run the interface int-b enable command to
enable the int-b interface.
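These are subcommands of the isi config console. A minimal sketch of the session, assuming the addresses shown above and that changes are saved with commit:
# isi config
>>> netmask int-b 255.255.255.0
>>> iprange int-b 192.168.206.21-192.168.206.30
>>> interface int-b enable
>>> commit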
Protocols: NFS / CIFS; jumbo frames enabled; mount options used (NFS connectivity only); SmartConnect configuration.
The whole network stack depends upon the foundation being healthy.
Network Troubleshooting
Network design is based on many concepts; the following are considerations and
principles to guide the process:
83Understanding the application data flow from clients to the PowerScale cluster
across the network allows for resources to be allocated accordingly while
minimizing latency and hops along this flow.
84As traffic traverses the different layers of the network, the available bandwidth
should not be significantly different. Compare this available bandwidth with the
workflow requirements.
85 Ensuring latency is minimal from the client endpoints to the PowerScale nodes
maximizes performance and efficiency. Several steps can be taken to minimize
latency, but latency should be considered throughout network design.
Prune VLANs86
VLAN Hopping87
86It is important to limit VLANs to areas where they are applicable. Pruning
unneeded VLANs is also good practice. If unneeded VLANs are trunked further
down the network, this imposes additional strain on endpoints and switches.
Broadcasts are propagated across the VLAN and impact clients.
On the other hand, sufficiently large buffers can artificially delay dropping packets
on congested networks. This matters, because TCP depends upon dropped
packets to help signal when to reduce transmission rates. If the delay between
congestion taking effect and packets dropping is sufficiently long, then TCP
transmission rates can oscillate wildly between unrealistically high speeds, and
unnecessarily modest speeds. This is known as bufferbloat. Again, Wireshark is a
great tool for diagnosing bufferbloat. Capture the packets of a connection
suspected of bufferbloat, then graph the transmission, drop, and retransmission
rates to diagnose it.
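As a sketch, a capture for later analysis in Wireshark can be taken with tcpdump on a node; the interface name and client address are placeholders that vary by node type and environment:
# tcpdump -i mlxen0 -s 0 -w /ifs/data/bufferbloat.pcap host 203.0.113.50
The -s 0 option captures full packets, and writing the file under /ifs makes it easy to copy off the cluster for analysis.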
Link Aggregation
90Per the IEEE specification, gigabit speed is available only in full-duplex and all
types of aggregation are only point to point. Link aggregation provides graceful
recovery from link failures. If a link fails, traffic is automatically sent to the next
available link without disruption.
Link aggregation does not increase the performance of a single session; it provides
only high availability in some scenarios. It is important to recognize that, regarding
bandwidth, link aggregation distributes traffic across links. However, a single session
only uses a single physical link to ensure packets are delivered in order without
duplication of frames.
92Stateful protocols, such as NFSv4 and SMBv2, benefit from link aggregation as a
failover mechanism. In contrast, SMBv3 Multichannel automatically detects
multiple links, using each for maximum throughput and link resilience.
The concepts discussed for single-switch link aggregation still apply to Multi-
Chassis Link Aggregation. Additionally, as the multiple switches form a single
virtual switch, it is important to understand what happens if the switch hosting the
control plane fails. Those effects vary by the vendor’s implementation but will
impact the network redundancy gained through Multi-Chassis Link Aggregation.
You can configure which network interfaces are assigned to an IP address pool. If
you add an aggregated interface to the pool, you cannot individually add any
interfaces that are part of the aggregated interface.
The CLI command isi network pools view <id> is used to display the
configuration details of a specific IP address pool on the cluster.
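For example, using the default pool ID format (the ID shown is illustrative):
# isi network pools list
# isi network pools view groupnet0.subnet0.pool0
The view output includes the member interfaces, IP ranges, and SmartConnect settings for the pool.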
Flow Control
A CLI command can be used to see which interfaces in a cluster have received pause frames.
Pause frames are a signaling device for flow control. What they mean is roughly:
"Slow down! You are transmitting too much, too quickly!" The frames might be sent
or received by the cluster, but in either case they mean that there is a network
performance imbalance. The cluster might be flooding clients that cannot keep up
with it, or the cluster might be overloaded. In either case, network connections are
throttled until all participants can keep pace.
• The OneFS network stack was designed to comply with the IEEE 802.1Q specification.
• Maximum number of VLANs: 409495.
Run the isi network subnets list command to identify the name of the external
subnet you want to modify for VLAN tagging. Then run the isi network subnets
modify command to enable VLAN tagging on that subnet and set the VLAN ID.
94 Not applying a tag to a subnet on a port that is not trunked/tagged will cause
issues with network traffic to/from the PowerScale cluster.
95VLAN tagging requires a VLAN ID that corresponds to the ID number for the
VLAN set on the switch.
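A minimal sketch, assuming a subnet named groupnet0.subnet0 and VLAN ID 200 (both values are examples):
# isi network subnets list
# isi network subnets modify groupnet0.subnet0 --vlan-enabled=true --vlan-id=200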
Routing
If the only systems that the cluster ever has to talk to are on networks to which the
cluster is directly connected, there is not a problem. The issue occurs when there
are multiple foreign networks96. Because there can only be one default gateway on
any node at a given time, the default gateway mechanism is insufficient to allow the
cluster to correctly route packets.
96Networks to which the cluster does not have a direct connection, and that are
only reachable through a gateway/router. If the number of foreign networks is small
and relatively static, then defining static routes provides an effective workaround.
IPv6 is a special case because of how IPv6 is supposed to handle complex routing,
and currently OneFS does not do SBR for IPv6. If working with a complex IPv6
network, plan to put more of the burden on the network infrastructure, and ask the
network administrator which flexible routing options are available.
• To enable source-based routing: on 7.2.x, isi networks sbr enable; on 8.x
and later, isi network external modify --sbr=true
• To disable source-based routing: on 7.2.x, isi networks sbr disable; on
8.x and later, isi network external modify --sbr=false (a verification example follows this list)
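On 8.x and later, the current setting can be checked from the CLI; whether the field appears as SBR or Source Based Routing in the output may vary by release:
# isi network external view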
UNIX, IPv4, and Routing: One might have expected the cluster to run down the
list of available routes, in so-called waterfall fashion, and use the first one that
matches, but instead the cluster orders all matching routes in order of specificity
and uses the most specific one. This information is important if trying to reconstruct
what the cluster is actually doing in routing terms. It is also important to remember
that some UNIX hosts evaluate routes in waterfall fashion, and administrators
should understand the difference between the way that the cluster operates and
the way other hosts operate.
97The process of adding ipfw rules is stateless and essentially translates to per-
subnet default routes without manual intervention.
Source - Based Routing: When SBR was originally implemented, it overrode any
existing static routes. This turned out to be a problem, because in some
environments people wanted to have particular static routes with flexible SBR. In
recent versions of OneFS, this has been corrected. Because of the way that similar
features have been implemented on other well-known platforms (e.g. NetApp,
CLARiiON, VNX), there is widespread misunderstanding of what was implemented
on OneFS. Most of the other platforms use stateful packet inspection, so that the
network stack tracks connected flows, which enable the network stack to send a
reply to a packet back to the same gateway that sent the packet to the PowerScale
node. This is NOT what SBR does. SBR enables sending a packet through the
same interface on which it arrived. Instead of relying on the destination IP, SBR
creates dynamic forwarding rules using the IP address of sender and the subnet
that the packet arrives on. It then creates a reverse rule so that packets going to
that IP address will always be forwarded to the default gateway for that subnet.
1: If SBR is not configured, the highest-priority gateway, that is, the reachable
gateway with the lowest priority value, is used as the default route. Once SBR is
enabled, when traffic arrives from a subnet that is not reachable via the default
gateway, firewall rules are added. As OneFS is FreeBSD based, these are added
through ipfw.
SBR is entirely dependent on the source IP address98 that is sending traffic to the
cluster. SBR creates per-subnet default routes in the following steps:
The table shows the common tools that are used for network troubleshooting.
98The ipfw rule is created only when a session is initiated from the source subnet;
otherwise, the rule is not created. If the cluster has not received traffic that
originated from a subnet that is not reachable via the default gateway, OneFS
transmits traffic it originates through the default gateway.
99The default gateway is the path for all traffic intended for clients that are not on
the local subnet and not covered by a routing table entry. Utilizing SBR does not
negate the requirement for a default gateway, as SBR in effect overrides the
default gateway, but not static routes.
100 Static routes are an option when the cluster originates the traffic, and the route
is not accessible via the default gateway. As mentioned above, static routes are
prioritized over source-based routing rules.
Access Zones
Access zones allow the cluster to be divided securely to accomplish various tasks
required for multitenancy.
• Administrators can control who connects to the cluster and what nodes they
connect to based on the hostname when using access zones with the
SmartConnect pools.
• They allow authentication providers to be separated, or multiple authentication providers of the same type to be used in the same cluster.
• Administrators can define user-mapping rules to manipulate identities provided
for various user tokens.
• A base directory must be defined when configuring an access zone. Defining a
base directory allows administrators to isolate directories for security and
compliance requirements.
When a client connects to the cluster, it must undergo several checks to ensure that it is allowed to access the areas it is trying to reach.
• Which authentication providers are in the access zone associated with that
SmartConnect zone name?
• Do they have an account in at least one of those authentication providers?
• What other identifiers are needed for their access token?
• Do they have permission to connect to the share or export?
• If sharing base directories between access zones, ensure multiple LDAP or NIS
providers use unique UID or GID ranges.
• Do not nest base directories inside the base directory of another zone. Avoid
sharing base directories between access zones.
• Only assign one authentication provider of each type to each access zone.
• Avoid overlapping UID or GID ranges for authentication providers in the same
access zone.
If combining multiple authentication providers with UID or GID values that are
assigned to users and groups, ensure that the various authentication providers do
not use overlapping ranges. Two users in different authentication providers with the
same UID, sharing a base directory, are treated as the same user when accessing
files. If possible, avoid putting multiple LDAP or NIS providers in different access
zones that share a base directory.
Another consideration is that you should not nest base directories inside the base
directory of another zone. There is an exception for the System zone as its base
directory is /ifs and all other access zones are nested within that base directory.
If both LDAP and NIS are configured within the same zone, be certain that the two providers do not use overlapping UID or GID ranges. If they do, the affected users are treated as the same user within the base directory that is defined for that access zone, which can lead to unintended permissions being granted or denied. The best practice is to use only one LDAP or one NIS provider within each access zone, or between access zones that share a base directory. There is no guarantee that the identifiers each provider uses are globally unique.
The ability for access zones to have overlapping paths has changed over OneFS
releases.
• In OneFS 7.0, overlapping paths worked when access zones were created, but this behavior was unintended and never supported.
• Starting in OneFS 7.1.1, the loophole was closed: overlapping paths stopped working, and upgrades forced administrators to fix their access zones to comply with the new rules.
• Starting in 8.0, the functionality was restored and is officially supported.
• Example101
• You should create access zones on a SyncIQ secondary cluster that is used for backup and disaster recovery, with some limitations.
• System configuration settings, such as access zones, are not replicated to the secondary cluster.
101 There are valid use cases for overlapping paths between access zones, but care must be taken due to permission clashes. For example, if the two zones authenticate to two distinct untrusted domains, file permissions may be troublesome. A warning is issued when a storage admin creates an access zone that has a path that overlaps with another zone.
On the CLI, it is assumed that commands are run in the System zone. For isi commands where access zones matter (example: isi nfs exports), a --zone flag allows the admin to specify a different zone. But the traditional FreeBSD commands do not have the --zone flag and, therefore, execute in the System zone. This can lead to less clear output, such as raw SIDs in the ls -led command. This is because the AD domain of the System zone may not be able to resolve the SIDs.
• BSD CLI commands are not zone aware but are often useful.
− They run in the System zone.
− They show numeric-only permission values, which makes results harder to interpret.
− OneFS is modified from the parent BSD codebase.
• isi_run can context switch these commands to give zone-aware responses.
Using isi_run to add zone sensitivity to a BSD CLI command requires the zone ID, which can be found by running isi zone zones list. This displays a listing of zones and their zone IDs. With the correct zone ID, use isi_run to derive detailed information from the CLI. Administrators can tell, for example, that a user's SID in one access zone is equivalent to a different user's SID in another access zone, and thus avoid overlapping the zones and the resulting access conflicts. isi_run is also useful for creating a zone context for a series of commands. This and other options are detailed in the man page.
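A minimal sketch of the workflow (the zone ID and path are placeholders):
isi zone zones list
isi_run -z 2 ls -led /ifs/marketing/file1
The first command lists the zones and their IDs; the second runs the BSD command in the context of zone ID 2 so that identities resolve against that zone's providers.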
Quality of Service
You can set upper bounds on quality of service102 by assigning specific physical
resources to each access zone.
The list describes a few ways to partition resources to improve quality of service:
• NICs103
• SmartPools104
• SmartQuotas105
103 You can assign specific NICs on specific nodes to an IP address pool that is associated with an access zone. By assigning these NICs, you can determine the nodes and interfaces that are associated with an access zone. This enables the separation of CPU, memory, and network bandwidth.
Access zones carve out access to a PowerScale cluster, creating boundaries for multitenancy or multiprotocol access. They permit or deny access to areas of the cluster. Authentication providers are also provisioned at the access zone level.
System Zone
Moving client traffic to Access Zones ensures that the System Zone is only used for management and accessed by administrators. Access Zones provide greater security because administration and file access are limited to a subset of the cluster rather than the entire cluster.
104SmartPools are separated into multiple tiers of high, medium, and low
performance. The data written to a SmartPool is written only to the disks in the
nodes of that pool.
Associating an IP address pool with only the nodes of a single SmartPool enables
partitioning of disk I/O resources.
105 Through SmartQuotas, you can limit disk capacity by a user or a group or in a
directory. By applying a quota to the base directory of an access zone, you can
limit disk capacity that is used in that access zone.
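For example, a directory quota on an access zone's base directory caps the capacity that the zone can consume (a minimal sketch; the path and threshold are placeholders):
isi quota quotas create /ifs/sales directory --hard-threshold=100G --container=true
The --container option makes the quota limit appear as the available file system size to clients of that zone.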
Root-Based Path
The table shows a few important limitations when using access zones. We do not
support different access zones using the same IP ranges. Some service providers
give out the same private subnet to multiple customers. OneFS does not support
this. It would be possible to work around this at the networking layer with a NAT
facility, but the cluster itself does not implement a NAT107 facility. Instead, the IP
ranges themselves differentiate access zones from each other, from the cluster's
point of view.
While the major protocols are supported in nonsystem zones, some of the minor ones, such as FTP, HTTP, and RAN (RESTful Access to Namespace), are not. This means that if the scenario relies upon different groups having differentiated FTP access, use FTP's authentication and control techniques to preserve customer security. The same applies to HTTP, which may be important if using the cluster as the back end of a website. All administrative functions, including the CLI, WebUI, and PAPI, work only in the System zone. This implies that RBAC administrative users work only in the System zone as well. The only exception is limited MMC support, which is zone aware. If MMC is a particular part of the Microsoft networking context, plan to limit access to that facility on a per-domain basis. While there are no hard limits in the code, support recommends not exceeding 50 zones with 50 AD domains per cluster starting with OneFS 8.0. In prior versions, the limits are 20 zones and five AD domains.
DNS
DNS on a PowerScale cluster serves two functions: DNS client and DNS server.
108 DNS serves the cluster with names and numbers for various reasons (most notably authentication), and this means that the cluster is acting as a DNS client.
109 The cluster itself serves DNS information to inbound queries (in the service of SmartConnect) and as such acts as a DNS server.
DNS serves the cluster for name resolution and also serves cluster clients for load balancing and access purposes, so a misconfiguration on either side can disrupt cluster functions. Always double-check DNS function, and if something seems to be misbehaving, get the network team involved in your solution.
110 This is not a full-fledged DNS server and cannot forward a query to another DNS server when it does not recognize the request. So, it is best used simply as an authoritative source for the DNS zone that is assigned to the SmartConnect zone, and that zone should be configured as a delegation in the customer DNS server.
The diagram shows that DNS settings are defined at the groupnet level, that subnets are defined under the groupnet, and that each subnet has its own SSIP.
Under each subnet, pools are defined, and each pool will have a unique
SmartConnect Zone Name. It is important to recognize that multiple pools lead to
multiple SmartConnect Zones using a single SSIP.
DNS lookups of SmartConnect zone names involve four separate DNS operations.
1: The client sends a DNS query for the SmartConnect zone name (for example, a name under example.domain.com) to the site DNS server.
2: The site DNS server has a delegation record for example.domain.com. It sends a DNS request to the nameserver address defined in the delegation record, which is the SmartConnect Service IP (SSIP).
3: The cluster node hosting the SmartConnect Service IP (SSIP) for this zone receives the request. The cluster determines the IP address to assign based on the configured connection policy for the pool in question (such as round robin) and then sends a DNS response packet to the site DNS server.
4: The site DNS server sends the response back to the client.
SmartConnect Features
The following are descriptions of the SmartConnect features mentioned in the table:
Operation Command
Client access protocols on PowerScale can be divided into the following categories.
• Stateful: The client/server relationship usually has a session state112 for each
open file.
• Stateless: Stateless protocols generally tolerate failover without session state information being maintained (except for locks).
112Failing over IP addresses to other nodes for these types of workflows means
that the client assumes that the session state information was carried over. Session
state information for each file is not shared among PowerScale cluster nodes.
SmartConnect SSIP
An SSIP is assigned to the node with the lowest device ID113 in the subnet that contains the SSIP, provided the node has an active interface in one of the IP pools that the SSIP services.
• The SmartConnect DNS service must be active on only one node at any time114, per subnet.
• For example, the SmartConnect service continues to run throughout the process as the existing nodes are refreshed115.
113 The device ID is the Node ID given to a node; it is different from the LNN.
114The SmartConnect Service IP resides on the node with the lowest node ID that
has an interface in the given subnet. SSIP does not necessarily reside on the node
with the lowest Logical Node Number (LNN) in the cluster.
115Suppose that an existing four-node cluster is refreshed with four new nodes.
Assume that the cluster has only one configured subnet, all the nodes are on the
network, and that there are sufficient IP addresses to handle the refresh. The first
step in the cluster refresh is to add the new nodes with the existing nodes,
temporarily creating an eight-node cluster. Next, the original four nodes are
SmartFailed. The cluster is then composed of the four new nodes with the original
dataset.
The SmartConnect service always runs on the node with the lowest node ID; Node ID 1 maps to LNN 1.
LNN   Node ID   Node Name       Status
1     1         Clustername-1   Original
2     2         Clustername-2   Original
3     3         Clustername-3   Original
4     4         Clustername-4   Original
5     5         Clustername-5   New
6     6         Clustername-6   New
7     7         Clustername-7   New
8     8         Clustername-8   New
The original nodes are removed using SmartFail. At this point, Node ID 5 maps to LNN 1.
LNN   Node ID   Node Name       Status
1     5         Clustername-5   New
2     6         Clustername-6   New
3     7         Clustername-7   New
4     8         Clustername-8   New
The Node IDs and LNNs remain the same but now map to different node names. Node ID 5 still maps to LNN 1.
LNN   Node ID   Node Name       Status
1     5         Clustername-1   New
2     6         Clustername-2   New
3     7         Clustername-3   New
4     8         Clustername-4   New
The isi_dnsiq_d daemon on that node is considered the primary for the IP pools that the SSIP services. The primary node is responsible for responding to DNS requests on that SSIP and for deciding about moving IP addresses in any dynamic IP pools that the SSIP services. Multiple subnets in a given groupnet can be assigned a specific SSIP to respond to all queries into the SmartConnect zone for host resolution. This requires selecting the wanted --sc-subnet when creating the pool for the wanted --sc-dns-zone. This can come in handy when a customer has a large network in which, due to restrictions, not all the network segments connected to the PowerScale cluster have connectivity to the DNS servers. Should the node servicing the SSIP go down, or its interfaces become inactive, it stops servicing the SSIP. The isi_dnsiq_d daemon on the next lowest node, by device ID, receives an update and becomes the primary for the SSIP and the IP pools that it services. Each cluster needs at least one SSIP, and no firewalls between the infrastructure DNS servers and the SSIP may block TCP and UDP port 53.
116 OneFS creates a file containing SSIP information, the SSIP Resource File. To host an SSIP, a node must hold a lock on this file. All the nodes that are ready to host an SSIP attempt to lock the SSIP Resource File. The first nodes to get the lock host the SSIP.
117In certain scenarios, a node may host more than a single SSIP, depending on
the number of nodes and SSIPs in the subnet. The new process ensures that the
node assignment is based on a lock to nodes within the subnet, avoiding the issues
from previous releases. Once the node is offline, or the interface goes down, the
SSIP becomes available for lock again. The next quickest node to capture the lock
hosts the SSIP. OneFS ensures that SSIPs are as evenly distributed as possible
within a subnet, using a feature to limit a single node from hosting multiple SSIPs.
SmartConnect Multi-SSIP
The addition of more than a single SSIP provides fault tolerance and a failover
mechanism. Multi SSIP ensures the SmartConnect service continues to load
balance clients according to the selected policy.
119 SmartConnect Basic allows 2 SSIPs per subnet while SmartConnect Advanced
allows 6 SSIPs per subnet.
120Although the additional SSIPs are in place for failover, the SSIPs configured are
active and respond to DNS server requests. Multi-SSIP configuration is Active-
Passive, where each node hosting an SSIP is independent and ready to respond to
DNS server requests, irrespective of the previous SSIP failing. SmartConnect
continues to function correctly if the DNS server contacted the other SSIPs,
providing SSIP fault tolerance.
121 The failover SSIP is unaware of the state of the load-balancing policy and starts the load-balancing policy again from the first option.
At step 2, the site DNS server sends a DNS request to the SSIP and then awaits a response in step 3 containing a node IP address that is based on the client connection policy. If, for any reason, the response in step 3 is not received within the timeout window, the connection times out and the DNS server tries the second SSIP, again awaiting a response in step 3. After another timeout window, the DNS server continues cycling through subsequent SSIPs, up to the sixth SSIP with SmartConnect Advanced, if a response is not received after a request is sent to each SSIP.
Note: Do not configure the site DNS server to load balance the
SSIPs. Each additional SSIP is only a failover mechanism, providing
fault tolerance and SSIP failover. Allow OneFS to perform load
balancing through the selected SmartConnect policy, ensuring
effective load balancing.
Configure Multi-SSIP
• To identify the name of the external subnet you want to configure with Multi-
SSIP, run the isi network subnets list command.
• Run the isi network subnets modify command with the --sc-
service-addrs option, specifying an IP address range, in the following
format:
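A minimal example (the subnet ID and IP range below are placeholders; the addresses must come from the subnet's address space):
isi network subnets modify groupnet0.subnet0 --sc-service-addrs=192.168.25.10-192.168.25.12
On releases that support it, the --add-sc-service-addrs option appends to the existing SSIP list instead of replacing it.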
You can assign DNS servers to a groupnet and modify DNS settings that specify
DNS server behavior.
DNS Search Suffixes: Sets the list of DNS search suffixes. Suffixes are appended to domain names that are not fully qualified. You cannot specify more than six suffixes.
Enable DNS resolver rotate: Sets the DNS resolver to rotate, or round-robin, across DNS servers.
Enable DNS cache: Specifies whether DNS caching for the groupnet is enabled.
You can set a DNS zone name in SmartConnect to be a short name, for example, isicl1 instead of isicl1.example.com. To set a short zone name, you must ensure that server-side DNS search is enabled (the default configuration) and that a DNS search list is specified. This allows a name that is not an FQDN to resolve to the SmartConnect zone through any of the assigned search domains.
122Short names have no big technical purpose other than user convenience. They
function by a defaulting process that is a normal function in DNS, and these
instructions set it up correctly.
123 To check for a discrepancy, observe the actual response from SmartConnect using dig. If the status is REFUSED, that usually means that there is a discrepancy between the name you are asking for and the sc-dns-zone configured on the cluster.
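A minimal query (the SSIP and zone name below are placeholders):
dig @192.168.25.10 sales.cluster.example.com
A healthy response returns status NOERROR with a single A record chosen by the connection policy; status REFUSED points to a mismatch with the configured sc-dns-zone.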
Why is SmartConnect not returning "no error" for type ANY queries?
• SmartConnect does not currently support ANY queries.
Load Balancing
The load-balancing policies are Round Robin (the default), Connection Count, CPU Utilization, and Network Throughput. If new connections occur slowly (for example, one every 10 seconds), all policies tend to look like round robin.
Cbind handles OneFS client-side DNS. Cbind is the distributed BIND cache daemon on OneFS. The primary purpose of cbind is to speed up DNS lookups on the cluster, in particular for NFS workloads, which can involve a large number of DNS lookups, especially with netgroups.
2: The design of the cache is to distribute the cache and DNS workload among the nodes of the cluster. Cbind supports caching AAAA queries. Without this support, cbind caches only A records and silently drops AAAA records, which results in a five-second delay as the client falls back to the next DNS server in the resolver configuration.
3: To include the concept of tenancy, the cbind daemon supports multiple DNS caches. Each tenant that refers to the cache has its own cache within cbind, independent of other tenants.
4: To support different DNS caches for multiple groupnets, the cbind interface was changed to have multiple client interfaces that separate DNS requests from different groupnets.
6: Post OneFS 8.0, the service command to enable or disable the cbind daemon has been removed. The only way to enable or disable the DNS cache on OneFS 8.0 and later is through the isi network command; the cache is enabled or disabled per groupnet.
Each groupnet has its own cbind cache instance, and there are two commands that flush the cache.
• isi network dnscache flush command - flushes the dnscache for all
groupnets.
• isi_cbind flush groupnet <groupnet-name> command - flushes the
dnscache for a specific groupnet.
The isi_cbind show cluster command is used to check metrics. Note: This is for OneFS DNS caching; the IP addresses referenced are internal interfaces and are not used for external access.
Cache flushing is not something that should be done regularly for no reason, as it interferes with performance. It can be necessary under certain conditions to flush stale cache entries. Typical cases are after large network changes, whether they are part of normal corporate functions or the result of disaster recovery activities or other big moves. Alternatively, cache flushing can be done to help debug the DNS environment and the PowerScale cluster's relationship to that environment. If you suspect that there is a problem with name resolution, flushing the cache for a certain groupnet and then reexamining normal operations as the cache is reestablished can demonstrate how the DNS infrastructure is operating.
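A minimal debugging sequence (the groupnet name is a placeholder):
isi_cbind show cluster
isi_cbind flush groupnet groupnet0
isi_cbind show cluster
Comparing the cache statistics before and after the flush shows whether entries are being repopulated from the DNS servers as expected.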
You can set DNS cache settings for the external network.
• To flush DNS cache, from the Actions area, click Flush DNS Cache and
Confirm.
• To modify the DNS cache settings, enter the required limits and click Save
Changes.
Challenge
Lab Assignment:
1) Analyze DNS packets between a source and destination.
2) View and analyze SmartConnect configuration.
3) Troubleshoot DNS issues to redirect clients to the right access zone.
Authentication Providers
5: The local provider provides authentication and lookup facilities for user accounts added by an administrator. Local authentication is useful when Active Directory, LDAP, or NIS directory services are not configured or when a specific user or application needs access to the cluster.
Most providers use UIDs (user IDs), GIDs (group IDs), and SIDs (security IDs). A major consideration in a multiprotocol environment is ensuring that users can
access their files regardless of the protocol they use. There are several ways to
address multiprotocol access. The first is that Active Directory supports RFC 2307,
which allows adding UNIX attributes to domain accounts. Other ways to map the
IDs together are discussed later in this topic.
UNIX-centric user and group properties: Login shell, home directory, UID, and GID. Missing information is supplemented by configuration templates or additional authentication providers.
The diagram shows Active Directory trust relationships: the direction of access runs from the trusted (account) domain to the trusting (resource) domain, while the direction of trust runs in the opposite direction. The cluster should belong to only one AD domain within a forest, domains in a forest automatically trust each other, and users from trusted domains can access the cluster.
OneFS uses access zones127 to partition a cluster into multiple virtual containers.
Verify the authentication providers by using the command isi auth status. A
status of online means that the cluster and providers can reach each other. Use
the isi auth refresh command to refresh the status of authentication
providers.
126 For this reason, a cluster needs to belong to only one Active Directory domain within a forest or among any trusted domains. A cluster should belong to more than one AD domain only to grant cluster access to users from multiple untrusted domains.
127 Access zones support configuration settings for authentication and identity management services. Access zones are discussed shortly.
Definition
The Network Information Service (NIS) provides authentication and identity uniformity across local area networks.
OneFS includes an NIS authentication provider that enables you to integrate the cluster with your NIS infrastructure.
• NIS can authenticate users and groups when they access the cluster.
• The NIS provider exposes the passwd, group, and netgroup maps from an NIS
server.
• Hostname lookups are also supported.
• You can specify multiple servers for redundancy and load balancing.
Decision point: Are NIS and NIS+ the same? Does OneFS support NIS+?
Each NIS provider must be associated with a groupnet. The groupnet is a top-level
networking container that manages hostname resolution against DNS nameservers
and contains subnets and IP address pools. The groupnet specifies which
networking properties the NIS provider will use when communicating with external
servers. The groupnet associated with the NIS provider cannot be changed.
Instead you must delete the NIS provider and create it again with the new groupnet
association.
You can add an NIS provider to an access zone as an authentication method for
clients connecting through the access zone. An access zone may include at most
one NIS provider. The access zone and the NIS provider must reference the same
groupnet. You can discontinue authentication through an NIS provider by removing
the provider from associated access zones. NIS is different from NIS+, which
OneFS does not support.
NIS Configuration
You can view, configure, and modify NIS providers or delete providers that are no
longer needed. You can discontinue authentication through an NIS provider by
removing it from all access zones that are using it. By default, when you configure
an NIS provider it is automatically added to the System access zone.
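A minimal configuration sketch (the provider name, server address, NIS domain, groupnet, and zone names are placeholders):
isi auth nis create nis1 --servers=10.1.1.5 --nis-domain=example.com --groupnet=groupnet0
isi zone zones modify zoneB --add-auth-providers=lsa-nis-provider:nis1
The first command creates the provider against a groupnet; the second adds it to a non-System access zone.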
Decision point: How do you have cluster resolve the error message
that indicates that the client is unable to reach the NIS servers?
Definition
The local provider provides authentication and lookup facilities for user accounts added by an administrator.
A use case for local providers may be an organization with no networked providers (a dark site) that needs separate authentication and access to the cluster. For example, one group requires access to high-performance nodes while another group accesses the utility nodes.
Use Case:
In a PowerScale environment, 5000 Active Directory users access shares in the access zone, and 10 Linux users also access data in the access zone.
You don’t want the administrator adding LDAP to the access zone only to
authenticate 10 users.
What if the LDAP provider has 5000 users and not all of them are the same as the 5000 AD users? Adding the LDAP provider to the access zone can become a very serious issue because OneFS automatically maps the users. A user "John" in LDAP might not be the same "John" in AD, but OneFS sees them as the same "John," and the token shows this. So the LDAP John one day discovers that there are some useful files in a directory. He mounts the directory and discovers he owns all kinds of files and subdirectories. He does not know what any of it is and deletes it all. AD John logs in to find all his files gone. Now there is confusion between the two Johns: one John keeps recovering files while the other keeps deleting them. IT tickets are generated, the Linux admin does not know why, and the AD admin does not know the reason for the strange behavior.
The solution can be to have the storage admin remove LDAP from the access zone, flush and refresh the tokens, and add only the needed Linux users to the local provider of the access zone.
When you create an access zone, each zone includes a local provider that allows
you to create and manage local users and groups.
Although you can view the users and groups of any authentication provider, you
can create, modify, and delete users and groups in the local provider only.
You can change the groups that are automatically added to the zone's local provider in order to grant or deny access if the default configuration does not meet the company's security profile.
Definition
A file provider enables you to supply an authoritative third-party source of user and group information to an Isilon cluster.
You can configure one or more file providers, each with its own combination of
replacement files, for each access zone.
Each file provider pulls directly from up to three replacement database files: a
group file that has the same format as /etc/group; a netgroups file; and a binary
password file, spwd.db, which provides fast access to the data in a file that has
the /etc/master.passwd format.
• Password database files, which are also called user database files, must be in
binary format.
• You must copy the replacement files to the cluster and reference them by their
directory path.
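A minimal sketch (the provider name and file paths are placeholders; the binary password file is typically generated from a master.passwd-format file with pwd_mkdb before being copied to the cluster):
isi auth file create fileprov1 --password-file=/ifs/local/spwd.db --group-file=/ifs/local/group
The provider can then be added to the wanted access zone like any other authentication provider.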
Kerberos Overview
Definition
Kerberos is a network authentication provider that negotiates encryption tickets for securing a connection. OneFS supports
Microsoft Kerberos and MIT Kerberos authentication providers on a cluster.
Kerberos Configuration
131 When a user authenticates with an MIT Kerberos provider within a realm, an
encrypted ticket with the user service principal name is created. The ticket is
validated to securely pass the identification of user for the requested service. Each
MIT Kerberos provider must be associated with a groupnet.
Decision point: How do I resolve issues with the error, Failed to join
realm: (LW_ERROR_DOMAIN_IS_OFFLINE) The domain is offline.
Link: See Troubleshoot Kerberos Issues on your Isilon Cluster guide
to resolve the error.
You can configure an MIT Kerberos provider for authentication without Active
Directory. Configuring an MIT Kerberos provider involves creating an MIT Kerberos
realm, creating a provider, and joining a predefined realm. Optionally, you can
configure an MIT Kerberos domain for the provider. You can also update the
encryption keys if there are any configuration changes to the Kerberos provider.
You can include the provider in one or more access zones.
Verify that the cluster can reach the required ports on the KDC or domain controller (substitute its address):
• nc -z <KDC address> 88
• nc -z <KDC address> 389
• nc -z <KDC address> 445
• nc -z <KDC address> 464
Protocols
133 Allows Microsoft Windows and macOS clients to access files that are stored on the cluster.
134 Allows Linux and UNIX clients that adhere to the RFC 1813 (NFSv3) and RFC 3530 (NFSv4) specifications to access files that are stored on the cluster.
136Allows clients to access files that are stored on the cluster through a web
browser.
137Allows any client that is equipped with an FTP client program to access files that
are stored on the cluster through the FTP protocol.
• You can set Windows-based and UNIX-based permissions on OneFS files and
directories.
• With the required permissions and administrative privileges, you can create,
modify, and read data on the cluster through one or more of the supported file
sharing protocols.
• By default, all file sharing protocols are disabled.
Configuring access to the cluster through SMB shares involves setting share
permissions.
Share permissions have only three settings for users or groups: Full control, Read-
write, or Read-only.
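For example, share-level permissions can be adjusted from the CLI (a minimal sketch; the share and zone names are placeholders):
isi smb shares permission modify marketing --wellknown=Everyone --permission-type=allow --permission=read --zone=zoneB
This sets the Everyone well-known account to Read-only on the share; file and directory permissions still apply on top of the share permission.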
SMBv3 Encryption
• OneFS 8.1.1 and above supports SMBv3 encryption to secure access to data
over untrusted networks by providing on-wire encryption141 between the client
and PowerScale cluster.
• SMB encryption can be used by any clients142 which support SMBv3.
138 If a user is a member of multiple groups with different levels of permissions, the
permissions are added to give the user more permission. For example, user JaneD
is a member of Domain Admins and Domain Users. Domain Admins is given
permission of Full Control in the share, and Domain Users, is given Read-write
permission in the share. JaneD is granted Full Control in the share.
139 Another rule to remember is that a deny overrides an allow if ordered correctly
at the top of the permissions list.
140 If the deny is ordered after an allow permission, the cluster does not enforce the deny.
141 Prevents an attacker from tampering with any data packet in transit without needing any extra infrastructure.
142 Eligible clients include Windows Server 2012, 2012 R2, 2016, Windows 8, and Windows 10.
A Windows 7 client connection is rejected because the client lacks SMB encryption support.
Windows 10 client data access is encrypted because the client supports SMBv3 encryption.
On the PowerScale side, encryption and decryption happen at the kernel level, using Intel CPU extensions for hardware acceleration to gain a performance benefit on next-generation PowerScale clusters. Encryption can be managed at the global, access zone, and individual share level on PowerScale:
• At the global level, on-wire data between clients and the PowerScale cluster is encrypted after authentication.
• At the access zone level, on-wire data between clients and the access zone is encrypted after authentication.
• At the share level, on-wire data between clients and the share is encrypted once clients have access to the share.
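A minimal sketch of enabling encryption at each level (the share and zone names are placeholders, and the option names are assumptions based on the SMBv3 encryption settings introduced in OneFS 8.1.1; confirm them in the PowerScale OneFS CLI Command Reference guide):
isi smb settings global modify --support-smb3-encryption=yes
isi smb settings zone modify --zone=zoneB --smb3-encryption-enabled=true
isi smb shares modify marketing --zone=zoneB --smb3-encryption-enabled=true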
SMB Multichannel
144OneFS can transmit more data to a client through multiple connections over
high-speed network adapters or over multiple network adapters.
148 Supported clients are Windows Server 2012 and 2012 R2, and Windows 8 and 8.1 clients.
149 SMB Multichannel cannot share the load between PowerScale nodes.
• SMB CA can be enabled for SMB 3.0 capable Windows clients150 in OneFS 8.0
and later.
150 CA is supported with Microsoft Windows 8, Windows 10, and Windows 2012 R2
clients.
− None153
− Write-read coherent154
− Full155
151 SMB3 uses persistent handles to provide CA by mirroring the file state across
all nodes. CA timeout value specifies the amount of time you want a persistent
handle to be retained after a client is disconnected or a server fails. The default is 2
minutes.
152 When enabled, prevents a client from opening a file if another client has an open but disconnected persistent handle for that file. When disabled, OneFS issues persistent handles, but discards them if any client other than the original opener tries to access the file. Strict lockout is enabled by default.
153Continuously available writes are not handled differently than other writes to the
cluster. If you specify none and a node fails, you may experience data loss without
notification. This setting is not recommended.
154 Writes to the share are moved to persistent storage before a success message
is returned to the SMB client that sent the data. This is the default setting.
It is recommended that you configure advanced SMB share settings156 only if you
have a solid understanding of the SMB protocol.
For the complete list of advanced options, view the PowerScale OneFS CLI
Command Reference guide.
Create Permission: Sets the default source permissions to apply when a file or directory is created. The default value is Default acl.
Directory Create Mask: Specifies UNIX mode bits that are removed when a directory is created, restricting permissions. Mask bits are applied before mode bits are applied. The default value is that the user has Read, Write, and Execute permissions.
Directory Create Mode: Specifies UNIX mode bits that are added when a directory is created, enabling permissions. Mode bits are applied after mask bits are applied. The default value is None.
155 Writes to the share are moved to persistent storage before a success message
is returned to the SMB client that sent the data, and prevents OneFS from granting
SMB clients write-caching and handle-caching leases.
156The advanced settings affect the behavior of the SMB service. Changes to
these settings can affect all current and future SMB shares.
File Create Mask: Specifies UNIX mode bits that are removed when a file is created, restricting permissions. Mask bits are applied before mode bits are applied. The default value is that the user has Read, Write, and Execute permissions.
File Create Mode: Specifies UNIX mode bits that are added when a file is created, enabling permissions. Mode bits are applied after mask bits are applied. The default value is that the user has Execute permissions.
Impersonate User: Allows all file access to be performed as a specific user. This must be a fully qualified username. The default value is No value.
NFS Aliases
• NFS aliases157 provide shortcuts for directory path names in OneFS. If those
path names are defined as NFS exports, NFS clients can specify the aliases as
NFS mount points.
• Each alias must point to a valid path on the file system158.
• Aliases and exports are completely independent159.
• NFS aliases are zone-aware160.
• WebUI: Navigate to Protocols > UNIX Sharing (NFS) > NFS Aliases.
• CLI command: isi nfs aliases create/modify/delete
• Example: An alias named /engineering-gen maps to /ifs/div-gen/engineering/general-purpose. An NFS client could mount that directory through either the full path or the alias.
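A minimal sketch of creating the alias from the example above (the access zone name is a placeholder):
isi nfs aliases create /engineering-gen /ifs/div-gen/engineering/general-purpose --zone=div-gen
isi nfs aliases list --zone=div-gen
The list command confirms the alias and the path it resolves to.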
157 NFS aliases are designed to give functional parity with SMB share names within
the context of NFS. Each alias maps a unique name to a path on the file system. It
is useful for long path names.
158While this path is absolute, it must point to a location beneath the zone root (/ifs
on the System zone). If the alias points to a path that does not exist on the file
system, any client trying to mount the alias would be denied in the same way as
attempting to mount an invalid full pathname.
159 You can create an alias without associating it with an NFS export. Similarly, an NFS export does not require an alias. As a best practice, it is recommended to use NFS aliases for long directory path names.
160 By default, an alias applies to the client's current access zone. To change this,
you can specify an alternative access zone as part of creating or modifying an
alias. Each alias can only be used by clients on that zone, and can only apply to
paths below the zone root. Alias names are unique per zone, but the same name
can be used in different zones—for example, /home.
• Root squash allows all users with a UID of 0 to be given a different UID while connected to that export.
• The default UID given is 65534, which has a name of nobody or nfsnobody.
• The root-squashing rule prevents root users on NFS clients from exercising root
privileges on the NFS server.
• The exact user that a root UID is mapped to can be changed per export.
• CLI: isi nfs exports modify 1 --map-root-enabled true --map-
root nobody
• Set up an external firewall with appropriate rules and policies to allow only
trusted clients and servers to access the cluster.
• Allow restricted access only to ports that are required for communication161 and
block access to all other ports on the cluster.
• Configure one or more security types: UNIX (system)162, Kerberos5, Kerberos5
Integrity, Kerberos5 Privacy163.
• Limit root access to the cluster to trusted host IP addresses.
• Ensure all new devices added to the network are trusted164.
161 Ports: 2049 for NFS, 300 for the NFSv3 mount service, 302 for NFSv3 NSM, 304 for NFSv3 NLM, and 111 for the ONC RPC portmapper.
162 The default security flavor (UNIX) relies upon having a trusted network.
163 If you do not completely trust everything on your network, then the best practice
is to choose a Kerberos option. If the system does not support Kerberos, it will not
be fully protected because NFS without Kerberos trusts everything on the network
and sends all packets in cleartext.
164Use an IPsec tunnel. This option is very secure because it authenticates the
devices using secure keys. Alternatively, configure all of the switch ports to go
inactive if they are physically disconnected. In addition, ensure that the switch ports
are MAC limited.
NFS export settings can be configured globally for all exports or specific to an
export.
It is recommended that you configure advanced NFS export settings165 only if you
have a solid understanding of the NFS protocol.
For the complete list of advanced options, view the PowerScale OneFS CLI
Command Reference guide.
165Changes to default export settings affect all current and future NFS exports that
use default settings.
Block Size: The block size used to calculate block counts for NFSv3 FSSTAT and NFSv4 GETATTR requests. The default value is 8192 bytes.
Directory Transfer Size: The preferred directory read transfer size reported to NFSv3 and NFSv4 clients. The default value is 131072 bytes.
Read Transfer Max Size: The maximum read transfer size reported to NFSv3 and NFSv4 clients. The default value is 1048576 bytes.
Write Transfer Max Size: The maximum write transfer size reported to NFSv3 and NFSv4 clients. The default value is 1048576 bytes.
Max File Size: Specifies the maximum file size to allow. This setting is advisory in nature and is returned to the client in a reply to an NFSv3 FSINFO or NFSv4 GETATTR request. The default value is 9223372036854776000 bytes.
Encoding: Overrides the general encoding settings the cluster has for the export. The default value is DEFAULT.
166 Verify signatures using AWS Signature Version 4 or AWS Signature Version 2 and validate them against the S3 request.
167The access key ID can be a 16 to 128-byte string. The access ID indicates who
the user is. OneFS generates one access ID per user. For example, OneFS may
generate access ID 1_joe_accid for user joe. The prefix number represents the
user’s access zone ID. Each access ID would have a latest secret key without an
expiry time set and an old secret key that has an expiry time set.
− Secret Key168
• OneFS treats unauthenticated requests as anonymous requests made by the
nobody user (UID 65534).
• Only users in the Administrator role are authorized to generate access keys.169
• CLI command: isi s3 keys create
• The multipart upload allows users to upload new large files or make a copy of
an existing file in parts for better uploading performance.
• Parts are uploaded to the temporary directory .isi_s3_parts_UploadId,
and the temporary directory is created under the target directory.
• A part has a maximum size of 5 GB, and each part except the last has a minimum size of 5 MB.
• After all the parts are uploaded successfully, multipart upload is completed by
concatenating the temporary files to the target file.
168The secret key is used to generate the signature value along with several
request-header values. After receiving the signed request, OneFS uses the access
ID to retrieve a copy of the secret key internally, recompute the signature value of
the request, and compare it against the received signature. If they match, the
requester is authenticated, and any header value that was used in the signature is
now verified to be untampered as well.
169If an administrator creates a new secret key for a user and forgets to set the
expiry time, the administrator cannot go back and set the expiry time again. The
new key is created and the old key is set to expire after 10 minutes, by default.
3: After receiving the client complete multipart upload request, OneFS finishes the
multipart upload operation by concatenating the temporary files to the target file.
Applications developed using the S3 API can access OneFS files and directories
as objects using the OneFS S3 protocol.
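Before an application can connect, the user needs a key pair and a bucket. A minimal sketch (the user, bucket, path, and zone names are placeholders, and the argument layout is an assumption based on the isi s3 commands; check the CLI reference for the exact syntax in your release):
isi s3 keys create joe --zone=System
isi s3 buckets create bkt01 /ifs/data/bkt01 --create-path --owner=joe --zone=System
The generated access ID and secret key are then used by the S3 application to sign its requests.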
Listed in the table are the common OneFS S3 bucket and object operations.
For the complete list and description, view the Dell EMC PowerScale: OneFS S3
API Guide.
CreateBucket170 GetObject171
ListObjects172 DeleteObject173
GetBucketLocation174 HeadObject175
DeleteBucket176 PutObject177
170 The PUT operation is used to create a bucket. Anonymous requests are never
allowed to create buckets. By creating the bucket, the authenticated user becomes
the bucket owner.
172 The API returns some or all (up to 1,000) of the objects in a bucket.
173Delete a single object from a bucket. Deleting multiple objects from a bucket
using a single request is not supported.
175 HEAD operation retrieves metadata from an object without returning the object
itself. This operation is useful if you are only interested in an object's metadata. The
operation returns a 200 OK if the object exists and if you have permission to
access it. Otherwise, the operation might return responses such as 404 Not Found
and 403 Forbidden.
176 Delete a bucket. When a bucket is deleted, OneFS only removes the bucket
information while preserving the data under the bucket.
HeadBucket178 CopyObject179
ListBuckets180 CreateMultipartUpload181
ListMultipartUploads182 UploadPart183
179Create a copy of an object that is already stored in OneFS. You can treat it as
server-side-copy which reduces the network traffic between the clients and OneFS.
180 Get a list of all buckets owned by the authenticated user of the request.
181Initiate a multipart upload and return an upload ID. This upload ID is used to
associate with all the parts in the specific multipart upload. You can specify this
upload ID in each of your subsequent upload part requests. You also include this
upload ID in the final request to either complete or cancel the multipart upload
request.
183Upload a part in a multipart upload. Each part must be at least 5 MB, except the
last part. The maximum size of each part is 5 GB.
184Each node in the cluster runs an instance of the Apache HTTP Server to
provide HTTP access. You can configure the HTTP service to run in different
modes.
185 OneFS performs distributed authoring, but does not support versioning and does not perform security checks.
Important: HTTP and FTP only work for the System access zone.
HTTP Administration
You can configure HTTP and DAV to enable users to edit and manage files
collaboratively across remote web servers.
186 HTTP Secure (HTTPS) encrypts information before exchanging it. With HTTPS, the message is understood only by the sender and the recipient; anyone who intercepts the message in between cannot understand it.
Step 1
Click Protocols and then go to HTTP settings. In the Service area, select one of
the following settings:
• Enable HTTP187
• Disable HTTP and redirect to the OneFS Web Administration interface 188
187Allows HTTP access for cluster administration and browsing content on the
cluster.
188Allows only administrative access to the web administration interface. This is the
default setting.
• Disable HTTP189
• CLI: isi http settings modify --service= {enabled | disabled
| redirect}
• Enable or disable access to a PowerScale cluster through the Apache service
over HTTPS: isi_gconfig -t http-config https_enabled={true |
false}
189Closes the HTTP port that is used for file access. Users can continue to access
the web administration interface by specifying the port number in the URL. The
default port is 8080.
Step 2
Type or choose a path within /ifs as the document root directory. Then, select the
HTTP authentication method:
• Off190
• Basic Authentication Only191
• Integrated Authentication Only192
191 Enables HTTP basic authentication. User credentials are sent in clear text.
194Enables HTTP basic authentication and enables the Apache web server to
perform access checks.
195Enables HTTP integrated authentication via NTLM and Kerberos, and enables
the Apache web server to perform access checks.
Step 3
• File Transfer Protocol (FTP) allows systems with an FTP client to connect to the
cluster and exchange files.
• OneFS includes a secure FTP service called vsftpd, which stands for Very
Secure FTP Daemon, that you can configure for standard FTP and FTPS file
transfers.
• You can set the FTP service to allow any node in the cluster to respond to FTP
requests through a standard user account.
• When configuring FTP access, ensure that the specified FTP root is the home
directory of the user who logs in to the cluster197.
• Administration:
1: FTP is disabled by default. You also need to enable the vsftpd service by
running the isi services vsftpd enable command.
2: Allow users with "anonymous" or "ftp" as the username to access files and
directories without requiring authentication. This setting is disabled by default.
3: Allow local users to access files and directories with their local username and
password, allowing them to upload files directly through the file system. This setting
is enabled by default.
197 For example, the FTP root for local user jsmith should be /ifs/home/jsmith.
4: Allow files to be transferred between two remote FTP servers. This setting is
disabled by default.
Challenge
Advanced Authorization
Module Objectives
Multiprotocol Permissions
Multiprotocol Overview
198 Single data access protocols are self-contained. Windows users access Windows file servers through the Server Message Block (SMB) protocol. UNIX users access file servers through the Network File System (NFS) protocol. When a user connects to a cluster to read and write files, the protocol assesses the security of the file against a set of permissions to determine whether access will be allowed. Each protocol presents its own type of file permissions to the user and to the files, which prevents a UNIX user from accessing Windows file servers, and conversely. Each protocol is a closed system.
199 Multiprotocol access puts the NAS platform in the middle, creating a system
where different users can connect to the same file server (or cluster) through
different protocols. The multiprotocol NAS platform handles and stores the
permissions for each protocol and user.
In OneFS, multiprotocol means that users who connect through NFS, SMB, and other protocols can access the same files and directories. If necessary, you can create a file or a directory that either a Windows or a UNIX client200 accesses. However, unlike other file systems or NAS systems, which might maintain protocol permissions separately or rely on user mapping, OneFS uses a single unified permission model201.
200 OneFS supports the standard UNIX tools for viewing and changing permissions,
"ls, chmod, and chown". For more information, run the "man ls", "man chmod", and
"man chown" commands.
The actual file permissions of a user are entirely defined by comparing the access token against the permissions on the file.
OneFS files support two distinct concepts: the permissions state202 and access permissions203.
202 When a file is accessed over SMB, OneFS generates a synthetic ACL that is
based directly on the POSIX permissions of file. The synthetic ACL is a correlation
of the POSIX permissions to an ACL. The synthetic ACL is not persistent: it is not
stored on disk. OneFS only creates the synthetic ACL at the time the file is
accessed using the SMB protocol.
* When the NFS client issues an "ls", the approximated POSIX permissions are seen, but actual file access is evaluated against the ACL. The plus sign (+) indicates that the file has a real ACL.
1:
When you access a file that is in the real POSIX/synthetic ACL state using NFS,
OneFS checks the standard POSIX permissions. When a Windows user checks
the permissions of a file that has the POSIX authoritative state by using the
Windows Explorer Security tab, that user expects to see the file ACLs 204.
204When a file is accessed over SMB, OneFS generates a synthetic ACL that is
based directly on the POSIX permissions of file. The synthetic ACL is a direct, one-
to-one correlation of the POSIX permissions to an ACL. The synthetic ACL is not
persistent: it is not stored on disk. OneFS only creates the synthetic ACL at the
time the file is accessed using the SMB protocol.
At the NFS side, File1 is in the real POSIX/synthetic ACL state. If you are a
Windows user looking at permissions over SMB of File1 using a Windows Explorer
Security tab, you do not want to see POSIX bits205. The resulting ACLs emulate the
POSIX permissions and look like normal ACLs. This action does not change the
actual permissions. It also does not affect actions that you can take on File1.
2:
205For one thing, POSIX bits are not as rich as SMB ACLs, and for another, you
expect to see SMB-style ACLs for File1. OneFS accommodates Windows users
expectations by automatically generating a set of synthetic ACLs on the fly based
on the POSIX bits.
206Accessing that file from Windows/SMB means that OneFS performs the file
access check directly against the ACL as usual.
207 The POSIX bits do not matter from an access check perspective, but OneFS will
still need to show the POSIX mode bits. That is because when you issue an "ls"
command on a file across NFS, OneFS has to return an NFS view of file
permissions. However, those POSIX mode bits do not represent the file
permissions: the REAL ACL does.
208 For example, you can add many UIDs, GIDs, and SIDs to define permissions if you need to. But POSIX has only three bits to work with: read, write, and execute for the owner, the group, and everyone else.
The permissions state and the access permissions on a file or directory do not affect each other. Access permissions must be consistent (identical) regardless of the file permissions state. In OneFS, files and directories can have only one set of permissions and can exist in only one of two states: POSIX or ACL.
The share permissions and file permissions are added up and compared to each
other when determining what the effective permissions are on any object.
In order to check PowerScale OneFS file permissions, from the PowerScale CLI,
the extended ls commands are used.
Note: The ls command lists the directory contents. The -l option lists files in the long format. The -e option prints the Access Control List (ACL). The -d option lists the directory itself rather than its contents. The -n option displays user and group IDs numerically.
Use -s before the username or group name in the command if the user or group is
located in an Active Directory authentication provider.
For example, the command for changing the owner of the file to the student
account in the dees.lab AD domain is chown -s DEES\\student file1.
The owner root and the group owner wheel have read, write, and execute permissions over file1; all other users are allowed only to read the file. Users hayden and domain admins cannot write to file1 because they are not in the group wheel.
The AD owner and group owner have all the permissions; all other users are allowed only to read the file.
Without using the -s flag, users hayden and domain admins cannot change the owners of file1.
After using the -s flag, hayden and domain admins are the owner and group owner of file1, and they now have read, write, and execute permissions over file1.
• The account that is used to log in to that share has the appropriate
permissions209.
• Inheritance is automatically added to objects when adding or modifying permissions210 over SMB.
210Administrators can disable this behavior per directory through SMB. The ability
to modify permissions over SMB can also be disabled through a global cluster
setting.
Export Permissions
Configuring access to the cluster using NFS requires configuring NFS exports and
associated permissions. With exports, permissions are not granted to users and
groups, but rather to hosts by hostname or IP, entire subnets, or netgroups. Hosts
should only be in one field per export.
You can create an export with nested access. In the following example, all clients within the IP range 192.168.3.3/24 have read-only permission, while the client with IP 192.168.3.3 has read-write permission.
You can enter a client by hostname, IPv4 or IPv6 address, subnet, netgroup, or
CIDR range. Client fields:
• Clients: Clients that are specified in the generic Clients field are given read/write permission, unless the Restrict access to read-only checkbox is also selected. Selecting it treats all hosts in the Clients field as read-only but does not affect the hosts in the Always Read-Write Clients or Root Clients fields.
• Always Read-Write Clients: Clients in the Always Read-Write Clients field are
given read/write permissions. The Restrict access to read-only checkbox does
not apply to hosts in this field.
• Always Read-Only Clients: Clients in the Always Read-Only Clients field are
given read-only permissions.
• Root Clients: Clients in the Root Clients field are mapped as root if the user logged in to the local host is logged in as root. This option gives users significant privileges within the export directories and should be avoided where possible. Instead, use the other permission fields and set the cluster to automatically perform root squash for all root users when connecting to the cluster.
• Map Users: The Map Users options allow the administrator to specify users to
be mapped to other UIDs and thus be treated as if they are that other user
when connected to the cluster. This can also be done per access zone with
user-mapping rules. It also allows the cluster to squash root.
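A CLI equivalent of the nested-access example described above (a minimal sketch; the export path and zone name are placeholders):
isi nfs exports create /ifs/data/projects --clients=192.168.3.0/24 --read-only=true --read-write-clients=192.168.3.3 --zone=zoneB
Hosts matched by --clients are restricted to read-only, while the host in --read-write-clients keeps read/write access.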
If the user or group for this ACE is in an AD domain, the domain must be specified
as part of the username or group name as shown using either
'[user|group]@domain_name' or 'domain_name\[user|group]'
For the AD user student1 and AD group domain admins, the domain dees.lab is specified as part
of the user or group name.
Just as ACEs can be added individually, they can also be removed. The command is chmod -a# <ACE number> <filename>. It is important to verify that the right ACE is being removed from the ACL when using this option.
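A minimal sketch of adding, inspecting, and removing an ACE (the path, account, and ACE index are placeholders; the permission keywords follow the OneFS chmod ACL syntax):
chmod +a user DEES\\student allow dir_gen_read,dir_gen_execute /ifs/data/dir1
ls -led /ifs/data/dir1
chmod -a# 1 /ifs/data/dir1
The ls -led output numbers the ACEs, and that index is what chmod -a# removes.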
Note: See the document Dell EMC Isilon: Access Control Lists on HDFS and Isilon OneFS for a list of PowerScale ACEs.
Advanced options can be used to set up ACLs. See the document Access Control Lists on Dell EMC PowerScale OneFS.
The job contains three different execution options, or modes, depending on the
resolution required.
Mode Usage
Caution: Using Clone or Inherit changes the ownership and group of the
new directory.
211Tasks are multiple individual work items that are divided up and load balanced
across the nodes within the cluster.
This method saves time when many new directories are needed, because the basic permissions do not need to be applied manually on every directory. An administrator can also change the impact configuration of a running job without changing the default for all instantiations of that job. The idea is that, rather than updating the basic configuration of the job, you start the job and then change the parameters of that one running job, which avoids changing the default configuration.
Review the question and answer provided in each tab along with the directory and
share permissions.
Example 1
Answer: No. POSIX is authoritative and directory permission only allows root to
write to the test directory.
Example 2
Answer: No. The share permission gives Everyone read-only permission, including
the “Administrator” account.
Example 3
Answer: No. The user “hayden” is a member of the group "Users”, which has a
deny write permission set on the directory.
212 During the translation to an SMB ACL, OneFS translates the internal ACL to a canonical ACL and sends it to SMB clients. A canonical ACL always places an explicit ACE before an inherited ACE, and always places a deny ACE before an allow ACE. Because SMB clients are always presented with a reordered canonical ACL rather than the actual ACL in OneFS, users need to be careful when editing ACLs through SMB.
Challenge
User Mapping
User mapping provides a way to control access by specifying a user’s complete list
of the security identifiers, user identifiers, and group identifiers. OneFS uses the
identifiers— which are commonly called SIDs, UIDs, and GIDs respectively—to
determine ownership and check access.
213With the user mapping service, rules are configured to manipulate a user’s
access token by modifying which identity OneFS uses, adding supplemental user
identities, and changing a user’s group membership. OneFS maps users only
during login or protocol access.
214 When there is no user mapping, OneFS authenticates the user from Active Directory and builds an access token that prioritizes the account information from Active Directory. If rules are not configured, a user authenticating with one directory service receives full access to the identity information in other directory services when the account names are the same.
1: The default mapping provides a user with a UID from LDAP and a SID from the
default group in Active Directory. The user’s groups come from Active Directory
and LDAP, with the LDAP groups added to the list.
215The ID-mapping service maps the user’s SIDs to UIDs and GIDs if the user
connects over SMB. If the user connects to the cluster over NFS, the ID-mapping
service does not map the UID and GIDs to SIDs by default. There is no mapping
since the default on-disk identity is in the form of a UID and GID.
216 The user-mapping service is responsible for combining access tokens from
different directory services into a single token.
The graphic shows the user-mapping service combining directory service tokens for a user connecting over NFS.
1. When the cluster builds an access token, it must begin by looking up users in
external directory services.
• Over SMB: AD preferred, LDAP can be appended.
• Over NFS: LDAP or NIS only
2. By default, the cluster matches users with the same name in different
authentication providers and treats them as the same user.
3. The ID-mapping service populates the access token with the appropriate identities. Accounts are matched to combine access tokens from different directory services.
4. Finally, the on-disk identity is determined.
User-Mapping Options
There are a few options that the administrator has when considering how to
configure multiprotocol access to the cluster. If multiprotocol access cannot be
avoided, the best practice is to keep the naming schema for users in different
authentication providers the same. However, if the usernames are not the same in
the different authentication providers, the administrator must choose how to map them.
Example217
Scenario 1 Scenario 2
The user-mapping rules have five operators that can be applied to each rule, and
each rule can be applied to a specific access zone.
• Append (++)218
217 If the username in Active Directory is "jsmith", then the username in LDAP should also be "jsmith".
• Insert (+=)219
• Replace (=>)220
• Remove groups (--)221
• Join (&=)222
218Append rule adds fields to an access token, but it does not displace a primary
user or group. The mapping service appends the fields that are specified in the list
of options (user, group, groups) to the first identity in the rule.
219 Insert also adds fields to an access token, but it displaces a primary group into
the additional identifiers list. When the rule inserts a primary user or primary group,
it becomes the new primary user or primary group in the token. The previous
primary user or primary group moves to the additional identifiers list.
220 Replace removes an identity from the token and replaces it with the specified user. If the second username is left blank, the mapping service removes the first username in the token, leaving no username, and login then fails with a no such user error.
221Remove groups removes group identifiers from the access token. Modifies a
token by removing the supplemental groups.
222 Join merges two access tokens together. While the operation is bi-directional,
meaning the cluster could perform mapping using either username specified in the
rule, the order does matter to determine file ownership. If one of the usernames
should always be the owner of a file, make it the first name in the rule.
• The mapping rules you create are configured per access zone, so ensure that you select the correct access zone before configuring.
• Once the operation is selected, the page updates to reflect the options to
configure for that operation.
In this example, two users are being joined. The first user is in Active Directory, and
the second user is in LDAP. If the cluster is unable to perform the lookup, the user
is mapped to the “Guest” account. No other rules are checked for the users who
are specified in this rule.
• This example shows joining the AD account student with the AD account
administrator, but only in the System zone.
− Other access zones do not map these two accounts together unless another
rule is created specifically for those access zones.
− There was no option to map a default user or to stop processing rules in this
command, but those options are available through the CLI as well.
• isi zone zones modify <Access Zone> --add-user-mapping-rules=<username and action>
• isi zone zones modify System --add-user-mapping-rules=<username and action>
You can view any rules that you create through the CLI using the command isi zone zones view <Access Zone>. Once the rules are configured, verify that the IDs are correctly mapped by running isi auth mapping token <username> and checking that the right identifiers are displayed.
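As a hedged illustration, the following sketch creates a join rule in the System zone and then checks the resulting token; the rule string, the quoting, and the marketing zone name are assumptions, so confirm the exact syntax with isi zone zones modify --help.

# Join the AD account with the matching LDAP account in the System zone
isi zone zones modify System --add-user-mapping-rules="DEES\student &= student"
isi zone zones view System                              # review the configured rules
isi auth mapping token DEES\\student                    # verify the merged token in System
isi auth mapping token DEES\\student --zone=marketing   # use --zone for other zones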
The examples combine the username formats with operators to form rules. Several
of the rules include an option to specify how OneFS processes the rule.
• The access token for any user can be viewed using the isi auth mapping
token command.
223 The break option forces OneFS to stop applying rules and to generate the token
at the point of the break.
224This rule uses wildcards to join users from the DESKTOP domain with UNIX
users who have the same name in LDAP, NIS, or the local provider.
225This rule tightly restricts access by removing the identities of everybody other
than those permitted by preceding rules.
226This rule maps the administrator account from any Active Directory domain to
the nobody account on OneFS. The rule exemplifies how to turn a powerful
account into an innocuous account.
• Use this command to verify the correct mapping of the identifiers for the user and to show the appropriate groups for that user.
• No zone is specified if the access tokens are being checked in the System
access zone. If the access token is being checked for another access zone, the
--zone option needs to be specified.
Username must be specified as domain\\username.
Troubleshoot Commands
RFC 2307
RFC 2307 allows you to implement unified authentication for UNIX and Windows Active Directory accounts by associating a user ID (UID), group ID (GID), home directory, and shell with an Active Directory object.
OneFS does not require the NIS authentication component, as only the UID/GIDs
are used. AD with RFC 2307 maps SIDs with UID/GIDs, eliminating the need for
mapping in OneFS, simplifying management further.
OneFS contains advancements and user mapping rules that make it easier to
converge LDAP, NIS, and Local Users with Active Directory users and groups.
Best Practices
Dell EMC PowerScale recommends the following best practices to simplify user
mapping.
a. Use Microsoft Active Directory with Windows Services for UNIX and RFC 2307
attributes to manage Linux, UNIX, and Windows systems.
b. Follow the naming convention and name the users consistently so that each
UNIX user corresponds to a similarly named Windows user.
c. Ensure that UID and GID ranges do not overlap in networks with multiple
identity sources.
d. You should not use well-known UIDs and GIDs in your ID ranges because they
are reserved for system accounts.
A. Integrating UNIX and Linux systems with Active Directory centralizes identity
management and eases interoperability, reducing the need for user mapping rules.
Ensure your domain controllers are running Windows Server 2003 or later.
B. The simplest configurations name users consistently so that each UNIX user
corresponds to a similarly named Windows user. Such a convention allows rules
with wildcards to match names and map them without explicitly specifying each pair
of accounts.
C. It is also important that the range from which OneFS automatically allocates
UIDs and GIDs does not overlap with any other ID range. The range from which
OneFS automatically allocates a UID and GID is 1,000,000 to 2,000,000. If UIDs
and GIDs overlap across two or more directory services, some users might gain
access to other users’ directories and files.
D. UIDs and GIDs below 1000 are reserved for system accounts; do not assign
them to users or groups.
E. OneFS processes every mapping rule by default. Processing every rule, though,
can present problems when you apply a rule to deny all unknown users access. In
addition, replacement rules may interact with rules that contain wildcard characters.
F. A user principal name (UPN) is an Active Directory domain and username that
are combined into an Internet-style name with an @ sign, like an email address:
[email protected]. If you include a UPN in a rule, the mapping service ignores it
and might return an error.
G. This practice lets OneFS honor group permissions on files created over NFS or
migrated from other UNIX storage systems.
Challenge
Lab Assignment:
1) View and verify the access token for an unmapped user existing on
both Windows and Linux.
2) Create and test a user mapping rule for a user on both Windows and
Linux.
Reporting
Module Objectives
System Events
228OneFS continuously monitors the health and performance of the cluster and
generates events when situations occur that might require attention. Events and
event notifications information includes drives, nodes, snapshots, network traffic,
and hardware.
229The main goal of the system events feature is to provide a mechanism for
customers and support to view the status of the cluster.
• Events provide notifications for any ongoing issues and display the history of an
issue230.
• The Cluster Events Log (CELOG) process monitors, logs, and reports the
important activities and error conditions on the nodes and cluster.
CELOG
230 Event information can be sorted and filtered by date, type/module, and criticality
of the event.
232 The administrator can configure conditions for alert delivery, to best reflect the needs of the organization.
1: Monitor is responsible for system monitoring and event creation; it sends the event to the kernel queue.
2: Capture is responsible for reading event occurrences from the kernel queue, storing them safely on persistent local storage, generating attachments, and queuing them in priority buckets for analysis. Event capture continues to operate on isolated nodes until the local storage is full.
3: The main analysis process runs on only one node in the cluster. The analysis
process collects related event occurrences together as event group occurrences,
which can be reported upon by the Reporter, ignored (either automatically for
things like Job Engine events or manually) and resolved (either automatically by
event occurrences or manually).
4: Similar to the analysis, the event reporter runs on only one node in the cluster.
The event reporter periodically queries Event Analysis for event group occurrences
that have changed and for each of these evaluates any relevant alert conditions,
generating alert requests for any which are satisfied.
5: Alerting is the final stage in the CELOG workflow. It is responsible for delivering the alerts requested by the Reporter. There is a single sender on the cluster for each enabled channel.
CELOG Architecture
• Coalesces events into event groups and provides conditional alerting to prevent
over-notification.
• CELOG system processes raw events and stores them in log databases.
• Events themselves are not reported, but CELOG reports on event groups234.
234 Reporting on event groups is not uniform, but depends on conditions, and
defined reporting channels. Networking issues would be reported to a channel that
includes network administrators. However, database administrators would probably
not benefit much from the information, so their reporting channel need not be on
the list for networking related issues.
235 Identifies the type of event that has been generated on the cluster.
• The event groups are listed on the Cluster Management > Events and alerts > Events page, where you can filter the list and ignore or resolve multiple event groups.
237For example, if a chassis fan fails in a node, OneFS might capture multiple
events related both to the failed fan itself, and to exceeded temperature thresholds
within the node. All events related to the fan will be represented in a single event
group. Because there is a single point of contact, you do not need to manage
numerous individual events. You can handle the situation as a single, coherent
issue.
239 When an event group is marked as ignored, it is not reported upon and does not appear in lists by default.
240Some event groups can collect multiple event types and have IDs that do not
correspond to event types.
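Event groups can also be inspected and managed from the CLI; a brief sketch follows (the command names are from OneFS 8.x and the event group ID is an example value, so confirm with isi event --help).

isi event groups list                            # list current event group occurrences
isi event groups view 65686                      # view details of one event group
isi event groups modify 65686 --ignored=true     # mark an event group as ignored
isi event groups modify 65686 --resolved=true    # mark an event group as resolved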
Event Analysis
241 The 'Event group causes' section provides the short version of the event.
− Severity243
− Alert Channels244
− Event Count245
− Time noticed246
− Resolver Information247
− Ignored248
• Events within the event group are displayed below the summary information of
the event group.
242The event group instance is assigned a unique identifier within the cluster to
distinguish it from other instances of the same event group type.
245The number of events that were generated for the given event group. The event
count and event time provide key information to determine the root cause.
247When the event group is marked as resolved, additional information such as the
resolver name and resolver time is displayed.
248 The flag indicates whether the event group is marked as ignored or not.
• Topics include managing event groups, alerts, alert channels, alert maintenance
and testing, and a full list of event IDs or codes.
• For each event group, the description and administrative action are specified.
249At any point in time, you can view event groups to track situations occurring on
your cluster. However, you can also create alerts that will proactively notify you if
there is a change in an event group. You can control how alerts related to an event
group are distributed.
250Channels are pathways by which event groups send alerts. You can create and
configure a channel to send alerts to a specific audience, control the content the
channel distributes, and limit frequency of the alerts. The channel is a convenient
way of managing alerting configurations, such as SNMP hosts, and lists of email
addresses.
251Alerts are definable to meet the needs of the organization. Different alerts are
defined to provide separate event group alerting. For example, an organization can
create an alert for hardware events only, and route the alerts to the hardware
support team within the organization. Also, administrators can create alerts to
provide management notification when an event severity increases.
• Administrators can manually create and manage channels using the OneFS
WebUI and CLI.
• To create a new channel, a channel name and type are required.
• There are three primary channel types: SMTP252, ConnectEmc253, SNMP254
• You can specify255 one or more nodes that are allowed or denied to send alerts
through the channel.
252Alerts are sent as emails through an SMTP server. With an SMTP channel type,
email messages are sent to a distribution list. SMTP, authorization, and security
settings can be set.
253 ConnectEmc enables Dell EMC support to receive alerts from SRS.
255 The node number is specified as an integer. If you do not specify any allowed nodes, all nodes in the cluster will be allowed to send alerts.
Hayden creates an SMTP alert channel for all the marketing administrators.
WebUI: Cluster Management > Events and alerts > Alerts | CLI: isi event channels
create
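A hedged CLI sketch of creating such a channel follows; the channel name, recipient address, and SMTP relay are illustrative, and the flag names should be confirmed with isi event channels create --help on your release.

isi event channels create MktgAdmins smtp \
    --address=mktg-admins@dees.lab \
    --smtp-host=smtp.dees.lab --smtp-port=25
isi event channels list    # confirm the new channel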
The alert channel types have different setup requirements. When the type is
selected, the setup options change appropriately in the WebUI.
Alert Administration
• Configure alerts to associate the alert channel for sending alert notifications,
and to determine event criteria.
• Event criteria includes: Event Group IDs, condition to send, frequency, event
duration before sending.
• Administrators can create any number of alerts the organization requires.
• Administrators can select any or all event group categories256 to include in the
alert.
• CLI command: isi event alerts create
• Using the CLI, administrators can specify the severity level257.
Hayden creates an alert to email all marketing administrators on new event groups for
SmartQuotas, Snapshots and Software-related events.
WebUI: Cluster Management > Events and alerts > Alerts > Create an alert
It is recommended but not required to create the alert channels before creating
alerts. The different alert conditions available are:
256Also, individual specific event group IDs can be added to the alert. Alert
conditions provide additional refinement for when and how the alert is sent.
• New event groups - Reports on event group occurrences that have never before
reported.
• New events - Reports on event group occurrences that are new since the event
group was last reported on.
• Interval - Provides periodic reports on event group occurrences that have not
been resolved.
• Severity increase - Reports on event group occurrences whose severity has
increased since the event group was last reported on.
• Severity decrease - Reports on event group occurrences whose severity has
decreased since the event group was last reported on.
• Resolved event group - Reports on event group occurrences that have been
resolved since the event group was last reported on.
The Maximum Alert Limit restricts the number of alerts sent. Some events can
generate tens, hundreds, or thousands of alerts. Maximum alert limits do not apply
to Interval conditions. The event longevity condition provides for a time delay
before sending an alert. Some events are self-correcting or may last a few seconds
based on certain cluster conditions. For example, a node CPU at 100 percent
utilization may only last for a short duration. Although the condition may be critical if
the event occurs over a prolonged period, events over a short period may not be
important.
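A hedged sketch of the equivalent CLI alert, assuming the MktgAdmins channel created earlier; the condition keyword, the --channel and --limit options, and the ability to add categories or event group IDs with further options are recalled from OneFS 8.x and should be verified with isi event alerts create --help.

isi event alerts create MktgNewEvents NEW --channel=MktgAdmins --limit=5
isi event alerts list    # confirm the alert and its condition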
Heartbeat Alert
258In order to confirm that the system is operating correctly, test events are
automatically sent every day, one event from each node in your cluster.
259By default, heartbeat test alerts are not sent to any other alert channel. To
monitor their success, administrators can configure an alert channel and add the
channel to the Heartbeat alert. Administrators can change the interval using the
WebUI or CLI.
• The test alert can be created and sent with a custom test message.
• CLI: isi event test create "Test message"
• You can modify settings to determine how event data is handled on your
cluster:
− Resolved event group data retention260
− Event log storage limit261
260By default, data related to resolved event groups is retained indefinitely. You
can set a retention limit to make the system automatically delete resolved event
group data after a certain number of days.
261You can also limit the amount of memory that event data can occupy on your
cluster. By default, the limit is 1 megabyte of memory for every 1 terabyte of total
memory on the cluster. You can adjust this limit to be between 1 and 100
megabytes of memory. When your cluster reaches a storage limit, the system will
begin deleting the oldest event group data to accommodate new data.
262These events are known and considered benign to the normal operations of the
cluster.
Challenge
Lab Assignment:
1) View and analyze events and event groups.
2) Create an SMTP alert channel and alert administrators for different
software events.
Log Files
Definition
Log files are a collection of informational files from multiple sources within the cluster.
The log entries provide detailed information about the operating system, file
system, entire cluster and on a node level including health, status, events, and
error conditions. Certain log files, like /var/log/messages, contain multiple
types of data while others are specialized and only contain one type.
• Log files are the primary source of information for troubleshooting issues on the
cluster.
• Multiple different log files are created for each type of cluster issue category and
provide the raw captured details.
• Different logs provide cluster-wide and node level details. Each log contains
their own set of information captured.
• The log file information provides the detailed information of activity and status of
the cluster at any given point in time. Use log file information to troubleshoot
issues with the cluster.
The graphic shows an example with a portion of the /var/log/messages log file from node 1.
The file reflects information from multiple processes.
OneFS writes log messages that are associated with NFS events to a set of files in
/var/log.
With the log level option, you can specify the detail at which log messages are
output to log files.
The table describes the log files that are associated with NFS.
Example
OneFS maintains both cluster-wide and node-specific logs on each node. Each node has its own set of log files.
264Some logs are general log files such as the messages log file, and some are
process-specific, such as logs for SMB, CELOG, alerts or events, and hardware
logs such as drive evaluation and drive history.
While there are ways to see cluster information remotely, the raw log files on each
node are located under the /var/log directory. Under the /var/log directory, a
series of different log file subdirectories exist that contain the individual log files.
265 The replaced log files remain in the /var/log directory for a time. Administrators may need to delete these old log files when troubleshooting, but should not do so without senior technical support guidance. The size of the /var/log partition is either 500 MB or 2 GB for generation 4 and 5 nodes, and varies by node type for generation 6 nodes.
266 You can send log files to technical support. Technical support will request log files when troubleshooting issues. Upload the logs using Secure Remote Services, HTTP, or FTP. Log files can be large depending on the cluster activity and can take some time to collect and upload.
268 The find -name command allows administrators to identify the specific files.
Rotated logs
Directory
The graphic shows the list of all the CELOG files including the older log files that are removed from
the current file being logged to using the find command.
Look for the masterfile.txt to determine which node is the primary CELOG for
the whole cluster of nodes. This is located in the following file:
/ifs/.ifsvar/db/celog/masterfile.txt.
In the masterfile.txt, the number in the file is the devid of the primary CELOG node. For example, in a six-node cluster, a value of 1 signifies the node that has devid 1.
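For example, the following standard commands locate the CELOG log files on a node and identify the primary CELOG node for the cluster; the path comes from the text above.

find /var/log -name "*celog*"               # current and rotated CELOG log files on this node
cat /ifs/.ifsvar/db/celog/masterfile.txt    # prints the devid of the primary CELOG node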
The amount of detail that is captured in the log files can vary based on the level of
detail set either as a default or changed to help troubleshoot an issue.
Trace and Debug are additional levels that engineering and development use. These levels are so verbose that they should not be left on; the amount of data they gather can affect the cluster's performance substantially, and they can generate enough data to fill /var/log if run for an extended period.
One consideration is that increasing the log detail level increases the number of entries in the log file and therefore the amount of information to sort through, which also increases the time needed to find the issue. The practice is to use the lowest detail level required to identify, isolate, and troubleshoot an issue. Another potential issue is that the /var/log directory is limited in size; too much information can risk filling up the /var directory. If changing the log detail level, it is important to reset it when finished. Generally, a reset should only be done at the direction of technical support or engineering.
272 Verbose logging can have a significant effect on cluster performance and affect workflows.
The graphic shows how the isi diagnostics gather command, whether run
through the CLI or using the web administration interface, collects similar log files,
packages them into several files in .tar file format. The tar files are then
packaged together and compressed using gzip into a single large .tgz file known
as a tarball for transport to Technical Support.
Tarball
• Extract the contents of a tarball with the following command at the command
line:
• tar xzvf filename.tgz
• To examine the list of files in a tarball without extracting it, use the following command:
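• tar tzvf filename.tgz
(The -t flag is the standard tar option for listing archive contents; the filename matches the extraction example above.)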
Gather Settings
• The settings change the global defaults for gather info and offer limited options for modification.
• If you have difficulty configuring them, contact the network administration team.
• The preference is to gather full log sets. The tool used by support to analyze log
sets requires a full log gather.
• Check the default FTP settings if logs are not being uploaded to Technical
Support’s FTP server during a log gather process.
• The HTTP and FTP proxy server settings point to an internal FTP or HTTP proxy server that is used as the gateway to reach the external IP network.
The gather process collects information about different system utilities and collects
groups of information. You can specify multiple upload locations and upload
methods. You can modify the default setting.
To view all the available options and option descriptions, use the following
commands:
• isi diagnostics gather -h
• isi diagnostics gather settings view -h
• isi diagnostics gather settings modify -h
The graphic shows several of the most commonly used command options. Use these options to
view the help related to running log gather, or viewing and modifying log gather settings.
The isi diagnostics gather command collects log files from the cluster, runs various commands, and then includes the output from those commands. Logs are copied from the cluster during the log gathering process and included in the output.
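A brief sketch of driving a gather from the CLI follows; the start and status subcommands are assumed from OneFS 8.2 and later, so confirm them with isi diagnostics gather -h.

isi diagnostics gather start            # begin collecting logs into a tarball
isi diagnostics gather status           # check the progress of the running gather
isi diagnostics gather settings view    # review upload locations and methods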
In a log set there are many different logs that can be classified as they relate to a
single node at a time, or the whole cluster. Logs can also be classified based on
whether they only describe a functioning aspect of the cluster, or whether a lot of
generic information is placed in them.
• Node-specific logs
• Logs relating to the whole cluster
• Messages log
In a log set, node-specific logs are placed in a separate directory for each node.
This means that administrators can use these logs to understand what was
happening from the perspective of each node by going into that sub directory of the
log set.
Logs relating to the whole cluster are stored in the shared /ifs directory. These are
not placed in node-specific sub directories. CELOG output is generic because it
relates to all cluster events, without being limited to a single node or a single
application. The CELOG files exist on each node. The
isi_celog_analysis.log, isi_celog_capture.log,
isi_celog_alerting.log, isi_celog_events.log, and
isi_celog_monitor.log logs make up the CELOG log file set. The logs exist
on every node; however, the set on the primary coalescer node contains the primary log set for the cluster-wide CELOG events. So a log can contain node-specific and cluster-wide information in the same log file.
Each node has its own messages log, because the messages log is built by the
individual OneFS instance running on each node separately, but it is not linked to
any one service or application on the node. The vsftpd.log is specific to the FTP
daemon running on each node individually, and therefore only tells about that one
service on each node individually. Here are the paths to some of the logs on the live system as described above: /var/log/messages and /var/log/lsassd.log.
Examples of some of the various commands that are run during the log gather are
isi_status, isi_quota, and isi_hw_status. The output from each command
is in the log gather output file.
• Difficulties in a PowerScale cluster can arise for many reasons, and the same
symptom may appear different in the logs.
• For example, administrators may have a network configuration where routing
errors prevent the cluster from reaching an LDAP server. This could mask the
fact that LDAP is also misconfigured, or this could be the result of a cluster
configuration error.
• Failed reporting systems can completely prevent alerts from reaching
administrative staff. SPAM filters may prevent internal staff from receiving email
alerts.
• Log files are guides to help identify problems. Log files are not answers to the
root cause.
Challenge
Notifications
Quota Notifications
Quota notifications are generated for enforcement quotas, providing users with
information when a quota violation occurs. Reminders are sent periodically while
the condition persists.
You can configure notifications globally, and apply to all quota domains, or
configure for specific quota domains. Enforcement quotas support the following
notification settings.
Use the system settings for quota notification: Uses the global default notification for the specified type of quota.
Each notification rule defines the condition that is to be enforced and the action that
is to be executed when the condition is true. An enforcement quota can define
multiple notification rules. When thresholds are exceeded, automatic email
notifications can be sent to specified users, or you can monitor notifications as
system alerts or receive emails for these events.
Notification Rules
Quota notification rules can be written to generate alerts that are triggered by event
thresholds. When an event occurs, a notification triggers according to your
notification rule.
A notification trigger may execute one or more actions, such as sending an email or
sending a cluster alert to the interface.
OneFS can send event notifications through an SMTP mail server. OneFS supports
SMTP authentication.
1: The IPv4 or IPv6 address or the fully qualified domain name of the SMTP relay is entered here.
3: If SMTP authentication is required, check this box. Then enter the authentication username and password, and confirm the password.
4: In the Send email as field, type the originating email address that will be displayed in the From line of the email.
6: Select an option from the Notification Batch Mode drop-down menu to batch
event notification emails.
7: In the Default Email Template drop-down menu, select whether to use the
default template provided with OneFS or a custom template. If you select a
custom template, the Custom Template Location field appears. Enter a path name
for the template.
OneFS configures and applies global quota notification to all quotas. You can
continue to use the global quota notification settings, modify the global notification
settings, or disable or set a custom notification for a quota.
• Threshold exceeded
• Over-quota reminder
• Grace period expired
• Write access denied
You can configure default global quota notification settings that apply to all quotas
of a specified threshold type.
1: In the Archive Directory field, type or browse to the directory where you want to
archive the scheduled quota reports.
2: In the Number of Scheduled Reports Retained field, type the number of reports
that you want to archive.
4: In the Archive Directory field, type or browse to the directory where you want to
archive the manually-generated quota reports.
5: In the Number of Live Reports Retained field, type the number of reports that
you want to archive.
In the Email Mapping area, define the mapping rule or rules that you want to use.
To add an email mapping rule, click Add a Mapping Rule, and then specify the
settings for the rule.
1: From the Current domain list, select the domain that you want to use for the mapping rule.
2: In the Map to domain field, type the name of the domain that you want to map email notifications to.
Tip: Before using quota data for analysis or other purposes, verify that
no QuotaScan jobs are in progress by checking Cluster
Management > Job Operations > Job Summary.
If email notifications for exceeded quotas are enabled, you can customize
PowerScale templates for email notifications or create your own. There are four
email notification templates provided with OneFS. The templates are located in
/etc/ifs and are described in the following table:
• quota_email_template.txt
− A notification that disk quota has been exceeded (also includes a parameter
to define a grace period in number of days).
• quota_email_test_template.txt
− A notification test message you can use to verify that a user is receiving
email notifications.
• quota_email_advisory_template.txt
Tip: If the default email notification templates do not meet the needs,
you can configure your own custom email notification templates by
using a combination of text and SmartQuotas variables.
An email template contains text, and optionally, variables that represent values.
You can use any of the SmartQuotas variables in your templates.
This procedure assumes that you are using the PowerScale templates, which are
located in the /etc/ifs directory.
• Open a secure shell (SSH) connection to any node in the cluster and log in.
• Copy one of the default templates to a directory in which you can edit the file
and later access it through the OneFS web administration interface.
• Open the template file in a text editor.
• Edit the template. Ensure that the template has a Subject: line if a customized template is being used or created.
• Save the changes. Template files must be saved as .txt files.
Click each icon to learn more about configuration of SMTP email settings.
2: You can specify an origination email and subject line for all event notification
email messages sent from the cluster.
3: SMTP settings include the SMTP relay address and port number that email is
routed through.
Protocol Auditing
Auditing Overview
Auditing is the ability to log specific activities on the cluster. Auditing provides the capability to track whether data was accessed, modified, created, or deleted. The auditing capabilities in OneFS cover two areas: monitoring preaccess275 and postaccess276 activity on the cluster. These areas are the ability to audit any configuration changes and to audit the client protocol activity.
The graphic shows the two auditing areas: logging configuration changes, and logging protocol activity for NFS, SMB, and HDFS.
275 Preaccess configuration changes are cluster login failures and successes. The audit system also provides the capability to make the audit logs available to third-party audit applications for review and reporting.
Audit Review
OneFS stores all audit data in audit topic277 files, which collect log information that
can be further processed by auditing tools.
1:
OneFS 7.1 introduced an input/output (LWIO) filter manager. The filter manager provides a plug-in framework for pre and post input/output request packet278 (IRP) processing.
2:
In OneFS 7.1.1, audit logs are automatically compressed. Audit logs are
compressed on file roll-over281.
279 The audit events are logged on the individual nodes where the SMB/NFS client
initiated the activity. The events are then stored in a binary file under
/ifs/.ifsvar/audit/logs.
280 The logs automatically roll over to a new file once the size reaches 1 GB. The default protection for the audit log files is +3. Because various regulatory requirements, such as HIPAA, require two years of audit logs, the audit log files are not deleted from the cluster.
281 As part of the audit log roll-over, a new audit log file is actively written to, while
the previous log file is compressed. The estimated space savings for the audit logs
is 90%.
3: OneFS 8.0.1 adds the support for concurrent delivery to multiple CEE servers.
Each node initiates 20 HTTP 1.1 connections across a subset of CEE servers.
Each node can choose up to 5 CEE servers for delivery. The HTTP connections
are evenly balanced across the CEE servers from each node. The change results
in increased audit performance.
4:
Starting from OneFS 8.2.0, OneFS protocol audit events have been improved284 to
allow for more control of what protocol activity should be audited. The changes
allow increased performance and efficiency by allowing customers to configure
OneFS to no longer collect audit events that are not registered by their auditing
application.
282Once the auditing event has been logged, a CEE forwarder service handles
forwarding the event to CEE. The event is forwarded using an HTTP PUT
operation. At this point, CEE will forward the audit event to a defined endpoint,
such as Varonis DatAdvantage. The audit events are coalesced by the third-party
audit application.
284 It provides a granular way to select protocol audit events to stop collecting unneeded audit events that third-party applications do not register for.
Audit Capabilities
285 By default, all protocol events that occur on a particular node are forwarded to
the /var/log/audit_protocol.log file, regardless of the access zone the event
originated from.
1: Users can specify which events to log in each access zone. For example, a user might want to audit the default set of protocol events in the System access zone, but audit only successful attempts to delete files in a different access zone. The audit events are logged on the individual nodes where the SMB, NFS, or HDFS client initiated the activity. The events are then stored in a binary file under /ifs/.ifsvar/audit/logs. The logs automatically roll over to a new file after the size reaches 1 GB. These settings are disabled by default.
Protocol auditing tracks and stores activity that is performed through SMB, NFS, and HDFS protocol connections. Users can enable and configure protocol auditing for one or more access zones in a cluster. Enabling protocol auditing for an access zone records file access events through the SMB, NFS, and HDFS protocols in the protocol audit topic directories.
CLI commands:
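As a hedged sketch of the kind of commands involved, assuming OneFS 8.x syntax (the zone name and CEE URI are placeholders, and the flags should be verified with isi audit settings --help):

isi audit settings global modify --protocol-auditing-enabled=yes --audited-zones=System
isi audit settings global modify --cee-server-uris=http://cee1.dees.lab:12228/cee
isi audit settings global view    # confirm protocol auditing and CEE forwarding settings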
Event Forwarding
Users can configure OneFS to send protocol auditing logs to servers that support
the Common Event Enabler, or CEE. The CEE enables third-party auditing
applications to collect and analyze protocol auditing logs. The CEE has been tested
and verified to work on several third-party software vendors.
Whether the audit log is enabled or disabled, you can query all the SMB protocol access information through /var/log/lwiod.log.
For example, you can disable the protocol access audit log and then create a new
folder called audit test in an SMB file share. In /var/log/lwiod.log you can
find the following entries.
Modifying Event
The first command sets a create_file audit event upon success. The second
example logs all audit failures. To view the configured events for the access
zone, use the command that is highlighted.
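A hedged approximation of what those commands might look like, assuming OneFS 8.2 per-zone audit settings; the event names and the all keyword are assumptions, so verify the accepted values with isi audit settings modify --help.

isi audit settings modify --zone=System --audit-success=create_file   # audit successful file creates
isi audit settings modify --zone=System --audit-failure=all           # audit all failures (assumed keyword)
isi audit settings view --zone=System                                 # view the configured audit events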
287 In OneFS, auditing stops the collection of audit events that third-party applications do not register for or need.
288 The events are a direct mapping to CEE audit events: create, close, delete, rename, set_security, get_security, write, read. The CEE servers listen, by default, on port 12228.
OneFS provides a tool to view the binary audit logs stored on the cluster. Errors
while processing audit events when delivering them to an external CEE server are
shown in the /var/log/isi_audit_cee.log. Protocol-specific logs show
issues that the audit filter has encountered:
• /var/log/lwiod.log –SMB
• /var/log/nfs.log –NFS
• /var/log/hdfs.log -HDFS
Elevation of Privileges
Decision Point
Yes, the audit log shows the elevated users in three ways:
Create a test SMB share and allow the account Dante to run as root. The following
is the protocol access audit log entry which is generated by isi_audit_viewer
-t protocol:
Create a test NFS export and map all the non-root users to root:
Note: Map Non Root User mapping is disabled by default. We recommend that
you specify this setting on a per-export basis, when appropriate.
The following example shows a user logging in as Dante to run a sudo command. The "UID":0 in the graphic represents the root account. However, unlike the example for SMB/NFS permission elevation, the SID here represents the account behind the sudo command. In this case, it is Dante:
Note: Root can grant enhanced permissions to users so that they can perform privileged tasks. This way, users are given access to OneFS to perform specific tasks beyond normal end-user permissions.
The audit log can track system-level object creation or deletion operations only when they are performed through the WebUI or CLI.
The graphic shows an example of the audit log for creation and deletion of a
storage tier in the PowerScale SmartPools:
View the audit log through the CLI command: isi_audit_viewer -t config
Challenge
Lab Assignment:
1) Configure protocol auditing for an access zone.
2) View, add and verify different events to audit.
3) Track changes to an individual user account.
SNMP
SNMP Overview
SNMP Architecture
OneFS
SNMP Service (snmpd)
SNMP Community
PowerScale Management Information Bases (MIBs)
SNMP Response
• If using SNMP for CELOG alerts, these SNMP reporting settings are used as the defaults.
• The default SNMP v3 username289 (general) and password can be changed using the CLI or the WebUI.
• Configure an NMS to query each node directly through a static IPv4
address.
• To enable SNMP v3 access:
• isi snmp settings modify --snmp-v3-access=yes
• To configure the security level, the authentication password and protocol, and
the privacy password and protocol:
289 The username is only required when SNMP v3 is enabled and making SNMP v3 queries.
In the WebUI, navigate to Cluster Management > General Settings > SNMP
Monitoring.
1: Download the MIB file you want to use and copy the MIB files to a directory where the SNMP tool can find them.
2: If your protocol is SNMPv2, ensure that the Allow SNMPv2 Access check box is selected.
4: In the SNMPv3 Read-Only User Name field, type the SNMPv3 security name to
change the name of the user with read-only privileges. The default read-only user
is general. In the SNMPv3 Read-Only Password field, type the new password for
the read-only user to set a new SNMPv3 authentication password.
On an unconfigured (default) system, SNMP is enabled cluster-wide and the I$ilonpublic community has read-only access.
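For example, from a management host with the net-snmp tools and the downloaded PowerScale MIBs installed, a read-only walk of a node might look like the following; the node IP is a placeholder and the enterprise OID shown is the Isilon/PowerScale tree as recalled, so confirm it against the MIB files.

snmpwalk -v 2c -c 'I$ilonpublic' 192.168.3.11 .1.3.6.1.4.1.12124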
Module Objectives
290A Job Engine job is a specific task, or family of tasks, intended to accomplish a
specific purpose. Jobs play a key role in data reprotection and balancing data
across the cluster, especially if the hardware fails or the cluster is reconfigured.
291 The parent process which runs on each node of the cluster.
292 Each job is broken down into work units which are handed off to nodes based
on node speed and workload. Every unit of work is tracked. That way, if you pause
a job, it can be restarted from where it last stopped.
Individual jobs are scheduled to run at certain times, started by an event such as a drive failure295, or started manually by the administrator.
All jobs have priorities. The most important jobs have the highest job priority, and
you should not modify them.
Jobs are given impact policies that define the maximum amount of usable cluster
resources.
Jobs run until completion. One job that holds up other jobs can affect job
operations296.
293To achieve this, it reduces a task into smaller work items and then allocates, or
maps, these portions of the overall job to multiple worker threads on each node.
Progress is tracked and reported on throughout job execution and a detailed report
and status is presented upon completion or termination.
294 This allows jobs to be paused and resumed, in addition to stopped and started.
295 For example, the FlexProtect job runs to reprotect the data when a hard drive
fails.
296 If contention occurs, examine which jobs are running, which jobs are queued, when the jobs started, and the job priority and impact policies for the jobs.
Each isi_job_d daemon manages the separate jobs that run on the cluster. The
daemons spawn off processes to perform jobs as necessary.
The isi_job_d daemons on each node communicate with each other to confirm
that actions are coordinated across the cluster. This communication ensures that
jobs are shared between nodes to keep the workload as evenly distributed as
possible.
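A few commonly used Job Engine commands illustrate the coordination described above; the job ID shown is an example value.

isi job status             # summary of running, paused, and queued jobs
isi job jobs list          # active job instances with their priority and impact policy
isi job jobs pause 273     # pause a running job by ID
isi job jobs resume 273    # resume it from the last checkpoint
isi job statistics view    # per-node worker thread counts and activity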
Job Categories
The Job Engine typically executes jobs as background tasks across the cluster,
using spare or especially reserved capacity and resources.
297 These jobs perform background file system maintenance, and typically require
access to all nodes. These jobs are required to run in default configurations, and
often in degraded cluster conditions. Examples include file system protection and
drive rebuilds. Although the file system maintenance jobs are run by default, either
on a schedule or in reaction to a particular file system event, any Job Engine Job
can be managed by configuring both its priority-level (in relation to other jobs) and
its impact policy.
• Jobs - Job Engine jobs often consist of several phases, each of which are
executed in a pre-defined sequence.
• Phase - A phase is one complete stage of a job. Jobs may have one or more
phases.
• Task - Each job phase is composed of a number of work chunks, or tasks
distributed around the cluster to be performed by each node individually.
• Item - A task produces individual work items, each of which runs as a parallel thread on a node.
• Item Results - Successful execution of a work item produces an item result.
• Checkpoints - Tasks and task results are written to disk, along with some
details about the job and phase, to provide a restart point.
298 The feature support jobs perform work that facilitates some extended storage management function, and typically only run when the feature has been configured. Examples include deduplication and anti-virus scanning.
299These jobs are run directly by the storage administrator to accomplish some
data management goal. Examples include parallel tree deletes and permissions
maintenance.
2: Phases enable better insight into the progress of a job at a high level, by
examining the job engine logs. Phases also enable better efficiency since multiple
parallel phase functions do not contend with each other for cluster resources. If an
error occurs in a phase, the job is marked as failed at the end of the phase and
does not progress. Each phase of a job must complete successfully before
advancing to the next stage or being marked as complete, returning a job state
Succeeded message. Each phase is run in turn, but the job is not finished until all
the phases are complete. Each phase is broken down into tasks.
3: A phase is started with one or more tasks that are created during job startup. All
remaining tasks are derived from those original tasks similar to the way a cell
divides. A single task does not split if one of the halves reduces to a unit less than
whatever makes up an item for the job. For example, if a task derived from a
restripe job has the configuration setting to a minimum of 100 logical inode number
(LINS), then that task does not split further if it derives two tasks, one of which
produces an item with fewer than 100 LINs. A LIN is the indexed information that is
associated with specific data.
The tasks are logically alike within each phase, since they address different parts of
the same phase’s role. An example would be checking the integrity of files on the
cluster-wide file system. Each task would cover a series of files, all performing the
same checks, and different cluster nodes would check different files. The results of
these parallel tasks are collated and amount to the total result of the phase.
4: A task which is given to a particular node for execution is not monolithic, but
consists of many work items. If the job is for file deduplication, and the phase is for
block comparisons, and the task is a series of blocks to traverse and compare, then
a single item would be a single block to examine, calculate and compare with other
known block values. This level is the bottom of the job management hierarchy.
Items are not further decomposed into any smaller components. The result of each
item execution is logged, so that if there is an interruption, the job can restart from
where it stopped.
5: Task status from the constituent nodes are consolidated and periodically written
to checkpoint files. These checkpoint files allow jobs to be paused and resumed,
either proactively, or in the event of a cluster outage. Job engine checkpoint files
are stored in results and tasks subdirectories under the path
/ifs/.ifsvar/modules/jobengine/cp/<job_id>/ for a given job.
Coordinator
Director
The Director process is responsible for monitoring, governing and overseeing all
job engine activity on a particular node, constantly waiting for instruction from the
coordinator to start a new job.
• Each node in the cluster has a job engine director process, which runs
continuously and independently in the background.
• Principle responsibilities include:
Manager
Worker
• If any task is available, each Worker is given a task. The Worker then
processes the task, item by item, until the task is complete or until the Manager
removes the task from the worker.
• Towards the end of a job phase, the number of active threads decreases300 as
Workers finish up their allotted work and become idle.
• Check status301 of nodes' Worker threads: isi job statistics view
Delegation Hierarchy302
Shared Work Distribution303
300Nodes which have completed their work items just remain idle, waiting for the
last remaining node to finish its work allocation. When all tasks are done, the job
phase is considered to be complete and the worker threads are terminated.
301In addition to the number of current worker threads per node, a sleep to work
(STW) ratio average is also provided, giving an indication of the worker thread
activity level on the node.
303Once the work is initially allocated, the job engine uses a shared work
distribution model in order to execute the work, and each job is identified by a
unique job identification number.
Central Coordination304
Other Threads305
1: The job daemons elect a Coordinator by racing to lock a file. The node that first
locks the file becomes the Coordinator. Racing is an approximate way of choosing
the least busy node as the Coordinator. If the Coordinator node goes offline and
the lock is released, the next node in line becomes the new Coordinator.
While the actual work item allocation is managed by the individual nodes, the
Coordinator node takes control, divides up the job, and evenly distributes the
resulting tasks across the nodes in the cluster. For example, if the Coordinator
304A job’s workload is delegated from a central coordinator to spread it out across
the cluster, thus avoiding choking any one node.
305 There are other threads which are not displayed in the graphic. They relate to
internal functions, such as communication between daemons, and collection of
statistics. Here we focus on the operational components which perform the jobs.
2: The Director runs on each node, communicates with the job Coordinator, and
coordinates tasks with the Managers. When three jobs are running simultaneously,
each node has three Manager processes, each with its own number of Worker
threads. The Director process serves as a central point of contact for all the
Manager processes running on a node, and as a liaison with the Coordinator
process across nodes.
3: The Managers on each node coordinate and manage the tasks with the Workers
on their respective node. If three jobs run simultaneously, each node would have
three Manager processes, each with its own number of Worker threads. Managers
request and exchange work with each other and supervise the Worker processes
they assign. Under direction from the Coordinator and Director, a Manager process
maintains the appropriate number of active threads for a configured impact level,
and for the current activity level of a node.
4: The job daemon uses Worker threads to enable it to run multiple tasks
simultaneously. A thread is the processing of a single command by the CPU. The
Coordinator tells each node job daemon what the impact policy of the job is, and
how many threads should be started to complete the job. Each thread handles its
task one item at a time, and the threads operate in parallel. The number of threads
determines the number of items being processed. The maximum number of
assigned threads manages the defined impact level and the load that is placed on
any one node. It is possible to run enough threads on a node that they can conflict
with each other. An example would be five threads all trying to read data off the
same hard drive. Since serving each thread at once cannot be done, threads are
queued and wait for each other to complete.
306 Concurrent job execution is governed by job priority, exclusion sets, and cluster health.
307 OneFS protects data by writing file blocks across multiple drives on different
nodes. This process is known as ‘restriping’ in the OneFS lexicon. The Job Engine
defines a restripe exclusion set that contains these jobs that involve file system
management, protection and on-disk layout. The restriping exclusion set is per-
phase instead of per job. This helps to more efficiently parallelize restripe jobs
when they don't need to lock down resources. The jobs with restriping phases often
have other no restriping phases as part of the job. For these jobs, when the
restriping phases are not running, other jobs with restriping phases can run. If two
jobs happen to reach their restriping phases simultaneously and the jobs have
different priorities, the higher priority job will continue to run, and the other will
pause. If the two jobs have the same priority, the one already in its restriping phase
will continue to run, and the one newly entering its restriping phase will pause.
308 OneFS marks blocks that are actually in use by the file system. IntegrityScan,
for example, traverses the live file system, marking every block of every LIN in the
cluster to proactively detect and resolve any issues with the structure of data in a
cluster. Multiple jobs from the same exclusion set will not run at the same time. For
example, Collect and IntegrityScan cannot be executed simultaneously, as they are
both members of the marking jobs exclusion set. Similarly, MediaScan and
SetProtectPlus won’t run concurrently, as they are both part of the restripe
exclusion set.
• A job is not required to be part of any exclusion set309, and jobs may also
belong to multiple exclusion sets310.
• Multiple restripe or mark job phases cannot safely and securely run
simultaneously311 without interfering with each other or risking data corruption.
• Job Engine exclusion sets are predefined and cannot be modified or
reconfigured.
309 The majority of the jobs do not belong to an exclusion set. These are typically feature support jobs that can coexist and contend with any of the other jobs.
310 MultiScan is both a restripe job and a mark job. When MultiScan runs, no additional restripe or mark job phases are permitted to run.
311 Up to three jobs can run simultaneously. However, the Job Engine restricts concurrent execution to only one restripe category job phase and one mark category job phase at a time.
Restriping jobs only block each other when the current phase may perform
restriping.
• Two restripe jobs, MediaScan and AutoBalanceLin, are both running their
respective first job phases.
• AutoBalanceLin restripes in its first phase, causing ShadowStoreProtect, also a restriping job, to be in a waiting state.
• MediaScan restripes in phases 3 and 5 of the job, only if there are disk errors
(ECCs) which require data reprotection.
• If MediaScan reaches phase 3 with ECCs, it will pause until AutoBalanceLin is
no longer running. However, if MediaScan's priority were in the range 1-3, it
would cause AutoBalanceLin to pause instead.
• The Job Engine enters a low space mode when it sees the available space on one or more disk pools fall below a low space threshold; it then regards the cluster as running out of space.
• Low space mode enables jobs that free space (space saving jobs) to run before
the Job Engine or even the cluster become unusable.
• When available space reaches the high threshold, the Job Engine exits the low
space mode and resumes the jobs that were paused.
The graphic shows disk pools that have varying amounts of free capacity. One of the disk pools
crosses the low space threshold.
Job Priority
• Job priorities determine which job takes precedence when more than three jobs
of different exclusion sets attempt to run simultaneously.
• The Job Engine assigns a priority value between 1 and 10 to every job, with 1
being the most important and 10 being the least important.
• Job priorities are configurable312 by the cluster administrator. The default priority
settings are recommended.
• Higher priority jobs always cause lower-priority jobs of the same exclusion set to
be paused.313
312 Job priority can be changed either permanently or during a manual execution of a job. A new job does not interrupt a running job when the jobs have the same priority. It is possible to have a low impact, high priority job, or a high impact, low priority job. In the Job Engine, jobs from similar exclusion sets are queued when conflicting phases run. Changing the priority of a job can have negative effects on the cluster; job priority is a trade-off of relative importance, and historically, many issues have been created by changing job priorities. Job priorities should remain at their defaults unless a senior level support engineer instructs you to change them.
313 The maximum number of jobs that can run simultaneously is three. If a fourth
job with a higher priority is started, either manually or through a system event, the
Job Engine pauses one of the lower-priority jobs that is currently running. The Job
Engine places the paused job into a priority queue, and automatically resumes the
paused job when one of the other jobs is completed. If two jobs of the same priority
level are scheduled to run simultaneously, and two other higher priority jobs are
already running, the job that is placed into the queue first is run first.
The graphic shows the timeline for a scenario where different jobs are run. The number on the job
name indicates the job priority for the respective job.
• Job Engine uses impact policies that you can manage to control when a job
runs and the system resources that it consumes314.
• Job Impact Policy is a reflection of the impact caused by jobs on available CPU
and I/O resources315.
• Job Engine has four default impact policies that you can use but not modify.
315If you want to specify other than a default impact policy for a job, you can create
a custom policy with new settings. Jobs with a low impact policy have the least
impact on available CPU and disk I/O resources. Jobs with a high impact policy
have a significantly higher impact. In all cases, however, the Job Engine uses CPU
and disk throttling algorithms to ensure that tasks that you initiate manually, and
other I/O tasks not related to the Job Engine, receive a higher priority.
316By default, most jobs have the LOW impact policy, which has a minimum impact
on the cluster resources.
317 More time-sensitive jobs have a MEDIUM impact policy. These jobs have a
higher urgency of completion that is typically related to data protection or data
integrity concerns.
318The use of the HIGH impact policy is discouraged because it can affect cluster
stability. HIGH impact policy use can cause contention for cluster resources and
locks that can result in higher error rates and negatively impact job performance.
319The OFF_HOURS impact policy enables greater control of when jobs run,
minimizing the impact on the cluster and providing the resources to handle
workflows.
Important: Increasing job impact policy does not always make a job
complete faster because it may be constrained by disk I/O or other
running cluster processes.
• An impact policy can consist of one or many impact intervals, which are blocks
of time within a given week.
• Each impact interval can be configured to use a single impact level320.
• The available impact levels are: Paused, Low, Medium, and High.
320Impact levels are predefined and they specify the amount of cluster resources to
use for a particular cluster operation. A job's priority and impact level are
independent of each other.
322Each job cannot exceed the impact levels set for it, and the aggregate impact
level cannot exceed the highest level of the individual jobs.
Scenarios showing the overall cluster impact when jobs with different impact levels run at the same
time on the cluster.
• Job Engine throttling limits the resources available to jobs, thereby limiting the rate at which jobs can run.
• Certain jobs, if left unchecked, could consume vast quantities of a cluster's resources, contending with and impacting client I/O.
• Throttling is employed at a per-manager process level, so job impact can be managed both granularly and gracefully.
• The coordinator process gathers cluster CPU and individual disk I/O load data every 20 seconds from all cluster nodes:
− It decides the number of threads323 running on each node to service each job.
323This can be a fractional number, and fractional thread counts are achieved by
having a thread sleep for a given percentage of each second.
The most common Job Engine jobs can be broken into different types of use.
324Using this CPU and disk I/O load data, every sixty seconds the coordinator
evaluates how busy the various nodes are and makes a job throttling decision,
instructing the various job engine processes as to the action they must take. This
enables throttling to be sensitive to workloads in which CPU and disk I/O load
metrics yield different results. Also, there are separate load thresholds tailored to
the different classes of drives used in OneFS powered clusters, including high
speed SAS drives, lower performance SATA disks and SSDs.
325 If little client activity is occurring, more worker threads are spun up to allow more
work, up to a predefined worker limit. For example, the worker limit for a low-impact
job might allow one or two threads per node to be allocated, a medium-impact job
from four to six threads, and a high-impact job a dozen or more. When this worker
limit is reached (or before, if client load triggers impact management thresholds
first), worker threads are throttled back or terminated. For example, a node has four
active threads, and the coordinator instructs it to cut back to three. The fourth
thread is allowed to finish the individual work item it is currently processing, but
then quietly exit, even though the task as a whole might not be finished. A restart
checkpoint is taken for the exiting worker thread’s remaining work, and this task is
returned to a pool of tasks requiring completion. This unassigned task is then
allocated to the next worker thread that requests a work assignment, and
processing continues from the restart check-point. This same mechanism applies
when multiple jobs are running simultaneously on a cluster.
Jobs do not always run in isolation; a job often works by calling other jobs to complete its task.
Jobs that are related to the distribution of the data on the cluster include:
Jobs that are related to testing the data integrity and protection include:
326Data integrity and data protection jobs can be further broken down into active
error detection and reprotection of the data. The active error detection includes jobs
that are often found running for long periods of time. The jobs run when no other
jobs are active and look primarily for errors on the drives or within the files.
The last category of jobs contains the jobs that are selectively run for specific
purposes.
These jobs may be scheduled; however, the administrator typically runs them only when they are required.
Job Performance
• Maintenance functions use system resources and can take hours or days to
run.
• Not all OneFS Job Engine jobs run equally fast.328
• The time that it takes for a job to run can vary depending on several factors,
including:
− Other system jobs that are running
328 Some jobs can take a long time to complete. However, those jobs should be paused so that jobs of higher immediate importance can finish. Pausing and restarting is an example of the balance between job priorities that was considered when the default settings were determined. A job that runs through files progresses slower on a cluster with many small files than on a cluster with a few large files. Jobs that compare data across nodes (such as Dedupe) run slower when making many comparisons. Many factors play into this, and linear scaling is not always possible. If a job runs slowly, the first questions should aim to discover the context of the job.
− Other processes that are taking up CPU and I/O cycles while the job is
running
− Cluster configuration
− Dataset size
− Job access method329
− Time since the last iteration of the job
• Job results are merged and delivered in batches330 when multiple jobs are running simultaneously.
329The specific access method influences the run time of a job. For instance, some
jobs are unaffected by cluster size, others slow down or accelerate with the more
nodes a cluster has, and some are highly influenced by file counts and directory
depths.
330 On large clusters with multiple jobs running at high impact, the job coordinator
can become bombarded by the volume of task results being sent directly from the
worker threads. In OneFS 8.2 and later, this is mitigated by certain jobs performing
intermediate merging of results on individual nodes and batching delivery of their
results to the coordinator.
• The cluster health depends on the Job Engine and the configuration of jobs in relation to each other.
• Administrators can manage the Job Engine using the WebUI or the CLI.
− Manual Job Execution331
− Scheduled Job Execution332
− Proactive Job Execution333
331 The majority of the Job Engine jobs have no default schedule and can be manually started by a cluster administrator.
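For illustration, a minimal sketch of manual execution from the CLI (the job name and impact policy shown are examples; verify the available options for your OneFS release):
# isi job jobs start MediaScan --policy LOW
# isi job jobs list
# isi job jobs view <job-id>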
Job Operations
All job operations are managed by navigating to Cluster management > Job
operations page of the OneFS WebUI.
Job Summary
333The Job Engine can also initiate certain jobs on its own. For example, if the
SnapshotIQ process detects that a snapshot has been marked for deletion, it will
automatically queue a SnapshotDelete job.
334The Job Engine executes jobs in response to certain system event triggers. In
the case of a cluster group change, for example the addition or subtraction of a
node or drive, OneFS automatically informs the job engine, which responds by
starting a FlexProtect job. The coordinator notices that the group change includes a
newly-smart-failed device and then initiates a FlexProtect job in response.
335Job administration and execution can be controlled via the WebUI, the CLI, or
the OneFS RESTful platform API. For each of these control methods, additional
administrative security can be configured using RBAC. By restricting access via the
ISI_PRIV_JOB_ENGINE privilege, it is possible to allow only a sub-set of cluster
administrators to configure, manage and execute job engine functionality, as
desirable for the security requirements of a particular environment.
Job Types
336You can change the priority and impact policy of a running, waiting, or paused
job. When you update a job, only the current instance of the job runs with the
updated settings. The next instance of the job returns to the default settings for that
job.
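As a hedged sketch of updating the current instance of a job (the job ID and values are illustrative, and the option names are assumed from the isi job jobs command set):
# isi job jobs modify <job-id> --priority 3 --policy MEDIUM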
Job Reports
• View the report details and event associated for a completed job or job phase.
• View job history.
Job Events
• View the details of job events for each job or job phase.
• Jobs can be filtered based on job ID or job type.
Scenario
Schedule Job
Coordinator Node
• The isi job status command displays the running, paused, or queued jobs, and the status of the most recent jobs.
• The --verbose option adds failed and completed jobs to the list and gives
greater detail in the job summary.
• The output provides job-related cluster information, including identifying the
coordinator node and if any nodes are disconnected from the cluster.
• Running isi job status is a simple way to detect whether jobs are creating
the main performance bottleneck on the cluster.
The Job Engine provides detailed monitoring and statistics gathering, with insight
into jobs and job engine.
Various job engine-specific metrics are available via the OneFS CLI, including per
job disk usage.
For example, worker statistics and job level resource usage can be viewed using
the isi job statistics list command.
Also, the status of the Job Engine workers is available using the isi job
statistics view command.
The Coordinator assigns a Job ID for the entire cluster and the PID for the individual nodes.
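The commands referenced above can be combined into a quick monitoring pass, for example (output omitted):
# isi job status --verbose
# isi job statistics list
# isi job statistics view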
Job Reports
• If you suspect the combination of client workload and jobs are overworking the
cluster, examine the performance of the disks.
• The sysctl command can be used to view the number of operations queued per disk.
• When interpreting the results, keep in mind the type of connections337.
The example shows no latency from the sysctl hw.iosched command output.
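As a simple illustration (the exact counters published under hw.iosched depend on the OneFS release, so treat the output fields as something to verify on your cluster):
# sysctl hw.iosched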
Assess the activity of the Job Engine, including assessments of which jobs put
which kinds of loads on your cluster.
After assessment, establish whether the Job Engine is a meaningful source of load
for the cluster at all.
Sometimes it is; in other cases it is entirely negligible, but you cannot know with certainty unless you look at the job run times.
• Sometimes jobs take a long time to complete, or appear to be stuck.338
• Bring up the job engine log file at /var/log/isi_job_d.log and look for job engine
worker assignments.
• The log indicates when a worker completes a task and is assigned new tasks. If the situation is still unclear, the administrator should call support before taking action.
• Restarting a job may not be the best path as it has to start from the beginning of
the task and may not quickly return to the previous point.
• Another way to resolve the issue is by adjusting the impact policies.
338 Job system monitoring can reveal cases where a long running job is repeatedly
interrupted by higher priority jobs, making it appear to stay in the queue indefinitely
when in fact it is rarely getting a chance to run. If a job appears stuck, there may
still be activity going on that is difficult to see.
Worker reporting
Listed are the common areas to consider when addressing Job Engine related
issues:
• Misconfigured Jobs339
• Job History340
339 Misconfigured jobs can affect cluster operations. Examine how the jobs have been configured to run, how they have been running, and whether jobs are failing. Failed jobs can also be an indicator of other cluster issues. For example, many starts and restarts of the MultiScan or Collect jobs indicate group changes. Group changes occur when drives or nodes leave or join the cluster.
340The job events and operations summary either from the WebUI or the CLI is
useful for immediate history. Often an issue is recurring over time and can be more
easily spotted from the job history or job reports. For example, a high priority job
constantly pushes other jobs aside, but a less consistent queue backup can still
prevent features from properly operating. The issue can require deeper dives into
the job history to see what is not running, or is running only infrequently.
342 Impact level changes directly affect the job completion time and the cluster resources. For example, an administrator modified the LOW impact policy to have 0.1 maximum workers or threads per storage unit. The result was that no low impact job ever completed. The customer then changed the jobs with LOW impact policies to a MEDIUM impact policy. When the jobs ran, cluster performance was negatively impacted. After investigation, it emerged that the customer had made the changes to limit the impact during peak workflow hours. Restoring all settings to the system defaults fixed the issue. A custom schedule was then implemented by modifying the OFF_HOURS policy, which achieved the intended goal.
Limits
The Job Engine has certain limitations, some by design and others by the nature of
its role:
• Only three concurrent jobs can run on the Job Engine.
• Within the same exclusion set, higher priority jobs take precedence.
• Job scalability varies:
343 Some jobs naturally have a long lifespan. FSA, deduplication, and AutoBalance can all have a long active period. Administrators should carefully evaluate the circumstances before trying to stop the jobs on the assumption that they have somehow stopped responding. Some jobs should not be interfered with unless support directs you to do so; FlexProtect is the primary one, as this job re-protects data. The cluster monitors this job closely and alerts if there is a problem during this job. You may adjust between low and medium impact, but consult technical support before any additional action is taken.
344 For example, the Dedupe job slows in proportion to the product of storage pool size and the number of storage pools. Small clusters may run dedupe daily, medium clusters on weekends, and larger clusters monthly or quarterly.
Considerations
Best Practices
Challenge
Lab Assignment:
1) Perform the different job operations.
2) Understand Job Engine behavior due to factors such as exclusion sets
and job priority.
OneFS Services
OneFS Services
Module Objectives
Small file
(<=128 KB)
A small file is a file of at most one stripe unit in length, that is, 128 KB or less. OneFS does not break small files into smaller logical chunks.
OneFS uses forward error correction (FEC) to parity protect a file, resulting in high
levels of storage efficiency.
Small files are mirrored, so they have a larger on-disk footprint. With mirroring, OneFS makes copies of each file and distributes multiple instances of the entire protection unit across the cluster. The loss protection requirements of the requested protection determine the number of mirrored copies. If the workflow has millions of small files, the efficiency can become a significant issue. When FEC protection is calculated, it is calculated at the 8 KB block level. If there is only one 8 KB block to use in the calculation, the result is a mirror of the original data block. The requested protection level determines the number of mirrored blocks.
Many archive datasets are moving away from large file formats such as tar and .zip
files to storing smaller files individually, allowing in-place analytics. To address the
trend, OneFS uses Small File Storage Efficiency or SFSE. SFSE maximizes the
cluster capacity by decreasing the storage that is required for small file archive
data.
File Sizes
64 KB file
The table shows a 64 KB file with the protection level set at N+2d:1n. The
protection level results in a 3x mirror. The result is that the 64 KB file consumes
192 KB of storage.
345Since small files are a single stripe unit and not related to other stripe units,
there is no minimum read or write cache benefits. The use of L3 cache can improve
chances of gaining a cache benefit for repeat random reads. In other words, the
same small read multiple times could benefit from L3 cache.
346If the workflow is predominantly small files, setting the access pattern to random
can reduce using unnecessary cluster resource when predicting cache data.
347If the workflow data is going to be all small files, CPU resources can be saved
by setting the requested protection level as mirrored protection.
176 KB file
Files not evenly divisible by 128 KB result in some blocks being mirrored rather than FEC protected348.
348Not all 8 KB blocks have a corresponding block in the second data stripe to
calculate FEC against.
The file has one 128 KB stripe unit and one 48 KB stripe unit.
The first six 8 KB blocks of each stripe unit calculate FEC.
The remaining ten 8 KB blocks have mirrored protection.
The ten unused blocks of data stripe 2 are free for use in the next stripe unit349.
The 176 KB file has little caching benefits350.
Consume 8 KB block
Assuming GNA is not enabled, there are three 512 B metadata blocks per file for
this example, or 1.5 KB. So the total space is 25.5 KB for the file on disk.
All files 128 KB or less are mirrored. For a protection strategy of N+1, the 128 KB file is mirrored: the original data and a copy.
349 The stripe unit is not padded, and the capacity is not wasted.
350L3 cache recognizes this file size and enables repeat random read caching.
Setting a random access pattern may be appropriate depending on the workflow.
The example demonstrates how all files 128 KB or less are mirrored. Shown is a four node cluster. For a protection strategy of N+1, the 128 KB file has a 2x mirror: the original data and a mirrored copy. FEC is still calculated on files less than or equal to 128 KB, but the result is a copy.
Calculating Space for Small File: 8 KB is the minimum block size used. 8 KB was
chosen for storage efficiencies and is the optimal size for most PowerScale
workflows. Any file or portion of a file less than 8 KB consumes an 8 KB block. So a
4 KB file consumes one 8 KB block. A 12 KB file consumes two 8 KB blocks. A 24
KB file consumes three 8 KB blocks. A 4 KB file with N+2d:1n requested protection
level has 8 KB for the data, and two 8 KB mirrors, totaling 24 KB.
• Even with the protection and metadata overhead, the consumed space is small.
• SFSE can minimize the potential impact351.
With one million 24 KB files at a requested protection of N+2d:1n, the file data,
protection overhead, and metadata is about 70.09 GB.
In the given example, in the small files scenario it takes about 70 GB to hold 1 million files of 24 KB each (24 GB of data), so the efficiency ratio is 2.92:1 (that is, it takes 2.92 storage units to store 1 real storage unit). Compared to the large files scenario, where it takes 1.35 GB to hold 1.2 GB, the efficiency ratio is 1.13:1.
A 1.5 hour YouTube video at 1080p averages approximately 1.2 GB per file before
protection and metadata. With protection and metadata overhead, the file is about
1.35 GB. So, one million small files consume about the same space as 52 YouTube videos. It takes only a few large files to equal the consumption of many small files.
You can improve the storage efficiency for small files by using inline data reduction. OneFS inline data reduction combines both real-time compression and inline deduplication.
351 OneFS small file usage may not be highly efficient, but there is not a large
impact. One method is to analyze data in three categories: The number of small
files and the average file size, the number of large files and average file size, and
the number of all other or medium files and average file size. The idea is to look at
all workflows and not just the workflow with many small files.
• The primary purpose of OneFS inline data reduction is to reduce the storage
requirements for data, resulting in a smaller storage footprint.
• Inline data reduction also helps to shrink the total amount of physical data that
is written to storage devices.
1 2 3
1: The inline data reduction zero block removal phase detects blocks that contain
only zeros and prevents them from being written to disk. This phase reduces the
disk space requirements and avoids unnecessary writes to SSD, resulting in
increased drive longevity.
Zero block removal occurs first in the OneFS inline data reduction process. As
such, it has the potential to reduce the amount of work that both inline
352 Compression uses a lossless algorithm to reduce the physical size of data when
it is written to disk and decompresses the data when it is read back. More
specifically, lossless compression reduces the number of bits in each file by
identifying and reducing or eliminating statistical redundancy. No information is lost
in lossless compression, and a file can easily be decompressed to its original form.
353 Deduplication differs from data compression in that it eliminates duplicate copies
of repeating data. Whereas compression algorithms identify redundant data inside
individual files and encode the redundant data more efficiently. Deduplication
inspects data and identifies sections, or even entire files, that are identical, and
replaces them with a shared copy.
deduplication and compression must perform. The check for zero data does incur
some overhead. However, for blocks that contain nonzero data the check is
terminated on the first nonzero data that is found, which helps to minimize the
impact.
The following characteristics are required for zero block removal to occur:
The write converts the block to sparse if it is not already sparse. A partial block of zeroes that is written to a non-sparse, non-preallocated block is not zero eliminated.
When a client writes a file to a node pool configured for inline deduplication on a
cluster, the write operation is divided up into whole 8 KB blocks. Each of these
blocks is then hashed and its cryptographic fingerprint compared against an in-
memory index for a match. One of the following operations occurs:
3: When a file is written to OneFS using inline data compression, the logical space of the file is divided up into equal sized chunks that are called compression chunks.
• This is optimal since 128 KB is the same chunk size that OneFS uses for its
data protection stripe units.
• It provides simplicity and efficiency, by avoiding the overhead of additional
chunk packing.
If OneFS SmartDedupe is also licensed and running on the cluster, this data
reduction savings value reflects a combination of compression, inline deduplication,
and postprocess deduplication savings. If both inline compression and
deduplication are enabled on a cluster, zero block removal is performed first,
followed by deduplication, and then compression. This order allows each phase to reduce the scope of work for each subsequent phase.
After compression, this chunk is reduced from sixteen to six 8KB blocks in size.
This means that this chunk is now physically 48 KB in size. OneFS provides a
transparent logical overlay to the physical attributes.
This overlay describes whether the backing data is compressed or not and which
blocks in the chunk are physical or sparse, such that file system consumers are
unaffected by compression.
Compression chunks never cross node pools. This avoids the need to decompress
or recompress data to change protection levels, perform recovered writes, or shift
protection-group boundaries.
OneFS provides six principal reporting methods for obtaining efficiency information with inline data reduction.
The most comprehensive of the data reduction reporting CLI utilities is the isi statistics data-reduction command.
The recent writes data to the left of the output provides precise statistics for the
five-minute period before running the command.
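For example, the report can be generated from any node as shown below (output omitted):
# isi statistics data-reduction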
isi compression
The isi compression stats command also accepts the list argument, which
consolidates a series of recent reports into a list of the compression activity across
the file system.
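A brief sketch, assuming the view and list actions available in recent OneFS releases:
# isi compression stats view
# isi compression stats list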
isi dedupe
The isi dedupe stats command provides cluster deduplication data usage and
savings statistics, in both logical and physical terms.
The isi dedupe stats output reflects the sum of both in-line dedupe and
SmartDedupe efficiency.
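For example:
# isi dedupe stats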
isi get -O
The addition of a -O logical overlay flag to isi get for viewing compression
details of a file.
The logical overlay information is described under the protection groups output.
The example in the graphic shows a compressed file where the sixteen-block chunk is compressed down to six physical blocks (#6) and ten sparse blocks (#10).
Under the Metatree logical blocks section, a breakdown of the block types and
their respective quantities in the file is displayed - including a count of compressed
blocks.
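A minimal sketch of the command (the file path is hypothetical):
# isi get -O /ifs/data/archive/file.bin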
In OneFS 8.2.1 and later, OneFS SmartQuotas has been enhanced to report the
capacity saving from in-line data reduction as a storage efficiency ratio354.
On a cluster with licensed and configured SmartQuotas, the efficiency ratio can be
easily viewed from the WebUI or using the isi quota quotas list CLI
command.
In OneFS 8.2.1 and later, the OneFS WebUI cluster dashboard now displays a
storage efficiency tile, which shows physical and logical space utilization
histograms and reports the capacity saving from in-line data reduction as a storage
efficiency ratio.
Graphics shows OneFS WebUI Cluster Status Dashboard – Storage Efficiency Summary
The cluster data reduction metrics on the right of the output are slightly less real
time but reflect the overall data and efficiencies across the cluster. This metric is
designated by the Est. prefix, denoting an estimated value.
The ratio data in each column is calculated from the values above it. For example,
to calculate the data reduction ratio, the logical data (effective) is divided by the
preprotected physical (usable) value. The calculated data Reduction ratio is
1.76:1 (339.50 / 192.87 = 1.76).
• Storage consolidation creates datasets with mixed file sizes, reducing total
storage overhead.
• Analyze the full distribution of small and large files, not just the average; an average file size calculation yields a higher storage overhead estimate.
355The SFSE estimation tool can anticipate the expected savings from the SFSE
feature. SFSE targets workflows with files less than 1 MB in size.
356 Improvements in storage efficiency are achieved by packing multiple small files
into shadow stores called containers.
Tip: See Small File Storage Efficiency for Archive to learn more about SFSE for archive workloads.
When files with shadow references are deleted, truncated, or overwritten, it can leave unreferenced blocks in the shadow stores. These blocks are later freed and result in holes, which cause fragmentation and reduce the storage efficiency. Reclaiming this space is the problem that the defragmentation tool, or defragmenter, solves.
The shadow store defragmenter helps expand the SFSE feature for archive-type workloads. To improve storage efficiency, the defragmenter reduces the fragmentation caused by overwrites and deletes of containerized files, which impacts both file read performance and storage efficiency. In OneFS 8.2, the ShadowStoreDelete job runs on a daily schedule
instead of a weekly schedule. The defragmenter divides each shadow store up into
logical chunks and assesses each chunk for fragmentation. If the current storage
efficiency of each chunk is below a target efficiency, then OneFS moves the chunk
to another shadow store location. The default target efficiency is 90% of the
maximum storage efficiency available with the protection level on the shadow store.
Larger protection group sizes can tolerate a higher level of fragmentation before
the target efficiency drops below this threshold.
Prerequisites
Before considering to enable SFSE, ensure that the following prerequisites are
met:
• Archive dataset357
• Small files358
• SmartPools license359
• OneFS 9.0.0 or later360
Enable SFSE
Use the CLI to enable packing. If needed, use the isi_packing command to configure the maximum file size value instead of defining it in the file pool policy. Also, consider the minimum age for packing (--min-age <seconds>) when configuring.
Example:
# isi_packing --enabled=true
358Most of the archive consists of small files. By default, the threshold target file
size is from 0 MB to 1 MB.
Configure using a file pool policy. Use a path-based file pool policy, where possible,
rather than complex file pool filtering logic. The default minimum age for packing is
one day. Due to the work to pack the files, the first SmartPools job may take a
relatively long time, but subsequent runs should be much faster.
Example create:
# isi filepool policies create fin_arc --enable-packing=true --begin-filter --path=/ifs/finance/archive --end-filter
Example verify:
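A likely verification step (a sketch, assuming the policy name used in the create example) is to view the policy and confirm that packing is enabled:
# isi filepool policies view fin_arc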
SmartPools Job
The SmartPools job containerizes the small files in the background. You can run the job on demand or using the nightly schedule. Data in a snapshot is not packed; SFSE only containerizes HEAD file data. A threshold prevents very recently modified files from being containerized. The SmartPoolsTree job, isi filepool apply, and isi set can also pack files.
Example:
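One way to run the job on demand (a sketch; the job can also be left to its normal schedule):
# isi job jobs start SmartPools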
Unpacking
Example:
Job Report
The commands that report on the status and effect of small file efficiency are isi
job and isi_packing --fsa.
Monitoring
The example in the graphic uses a 4+2 protection level, so the maximum efficiency is 0.66, or 66%.
361The fragmentation score is the ratio of holes in the data where FEC is required.
Fully sparse stripes do not need FEC so are not included. Lower fragmentation
scores are better.
362Efficiency is the ratio of logical data blocks to total physical blocks, including
protection overhead. The protection group layout limits the maximum efficiency.
Estimation tool
The example shows that running the command on a newly installed cluster does
not reflect a production system.
Enable defragmenter
Output terminology:
• BSINs 363
• CSINs364
• Chunk size 365
• Target efficiency 366
• PG efficiency 367
• Snapshots368
Defrag tool
The ShadowStoreDelete job distributes the work across the cluster, whereas isi_sstore defrag runs on a single node with a single process. The command defaults the target efficiency and chunk size to the values in gconfig. You must explicitly specify the rest of the options on the command line regardless of the gconfig settings.
367PG efficiency causes more aggressive defragmentation if it can reduce the total
number of protection groups needed to store the shadow store data. Enabled by
default.
Healthcare PACS
In 2017, regulations stopped allowing larger containers of small files. This forces Hayden to use solutions that can handle smaller files with greater storage efficiency. Hayden sees that the use case for the storage efficiency for PACS feature is an archive scenario in the Healthcare PACS workflow. Physicians who require access to diagnostic imaging are causing a shift in the healthcare market. That shift is causing PACS and vendor-neutral archive (VNA) vendors to transition to noncontainerized studies.
There is a trade-off between storage efficiency and performance. The goal of small
file storage efficiency is to improve storage efficiency, which can affect
performance.
Packing
369 In the Healthcare vertical, one of the regulations changes how PACS applications must store their data. PACS applications can no longer create and store larger containers of small files.
• The shadow store for PACS isolates fragmentation and supports tiering and is
different from OneFS shadow stores.
• The shadow store for PACS is parity protected, which typically provides a
greater storage efficiency than mirroring.
The graphic shows traditional small files with 3x mirroring, illustrating inefficiency. The After packing
graphic shows the packing process scans for small files and packs them into a shadow store for
PACS.
Interoperability
Packed files are treated as normal files during failover and failback operations. If the PACS feature is enabled on the target cluster and the correct file pools policy is applied, files can be packed on the target cluster.
• SyncIQ: SyncIQ does not synchronize the file pools policies. Manually create
the correct file pools policies on the target cluster. Best practice is to enable
Storage Efficiency for PACS on the source cluster and on the target cluster.
Enabling both sides retains the benefits of storage efficiency on both clusters.
The Storage Efficiency for PACS feature enables you to store more small-file data, so plan your data replication accordingly. If syncing data between two equal size clusters and the PACS feature is enabled on only one side, the side that is not feature enabled could run out of space first. Running out of space blocks data replication.
• File clones and Deduplication: Interoperability is limited. Clones are not
optimized. Deduplication skips packed files. Cloned and deduplicated files
already have good storage efficiency without packing.
• InsightIQ: Packing, unpacking, and deduplication operations do not update the disk usage fields. InsightIQ cluster summary figures accurately display the used and free space. However, per directory and per file usage may not be accurate.
• CloudPools: Stubbed files are not packed, and packed files are not stubbed.
• SmartLock: Packing processes write once or read many (WORM) files as
regular files. WORM files are good candidates for packing. WORM files are
unlikely to cause fragmentation due to writing.
Migration
Migration
Overview
Migrating data
Migrating
authorization - ACLs,
POSIX
Considering pipe -
WAN, LAN, bandwidth
A data migration strategy ensures that the data is migrated with no loss of data
integrity.
An analysis of the source environment must consider many aspects from data
integrity to data protection to the impact of feature functionality to environmental
factors.
A thorough analysis should reduce the data migration risk or identify barriers to the
migration.
Migration Phases
1 2 3 4
1: Design a migration strategy that considers minimum risk and downtime. Perform
a detailed review of the migration source environment. Key aspects of the planning
phase are discovery and analysis of source infrastructure, the target data, and the
PowerScale cluster. Areas to consider are the mapping on the cluster, and features
such as access zone, quotas, snapshots, backups, and deduplication.
2: Review, validate, and test the strategy. Testing the migration is typically run on a
subset of data. The test should provide insight into the performance, timing, and
validity of data. Validity of data includes accessibility, permission models, and
workflow function. The test results should meet the requirements of the migration
strategy.
3: Running the migration typically requires an initial full copy and then incremental
updates. The "first full, then incremental" approach eases the migration cutover.
The cutover may involve halting writes on the source data, a final incremental copy,
and moving client and application connectivity.
4: Validate the migration once the cutover is complete. Validate the access and
data before enabling writes to the data. Once clients and applications write and
modify data, a rollback becomes difficult. In the validation phase, monitor the
cluster, client access, performance, and feature functionality. Ideally, the post-
migration phase has no issues.
Planning Phase
In the planning phase, evaluate the infrastructure of the source system, network
architecture, and the network paths between the source data and the PowerScale
cluster.
Determine how the source data maps to the target end state on the cluster.
Testing Phase
Validate the
performance
The testing phase validates the migration strategy. The test should meet the
requirements of the strategy. The test enables you to tune and explore the use of
alternate migration tools, settings, and methods.
Migrating Phase
Rollback Cutover
Validating Phase
Monitor the PowerScale and OneFS to ensure that all expectations are met. Re-
implement features such as quotas and snapshots if needed.
Migration Tools
The table lists the common tools used to migrate data to the PowerScale cluster.
Tool Notes
370isi_vol_copy supports data migration using NDMP. The tool enables the cluster
to mimic the behavior of the source system. The tool copies data from the source
system to the cluster.
371 EMCOPY copies files, directories, and subdirectories from SMB shares to other SMB shares with the security and attributes intact.
Resource: https://www.dellemc.com/resources/en-us/asset/white-
papers/products/storage/h15756-netapp-to-onefs-migration-tools.pdf
373 The rsync tool copies files, directories, and subdirectories from one NFS export
to another. You can run rsync natively on a PowerScale node against locally
mounted NFS source exports that are mounted directly on a node. Data is migrated
directly from the source system to the PowerScale.
Planning
Best practices:
• Use the correct version of the tool.
• Understand the EMCOPY switches374.
• Use mapped drives375.
374 Different migrations use different switches to meet the migration requirements.
• You can create a dedicated SMB migration share. Using a hidden ($) share will
hide the share from users browsing the SMB shares on the cluster.
• For extensive migrations, set restrictive share permissions to limit the migration
share access.
• This example creates new user shares. Configure the share permissions prior to the data migration.
• When creating the share, use the Do not change existing permissions
option.
Know the function of the command options. Get a full list of switches using the
emcopy.exe /? command.
376 Separates the migration traffic from the production traffic, allowing for maximum throughput and reducing the potential production impact.
377 The tool must have access to the source and target data.
The example migrates data from the R: directory to the mapped X: directory. The
command copies owner security, directories and subdirectories, and will retry on
files.
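A hedged sketch of such a command follows (the switches are illustrative only and should be confirmed against emcopy.exe /? for your EMCOPY version):
emcopy.exe R:\ X:\ /o /s /r:3 /w:5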
Validate
After the migration, validate the data and the file attributes. Ensure file data is
copied and intact. Verify file security, ownership, and attributes. Check that the
timestamps on the files are correct.
Migration
Once the test migration is validated, run the migration. First run a full copy and then incremental copies. For large and complex migrations, you may run many incremental copies before performing a cutover.
Cutover
Planning
Best practices:
• Use the correct version of the tool.
• Understand the rsync switches378.
• Use local paths379.
• Restrict access380.
378Understand the rsync switches and when and how to use them. Each migration
may require the use of different switches. Start with the baseline switches when
testing the migration.
379The rsync tool can operate in both a local and remote mode, and can push or
pull data.
380Restricting access prevents users from changing data on the source server
instead of the migration target.
Know the function of the command options. Get a full list of switches using the man
rsync command.
The example migrates data from the locally mounted file system /mnt/eng, to the
OneFS target /ifs/divgen/eng.
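A minimal sketch of such a run (baseline switches only; adjust them for your migration requirements):
# rsync -avh /mnt/eng/ /ifs/divgen/eng/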
381 The root_squash option prevents a remotely connected user from having root privileges. Root access is needed to migrate all the files and directories, so use the "no_root_squash" option on the source export.
Validate
After the migration, validate the data and the file attributes. Ensure file data is
copied and intact. Verify file security, ownership, and attributes. Check that the
timestamps on the files are correct.
During the test migration, monitor and benchmark performance. Knowing how long
an incremental copy takes will help determine the time needed to perform the
cutover and the outage window.
Migration
Once the test migration is validated, run the migration. First run a full copy and then incremental copies. For large and complex migrations, you may run many incremental copies before performing a cutover.
This example mounts the source data on the PowerScale node. For a large
migration, using multiple nodes scales bandwidth. Data moves from the source to
the cluster.
Cutover
Challenge
Lab Assignment: Migrate user data from an NFS file server to the PowerScale cluster.
1) Perform a NFS migration test.
2) Validate the migrated test data.
SmartQuotas Advanced
SmartQuotas Advanced
SmartQuotas Recap
SmartQuotas also facilitates thin provisioning, or the ability to present more storage capacity to applications and users than is physically present (overprovisioning).
Quota Types
Quota Types
• Accounting Quotas382
• Enforcement Quotas383
A SmartQuota can have one of four enforcement type settings: Hard, Soft, Advisory, or None.
There are three SmartQuotas enforcement states: Under (U), Over (O), Expired
(E).
QuotaScan Job
The QuotaScan job updates quota accounting for domains created on an existing
directory path.
383Enforcement Quotas on the other hand include all of the functionality of the
accounting option plus the ability to limit disk storage and send notifications. Using
enforcement limits, you can logically partition a cluster to control or restrict how
much storage that a user, group, or directory can use.
The QuotaScan job is the cluster maintenance process responsible for scanning the cluster and performing accounting activities to bring the desired governance to each inode.
Quotas Daemons
There are three main processes or daemons that are associated with
SmartQuotas:
384 Although it is typically run without any intervention, the administrator has the option of manual control if necessary or desirable. By default, QuotaScan runs with a 'low' impact policy and a low priority value of '6'.
OneFS 8.2 and later also include the rpc.quotad service to facilitate client-side quota reporting on UNIX and Linux clients using native quota tools. The service, which runs on TCP/UDP port 762, is enabled by default, and control is under NFS global settings.
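As an illustration from a UNIX or Linux NFS client (assuming the export is mounted and rpc.quotad is reachable; run as the user whose usage you want to check):
$ quota -s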
Also, in OneFS 8.2 and later, users can view their available user capacity as set by soft and/or hard user and group quotas, rather than the entire cluster capacity or parent directory quotas. This feature avoids the illusion of seeing available space that may not be associated with their quotas.
• Set up a system alert387 to notify the storage admin to add capacity (nodes)
when the 200 TB is 75% full.
• Scale out by adding additional capacity only when needed.
Quotas Report
Each generated report includes the quota domain definition, state, usage, and
global configuration settings388.
A quota report is a timestamped XML file that starts off with global configuration
settings and global notification rules.
387 Hayden may need to investigate. Chances are the 200 TB limit is a segment of the cluster capacity and not the entire cluster capacity. Hayden can also add nodes if users are reaching their limits. Chances are Hayden will notify the users to clean up their directories before making a big purchase.
388 Quota Notification Rules are read and inserted into a domain entry only if the
domain is not inherited. These rules are inserted to avoid any performance impact
of reading the Quota Notification Rules with each domain.
When listing domains, both inode & path and name & ID are stored with each
domain389.
Quota reports are managed by configuring settings that provide control over when reports are scheduled, how they are generated, where and how many are stored, and how they are viewed.
The maximum number of scheduled reports that are available for viewing in the
web-administration interface can be configured for each report type390.
You can create manual reports at any time to view the current state of the storage quotas system. These live reports can be saved manually. The OneFS CLI export functionality uses the same data generation and storage format as quota reporting. This functionality does not require anything beyond the three report types. After the collection of the raw reporting data, data summaries can be produced for a given set of filtering parameters and sorting type. Reports can be viewed from historical sampled data or a live system. In either case, the reports are views of usage data at a given time. SmartQuotas does not provide reports on aggregated data over time (that is, trending reports). However, a quota administrator can use the raw data to answer trending questions.
Quota Nesting
Nested quotas have multiple quotas within the same directory structure.
The isi quota quotas list command is used to compare the size of a quota
to the amount of data it holds.
Warning: If you set a threshold that is higher than the parent quota's hard threshold, the current threshold may effectively be ignored.
At the top of the hierarchy, the /ifs/sales folder has a directory quota of 1 TB. Any user can write data into this directory, up to a combined total of 1 TB. The /ifs/sales/sales-gen directory has a group quota assigned that restricts the total amount written into this directory to 1 GB, even though the parent directory (sales) is below its quota restriction. The /ifs/sales/sales-gen/MySales directory has a default user quota of 500 MB that restricts the capacity of this directory to 500 MB. The /ifs/sales/sales-gen/MySales/test1 directory has a user quota of 50 MB. The /ifs/sales/sales-gen/Example directory has a default user quota of 250 MB. The /ifs/sales/sales-gen/Example/test3 directory has a user quota of 100 MB. However, if users place 500 GB of data elsewhere under /ifs/sales, only 500 GB remains available for the other directories, because the parent directory quota cannot exceed 1 TB.
The quota configuration provides the option to include or exclude snapshot data and data-protection overhead in the usage calculation of a quota.
SmartQuotas reports only on snapshots that are created after the quota domain
was created391.
Deduplicated files appear no differently than regular files to standard quota policies.
391 Determining quota governance (including QuotaScan job) for existing snapshots
is a time and resource consuming operation. However, as snapshots age out,
SmartQuotas gradually accrues accounting information for the entire set of relevant
snapshots.
SmartQuota reports efficiency as a ratio across the desired data set as specified in
the quota path field.
The compression efficiency ratio is for the full quota directory and its contents,
including any overhead, and reflects the net efficiency of compression.
SyncIQ Failover and Failback does not replicate cluster configurations such as
SMB shares and NFS exports, quotas, snapshots, and networking settings, from
the source cluster.
392During replication SyncIQ ignores quota limits. However, if a quota is over limit,
quotas still prevent users from writing additional data. Ideally, whatever quotas are
set in the SyncIQ domain, the administrator should configure a quota domain on
the target directory with the same quota settings.
393 Instead, the SyncIQ operation fails, as opposed to deleting an existing quota. This may occur during an initial sync where the target directory has an existing quota under it, or if a source directory that has a quota on it on the target is deleted. The quota still remains and requires administrative removal if desired.
Application logical quotas, available in OneFS 8.2 and later, provide a quota accounting metric that accounts for, reports on, and enforces against the actual space consumed.
Quota Management
Scenario: A semiconductor company uses a large HPC compute cluster for parts
of their EDA workflow, and wants to guard against runaway jobs from consuming
massive amounts of storage. The company runs heavy computation jobs from a
large compute farm against a scratch space directory, housed on an F200 tier on
their cluster, and garbage collection is run at midnight.
Throughout the workday, it is hard for the storage admins to keep track of storage utilization. Occasionally, jobs from the compute farm get out of control, tying up large swathes of fast, expensive storage resources and capacity. What should be done to help prevent this?
• Set an advisory directory quota on the scratch space at 80% utilization for
advanced warning of an issue.
• Configure a hard directory quota to prevent writes at 90% utilization.
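A hedged sketch of creating such a quota from the CLI (the path and thresholds are hypothetical placeholders for the 80% and 90% marks of the scratch tier; verify option names with isi quota quotas create --help):
# isi quota quotas create /ifs/scratch directory --advisory-threshold 8T --hard-threshold 9T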
Considerations
− Increased from 20,000 quota limits per cluster to 500,000 quota limits per
cluster.
Link: For more information see the Storage Quota Management And
Provisioning With Dell EMC PowerScale SmartQuotas white paper.
Challenge
Lab Assignment:
1) Investigate and troubleshoot misconfigured quotas.
2) Add a notification rule using custom email templates.
3) Configure and monitor quota reports.
SnapshotIQ Advanced
SnapshotIQ Advanced
SnapshotIQ Overview
1
2
1: Only the changed blocks of a file are stored in a snapshot, thereby ensuring highly efficient storage capacity utilization. User access to the available snapshots is via a special hidden '.snapshot' directory under each file system directory.
394OneFS snapshots are used to protect data against accidental deletion and
modification. Because snapshots are available locally, users can restore their data
without administrative intervention.
395 Some OneFS operations generate snapshots for internal system use without
requiring a SnapshotIQ license. If an application generates a snapshot and a SnapshotIQ license is not configured, the snapshot can still be accessed. However,
all snapshots that OneFS operations generate are automatically deleted when no
longer needed. You can disable or enable SnapshotIQ at any time. Note that you
can create clones on the cluster using the "cp" command, which does not require a
SnapshotIQ license.
Architecture
• Directory based397
• Logical snapshot process398
• Snapshot space allocation399
397 OneFS snapshots are per-directory based. This is in contrast to the traditional approach where snapshots are taken at a file system or volume boundary.
398 Since OneFS manages and protects data at the file-level, there is no inherent
block-level indirection layer for snapshots to use. Instead, OneFS takes copies of
files or pieces of files (logical blocks and inodes) in what’s termed a logical
snapshot process.
Snapshot Tracking Files (STF)400 are the main data structures associated with a
snapshot. A snapshot tracking file has three major purposes:
In the given example, the snapshot 0900_snap is tracked, and the accounting information for the changed blocks is given in the box below.
Snapshot Management
When the data is not in the snapshot, the block tree of the inode on the snapshot
doesn’t point to a real data block. Instead it has a flag marking it as a Ditto Block401.
400 STFs are a special file type with several unique characteristics, and are involved
in the full snapshot life cycle, including the creation, storing any changes, and
deletion of snapshots.
401 A Ditto-block means that the data is the same as the next newer version of the
file, so OneFS will automatically look ahead to find the newer version of the block.
In this case, blocks 3 and 4 were changed after the first snapshot (Snap_ID 98) was taken and before the second (Snap_ID 100), and blocks 0 and 4 were changed after the second snapshot was taken.
Painting Algorithm
When a file is written to, the system needs to do a small amount of work to
determine if the file is part of a snapshot. If so, a copy of the old data needs to be
kept. This is done via a process known as the painting algorithm402.
Snapshot Domains
End-User
/ifs/data/foo/bar.txt
/ifs/data/foo/.snapshot/0900_snap/bar.txt
NFS and SMB users can view and recover data from OneFS snapshots, with the
appropriate access credentials and permissions.
403 A user accidentally deletes a file '/ifs/data/foo/bar.txt' at 9:10 am and notices it is gone a couple of minutes later. By accessing the 9:00 am snapshot, the user is able to recover the deleted file themselves at 9:14 am by copying it directly from the snapshot directory '/ifs/data/foo/.snapshot/0900_snap/bar.txt' back to its original location at '/ifs/data/foo/bar.txt'.
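As a sketch of that self-service restore, using the paths from the example above:
# cp /ifs/data/foo/.snapshot/0900_snap/bar.txt /ifs/data/foo/bar.txt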
2: Indicates the total number of snapshots that were deleted on the cluster since
the last snapshot delete job was run. The space consumed by the deleted
snapshots is not freed until the snapshot delete job is run again.
3: Indicates the total number of snapshot aliases that exist on the cluster.
Managing Snapshots
Delete Snapshot
Yes, the snapshot can be deleted if required. OneFS frees disk space occupied by
deleted snapshots when the SnapshotDelete job is run. Also, if you delete a
snapshot that contains clones or cloned files, data in a shadow store might no
longer be referenced by files on the cluster.
The name and expiration date of a snapshot can be modified by running the isi snapshot snapshots modify <snapshot> {--name <name> | --expires {<timestamp> | <duration>} | --clear-expires | --alias <name>}... [--verbose] command.
A snapshot alias can be reassigned, to redirect clients from a snapshot to the live
file system.
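For example, using the syntax above (the snapshot name and new values are illustrative only), a snapshot could be renamed with its expiration cleared, or given an alias:

isi snapshot snapshots modify 0900_snap --name 0900_snap_keep --clear-expires
isi snapshot snapshots modify 0900_snap_keep --alias daily_latest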
Restoring Snapshots
Revert a Snapshot
CLI command to restore a snapshot: isi job jobs start <type> [--dm-type {snaprevert | synciq}]
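A minimal sketch of a CLI-driven revert, assuming a SnapRevert domain must first be created for the directory and using a hypothetical snapshot ID of 98 (verify the job names and options against the CLI Command Reference Guide for your release):

isi job jobs start domainmark --root /ifs/data/foo --dm-type snaprevert
isi job jobs start snaprevert --snapid 98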
The WebUI lists the associated snapshots with their modification times and provides the snapshot options.
Navigate to the directory that you want to restore or the directory that contains the
file that you want to restore. If the directory has been deleted, you must recreate
the directory.
A file or directory can also be restored from a snapshot through the CLI.
Clone a file from the snapshot by running the cp command with the -c option.
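For instance, assuming the 0900_snap snapshot from the earlier example, a writable clone could be created alongside the original:

cp -c /ifs/data/foo/.snapshot/0900_snap/bar.txt /ifs/data/foo/bar_clone.txt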
File Clones
• OneFS also provides the ability to create writable clones of files. OneFS File
Clones provides a rapid, efficient method for provisioning multiple writable
copies of files.
• Common blocks are shared between the original file and clone, providing space
efficiency and offering similar performance and protection levels across both.
• This mechanism is ideal for the rapid provisioning and protection of virtual
machine files and is integrated with VMware's linked cloning and block and file
storage APIs.
File Clones
Snapshot Reserve
Best Practices
a. Configure the cluster to take fewer snapshots, and for the snapshots to expire
more quickly, so that less space will be consumed by old snapshots.
b. Using SmartPools, snapshots can physically reside on a different disk tier than
the original data.
c. Avoid creating snapshots of directories that are already referenced by other
snapshots.
d. It is recommended that you do not create more than 1000 hard links per file in a
snapshot to avoid performance degradation.
e. Always attempt to keep directory paths as shallow as possible. The deeper the directory paths referenced by snapshots, the greater the performance degradation.
Considerations
Challenge
Lab Assignment:
1) Create a SnapRevert domain and create snapshots.
2) Create and view a changelist.
3) Restore data using a snapshot.
SyncIQ Advanced
SyncIQ Advanced
SyncIQ Overview
• SyncIQ is the OneFS data replication module that provides consistent replicas
of data between two clusters.
• SyncIQ performance increases as the cluster scales out.
• SyncIQ provides automated failover and failback capabilities404.
404 Failover and failback only include the cluster preparation activities and do not
include DNS changes, client redirection or any required networking changes.
2: SyncIQ uses a policy-driven engine to execute replication jobs across all nodes.
The policy includes information to replicate and the replication schedule. The
administrator then starts the replication policy to launch a SyncIQ job. A policy is essentially a manifest of what should be replicated and how.
3: SyncIQ uses snapshot technology, taking a point in time copy of the source, or
SyncIQ domain, when the SyncIQ job starts. The first time the policy runs, an initial
or full replication of the data occurs. Subsequently, changes are tracked as they
occur and then a snapshot is taken for the change tracking. The new change list
begins when a snapshot is taken to begin the synchronization. On the source,
when a SyncIQ job completes successfully, the older source snapshot is deleted.
With SnapshotIQ licensed, administrators can choose to retain the snapshots for
historical purposes.
5: The data on the target cluster is read-only. When a SyncIQ job completes
successfully, a snapshot is taken on the target cluster. This snapshot replaces the
previous last known good snapshot. If a sync job fails, the last known good
snapshot is used to reverse any target cluster modifications. Policies cannot point
to the same target path.
SyncIQ Components
• Workers: SyncIQ creates workers on the source cluster that pair with workers on the target cluster to accrue the benefits of parallel and distributed data transfer.
• Target monitor: The target monitor provides critical information about the
target cluster and does not participate in the data transfer. It reports back with
IP addresses for target nodes including any changes on the target cluster.
Additionally, the target monitor takes target snapshots as they are required.
4: With the copy policy in SyncIQ, you can delete data on the source without
affecting the target, leaving a remote archive for disk-based tertiary storage
applications or staging data before it moves to offline storage. Remote archiving is
ideal for intellectual property preservation, long-term records retention, or project
archiving.
One of the simplest ways to manage resource consumption on the source and
target clusters is with proper planning of job scheduling.
Proactive Scheduling
• If the business has certain periods when response time for clients is critical,
then schedule replication around these times.
• If a cluster is a target for multiple source clusters, then modifying schedules to
evenly distribute jobs throughout the day is also possible.
Directory Selection
However, when required RTOs and RPOs dictate that replication schedules be
more aggressive or data sets be more complete, there are other features of SyncIQ
that help address this.
Worker Control
SyncIQ offers administrators the ability to control the number of workers that are
created when a SyncIQ job is run. This can improve performance when required or
limit resource load if necessary.
Administrators can also specify which source and target nodes are used for
replication jobs on a per policy basis. This allows for the distribution of workload
across specific nodes to avoid using resources on other nodes that are performing
more critical functions.
• Default is 8 * [# of nodes].
• Example: 20 nodes * 8 = 160 maximum pworkers per policy
• Workers are dynamically allocated to policies based on the size of the
cluster and the number of running policies.405
405
Workers from the pool are assigned to a policy when it starts, and the number of
workers on a policy will change over time as individual policies start and stop. The
goal is that each running policy always has an equal number (+/- 1) of the available
workers assigned.
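For example, on a 20-node cluster the pool is 160 workers. If three policies run concurrently, each receives roughly 53 or 54 workers; when one policy completes, its workers are redistributed so the two remaining policies each run with about 80.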
Policy Bandwidth
• Limits the replication bandwidth between the source and target cluster to
preserve network performance.
• Useful when the link between the clusters has limited bandwidth or to maintain
performance on the local network.
• Administrators can also limit the number of files that are processed in a given period to reduce node resource load.
Performance Rules
406 Using performance rules, you can set network and file processing threshold
limits to limit resource usage. You can configure network-usage rules that limit the
bandwidth that is used by SyncIQ replication processes. This may be useful during
peak usage times to preserve the network bandwidth for client response. Limits are
also applied to minimize network consumption on a low bandwidth WAN link that
exists between source and target.
• You can limit the number of concurrent SyncIQ jobs running during peak cluster
usage and client activity.
• Consider all factors prior to limiting the number of concurrent SyncIQ jobs, as
policies may take more time to complete, impacting RPO and RTO times.
• Configuration Steps:
OneFS 8.2 and later provides over-the-wire, end-to-end encryption for SyncIQ data
replication, protecting and securing in-flight data between clusters.
407The certificates are stored and managed in the certificate stores of the source
and target clusters. Encryption between clusters takes place with each cluster
storing its own certificate and the certificate of its peer. Storing the certificate of the
peer essentially creates an approved list of clusters for data replication.
Certificate revocation is supported through an external Online Certificate Status
Protocol (OCSP) responder.
• SyncIQ encryption supports TLS protocol version 1.2 and OpenSSL 1.0.2.
• A TLS authentication failure causes the corresponding SyncIQ job to
immediately fail.
• SyncIQ peers must store the end entity certificates of each other.
• Customers are responsible for creating, managing, and safeguarding their own
X.509 certificates.
408The clusters require all incoming and outgoing SyncIQ policies to be encrypted
through a simple change in the SyncIQ global settings.
Scenario
✓ Create X.509 certificates signed by a certificate authority for source and target
clusters.
✓ Add the certificates to the appropriate source cluster stores.
✓ Set the SyncIQ cluster certificate on the source cluster.
✓ Add the certificates to the appropriate target cluster stores.
✓ Set the SyncIQ cluster certificate on the target cluster.
✓ Create encrypted SyncIQ policy on the source cluster.
Step 1
• <ca_cert_id>
• <src_cert_id>
• <tgt_cert_id>
Step 2
Step 3
Step 4
Step 5
Step 6
Resource: For the description of each command, view the CLI
Command Reference Guide. For more information about SyncIQ
encryption, view the Dell EMC PowerScale SyncIQ: Architecture,
Configuration, and Considerations white paper.
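As a rough sketch only, the steps above map to CLI commands similar to the following. The subcommand names and options are assumptions recalled from OneFS 8.2 documentation and vary by release, so confirm each one in the CLI Command Reference Guide before use. The <ca_cert_id>, <src_cert_id>, and <tgt_cert_id> placeholders correspond to the IDs listed in Step 1.

isi certificate authority import <ca_cert_file>
isi sync certificates server import <src_cert_file> <src_key_file>
isi sync certificates peer import <tgt_cert_file>
isi sync settings modify --cluster-certificate-id <src_cert_id>
isi sync policies create <policy> sync <source_path> <target_host> <target_path> --target-certificate-id <tgt_cert_id>
isi sync settings modify --encryption-required true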
Performance Rules
• Setting performance rules enables control over resources for different types of
workflows.409
• The performance rules apply to all policies running during the specified time
interval.
• CLI command: isi sync rules create
WebUI Navigation: Cluster management > Data Protection > SyncIQ >
Performance rules
1: You can disable a performance rule to temporarily prevent the rule from being
enforced. You can also enable a performance rule after it has been disabled.
3: Based on the rule type, you can set the limit in terms of kb/s for bandwidth,
files/s for file count, and % limit for CPU and workers.
4: The rule is enforced during the specified time interval for the selected days.
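For example, a bandwidth rule of 10,000 kb/s enforced Monday through Friday from 08:00 to 18:00 (values chosen purely for illustration) caps the combined traffic of all SyncIQ policies running in that window at 10,000 kb/s; outside the window, the policies run unthrottled unless another rule applies.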
411As bandwidth reservations are configured, consider the global bandwidth policy
which may have an associated schedule. The global reservation is split amongst
the running policies.
413 Even split of bandwidth across all running policies. This is the current behavior of bandwidth rules.
414 Even split of bandwidth across all running policies, until they reach their requested reservation. This effectively ensures that policies with the lowest reservation amounts reach their reservations before policies with larger reservations, preventing starvation.
Step 1
Step 2
Optional Steps
Use Case
415 It is recommended to have all nodes on the source and target clusters
configured with Network Time Protocol (NTP) Peer Mode. If Compliance SmartLock
is required, all source and target nodes must be configured in NTP Peer Mode prior
to configuring the compliance clock.
416 Contact support for help to recover the policy. Breaking or resetting the policy results in duplicate data consuming space, because users are forced to create a new policy to a new, empty target path. The old target path must remain with its data, since the SOX-compliant code does not allow deleting or overwriting.
Configurations supported
Source Site Target Site
Create target SmartLock compliance directory before
replicating
Best Practice: Keep the source and target directories of the same
type.
Data Reduction
• SmartDedupe:
• Source cluster enabled with 16 TiB file support can only connect to targets that
are also enabled for 16 TiB file support.
It is recommended to have the same OneFS version and patches on both the
source and target cluster.
Disaster Recovery
Disaster Recovery
Module Objectives
• Data protection and backup are broad terms that are used to describe a host of
tools.
• File-level data protection uses FEC417.
• Other protection tools such as snapshots, data replication, and NDMP backup
can be employed to protect data.
• Typically, there is no one size fits all solution.
The graphic shows the data recovery approaches in order of decreasing timeliness.
418
Synchronization and replication provide valuable tools for business continuance
and offsite disaster recovery.
419Snapshots offer rapid, user-driven restores without the need for administrative
assistance.
At the beginning of the continuum sits high availability. Redundancy and fault
tolerant designs satisfy this requirement. The goal here is continuous availability
and the avoidance of downtime by the use of redundant components and services.
• Snapshots421
• Replication422
• NDMP Backup423
421 Snapshots are frequently used to back up the data for short-term retention and
to satisfy low recovery objective SLAs.
423NDMP backup to tape or VTL (virtual tape library) typically satisfies longer term
high recovery objective SLAs and any regulatory compliance requirements.
1:
• Backup to Tape
• Higher RTO and RPO
424An example is taking a full backup from a snapshot for DR, and letting it run
while using snapshots for any daily restores.
425Generally, organizations use tape for long-term storage. Management for tape
can be complex and recovery times unpredictable, unreliable, and long at petabyte
scale. Recovery from a disaster can take weeks at petabyte scale. Many
organizations still use backup to tape as their recovery solution. The cost,
maintenance, and resources can be less than a site to site solution, especially if
backing up to tape instead of disk.
426If a disaster occurs, the tapes can be stored offsite to maintain SLAs until the
source is brought back online.
The graphic shows two NDMP recovery methods which are solutions for disaster recovery.
These examples are not all-or-nothing situations. Organizations can have recovery
data on a cluster in a remote site while the less critical, archival data can restore
from tape for months if necessary.
427 The replication process is completed quickly and with minimal disruption.
One-to-Many429
Functions
430 User snapshots for data protection use CoW. CoW allows users and administrators to retrieve lost, corrupted, or maliciously modified files. Before a new write is committed, the original block is copied to the snapshot area. CoW incurs a double write penalty, but it results in less file fragmentation.
431 RoW snapshots are system-defined and are not a data protection use case.
The graphic compares CoW and RoW snapshots: each panel shows a file system with blocks A through D and an associated snapshot area. Changes are made to block D; with CoW, the changes incur a double write penalty, but there is less fragmentation of the HEAD file.
Grandfather-Father-Son Scheduling
Backup
• In PowerScale, OneFS checks the previous snapshot for the NDMP backup
operation, snapshot_T1, and compares it to a new snapshot, snapshot_T2.
• OneFS then backs up all files that are modified since the last snapshot was
made.
Replication
Without the use of snapshots, the file system directories would have to be scanned to find the modified data, which is unrealistic at petabyte scale.
• The first job initial synchronization for a SyncIQ policy sets a point-in-time
baseline of the production data. For the initial synchronization, the source
cluster creates a snapshot432.
• The incremental synchronization updates the target data.
• Once the initial synchronization is done, the first incremental update is ready433.
A snapshot is created before each replication job.
432The snapshot is used to ensure data that is modified after the snapshot point-in-
time is not replicated.
433The first update may take some time depending on the data change rate and
duration of the initial synchronization.
In the example, data E’ and F’ are not replicated in the initial synchronization. The
goal is to get a point-in-time baseline copy to the target cluster. The initial
synchronization can consume large amounts of network bandwidth and take a long
time to complete. SyncIQ uses the difference between two snapshots, the previous
snap and the new snap. The graphic shows the new snapshot checking what
blocks are different. Only the changed blocks are updated to the target.
Subsequent updates should replicate less data until the replication achieves a
steady state where replication time and the amount of data are predictable.
• There are several key variables that impact the time that the initialization takes,
such as link speed and the amount of data434.
• For example, if the initial synchronization takes three weeks during which 80%
of the data is modified, the first update may take several days.
• In this example, it can take some time before the policy reaches a predictable
and steady rate.
434With a slow link and massive amount of data, the initial copy could take weeks
to months. Furthermore, depending on the change rate of data, the first
synchronization could also take a long time to finish.
• The SyncIQ policy is configured, and 1 PB of data is replicated over the LAN.
• Once the initial synchronization and subsequent updates are complete, the
SyncIQ policy is disabled, the cluster shipped to the remote facility, and the
policy re-enabled.
• A cookie is retained on the target cluster in order for the policy to continue
incremental syncs and avoid a full retransmission of the data via initial
synchronization.
The graphic shows an example where the target cluster is co-located with the source cluster.
One to Many
The graphic shows an organization with three clusters that has a typical hub and spoke topology.
Many to One
The graphic shows an organization that replicates data from multiple locations to a core data center.
Student Guide: Another challenge organizations may face is the need to have
parallel synchronization to multiple recovery clusters. Replicating to multiple sites
protects against multiple cluster and multiple site failures. In the scenario, an
organization has three clusters, shown in a typical hub and spoke topology. In
this scenario, the policy replicating to the North data center uses the same
snapshot as the policy replicating to the East data center. A challenge to this model
is the resources SyncIQ uses, especially if there are multiple SyncIQ policies and
many remote clusters. By default, SyncIQ uses resources across the cluster and its
network and CPU consumption is unlimited. SyncIQ may require throttling to
constrain resources that are used for SyncIQ so as not to interfere with cluster
operations, production workflows, and client responses. OneFS 8.0 and later
changes the worker limitations and updates performance rules.
Another hub and spoke topology is a many to one solution. The graphic shows an
organization that replicates data from multiple locations to a core data center. The
organization may use a many to one solution to consolidate production data to
central, disaster recovery cluster. At the core data center, data is backed up to tape
or can be archived to the cloud, providing added protection. Each remote facility
should follow the naming practices. The remote data centers replicating to data
center West each have unique base directories. The failover target path should be
the same from source to target. Having the same path naming enables smooth DNS failover, maintains scripts, and keeps a consistent mountpoint for NFS connections.
What is the goal or the requirement for replication? Is a mirrored copy of the source
the goal? Or is the goal to have all source data copied and retain deleted file copies
in case they are required later? Many other platforms use similar methods. Some
environments may distinguish local and remote by the type of network connecting
the source and the target platforms.
Local intra-cluster replication is when the production and replicated data reside in
the same storage platform. SyncIQ uses the internal network to replicate, creating
an extra copy of high value data. Intra-cluster replication provides protection in the
event the source directory is lost.
A local cluster to cluster solution replicates data over the LAN to another platform
typically in the same data center. Cluster to cluster replication is typically done
between platforms in the same facility to protect against cluster failure. Remote
replication over the WAN protects against cluster and site failure. PowerScale
supports only asynchronous replication over IP. A subdirectory of /ifs is always
used as the source and target, never /ifs.
Two-way data replication is a common use case in many remote disaster recovery
implementations.
• The graphic shows two data centers that are named West and East.
• Each data center hosts read/write production data and synchronizes the
production data to a remote target.
• In the event one of the data centers has a catastrophic failure, both West’s
production data and East’s production data are protected.
Cascading
Cascading may cause inconsistent views of the data between North and South. If
deploying a cascading, PowerScale solution, care must be taken to prevent the
possible inconsistencies438.
437
In a cascading model, the data is replicated from cluster West to cluster East
and then from cluster East to cluster South.
Bandwidth
The graphic shows an example of estimating the amount of time the initial copy can
take.
1: In the example, the change rate is consistent at 10 GB per hour. The question to answer is: can the amount of data I need to synchronize meet my RPO?
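As a purely hypothetical worked example (the figures are assumptions, not from the course material): replicating 100 TB of initial data over a dedicated 1 Gb/s link at roughly 125 MB/s of theoretical throughput moves about 10 TB per day, so the initial copy alone takes on the order of 10 days before accounting for protocol overhead. A change rate of 10 GB per hour then adds roughly 240 GB per day that every incremental job must also move within the RPO window.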
FlexProtect Overview
Important: Only SmartFail a node with the guidance of the Dell EMC
support team.
The data is always rebuilt from FEC. FlexProtect has a corresponding job,
FlexProtectLin. The suffix indicates that the job automatically uses an SSD-based
copy of metadata to scan the LIN tree, rather than the drives themselves.
Depending on the workflow, using FlexProtectLin often significantly improves job
runtime performance. FlexProtect uses a drive scan and an inode scan, whereas FlexProtectLin uses only an inode scan.
• OneFS protects data in the cluster based on the configured protection policy.
• It distributes all data and error-correction information across the cluster and
ensures that all data remains intact and accessible even in the event of
simultaneous component failures.
• It rebuilds failed disks, uses free storage space across the entire cluster to
further prevent data loss, monitors data, and migrates data off of at-risk
components.
• Under normal operating conditions, all data on the cluster is protected against
one or more failures of a node or drive.
• OneFS reprotects data by rebuilding data in the free space of the cluster. While
the protection status is in a degraded state, data is more vulnerable to data
loss.
FlexProtect Impact
The time that it takes to rebuild data following a drive failure depends on many
variables. The best way to gauge a FlexProtect runtime is to use a previous rebuild
runtime. Without a history, the runtime becomes a guess. The graphic shows that the runtime to rebuild the data after smartfailing a 1.2 TB drive is about 6.6 hours.
Runtime on
idle system
Variables to the
data rebuild time
The graphic shows an example of a runtime on a cluster with no load and a mix of small and large
files.
The OneFS release determines the job engine version and how efficiently it
operates. The system hardware dictates the drive types, amount of CPU, and
RAM. The amount of file system data, the makeup of data, and the protection
levels have an impact. The load on the cluster also determines the amount of time
a rebuild takes. SmartFail runtimes range from minutes for empty, idle nodes to
days for nodes with large SATA drives and a high capacity utilization.
SmartFail Overview
SmartFail is the mechanism OneFS uses to protect data on failing drives or nodes.
OneFS smartfails drives when any potential data integrity risk exists. Smartfailing a
drive can be anticipated439 or unanticipated. FlexProtect is the process that handles
the reprotection and verification. After the process is complete, the failed device is
logically removed from the cluster and can then be replaced.
Note:
Consult Dell EMC support before manually smartfailing a node or
drive.
A node issue or failure does not automatically start rebuilding data.
Do not remove a drive without understanding the latest procedure –
consult Dell EMC support for the procedure.
Shown is a simplistic example of the process. Data is rebuilt using FEC and rebuilt
in free space within the same disk pool. Typically, smartfailing a node is done for
migration purposes or with assistance from Dell EMC support. Large capacity
disks, such as 6 TB, 8 TB, and 10 TB SATA drives, may require longer data
reconstruction times. For example, a 6 TB drive at 90% capacity takes longer than
a 2 TB drive at 90% capacity. Conversely, a 2 TB drive at 90% capacity takes
longer than a 6 TB drive at 10% capacity. Disk evolution has produced greater disk
density, but the disk mechanics remain constant, meaning the number of heads
and actuators remains the same. Large capacity disks raise the probability of a
multiple drive failure scenario.
Node Failures
If a node reboots, the file system does not need to be rebuilt because it remains
intact during the temporary failure.
• OneFS does not automatically start reprotecting data when a node fails or goes
offline.
• If N+1 data protection is configured on a cluster, and one node fails, all the data
is still accessible from every other node in the cluster.
• If the node comes back online, the node rejoins the cluster automatically without
requiring a full rebuild.
• If a node is physically removed from the cluster, it should be removed logically
also.
• After that the node automatically reformats its own drives, and resets440 itself to
the factory default settings.
• Use SmartFail441 to logically remove a node.
440The reset occurs only after OneFS has confirmed that all data has been
reprotected.
• After the new node is added, OneFS distributes the data to the new node.
• It is more efficient to add a replacement node to the cluster before failing the old
node because OneFS can immediately use the replacement node to rebuild the
data stored on the failed node.
The failover target path needs to be the same as the source. A consistent path 442
enables switching between clusters by changing the DNS direction.
441It is important that you smartfail nodes only when you want to permanently
remove a node from the cluster. If you remove a failed node before adding a new
node, data that are stored on the failed node must be rebuilt in the free space in the
cluster.
442Consistent path names keep any scripts using the shares and exports from the
source cluster from breaking. Also, the mount entries for NFS connections must
have a consistent mountpoint to avoid having to manually edit client fstab or
automount entries.
The graphic shows a two-way replication where both sites are a source and target.
Scenario
• Two-way replication, each data center hosts production read/write data
• Each data center acts as a DR target
• Data center East has two SyncIQ domains with different RPO requirements
Cluster East has two source directories each having a different RPO requirement.
The data in /ifs/east/engineering is critical data and therefore has a shorter
RPO. Exclude and include statements should be used only with a good
understanding of their function. By default, SyncIQ includes all files and folders
under the specified root directory, such as all subdirectories under
/ifs/west/sales. Explicitly including one path, such as
/ifs/west/sales/sales-gen, excludes all other paths such as
/ifs/west/sales/sales-media.
The larger and more disparate the organization, the more complex a disaster
recovery solution.
• Some of these methods are biased towards cost efficiency but have a higher
risk that is associated with them, and others represent a higher cost but also
offer an increased level of protection.
• Despite support for parallel NDMP and native two-way NDMP over Fiber
channel, traditional backup to VTL or tape is often not a feasible DR strategy at
the large or extra-large cluster scale. Instead, replication is usually preferred.
• For large clusters, snapshots typically provide the first line of defense in a data
protection strategy with low recovery objectives.
SyncIQ DR
SyncIQ DR
Scenario
Use Case
• To meet the RPO and RTO of 6 hours, the policy is configured to synchronize
data between source and target every 3 hours during the weekdays.
• The scheduled job runs only if the source contents are modified since the last
run.
• Administrators are made aware via events if RPO of 6 hours is exceeded.
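A minimal sketch of such a policy from the CLI, using hypothetical paths and host names. The positional arguments follow the isi sync policies create pattern of name, action, source path, target host, and target path; the schedule string format and the RPO-alert and skip-when-unmodified options are assumptions to verify in the CLI Command Reference Guide:

isi sync policies create eng-dr sync /ifs/east/engineering west-cluster.example.com /ifs/west/dr/engineering --schedule "every day every 3 hours" --rpo-alert 6H --skip-when-source-unmodified true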
• Policy name and description: As a best practice, the policy name field should
be descriptive for administrators to easily gather the policy workflow, as several
policies could be configured on a cluster. A unique name makes it easy to
recognize and manage.
• Enable/Disable policy: Temporarily disabling a policy allows for a less intrusive
option to deleting a policy when it may not be required. Additionally, after
completing the configuration for a policy, it can be reviewed for a final check,
prior to enabling.
• Policy type:
− Copy: A copy policy maintains a duplicate copy of the source data on the
target. Files deleted on the source are retained on the target. A copy policy
offers file deletion protection, but not file change protection. Copy policies
are most commonly used for archival purposes.
− Synchronize: A synchronization policy maintains an exact point in time copy
of the source directory on the target cluster. If a file or sub-directory is
deleted from the source directory, when the job runs, the file or directory is
removed from the target. The synchronization policy does not provide
protection from file deletion, unless the synchronization has not yet taken
place. It provides source cluster protection.
• Job options:
− Manually: Run the replication job on an ad hoc basis. This limits cluster
overhead and saves bandwidth. Manual SyncIQ jobs still maintain a source
snapshot that accumulates changed blocks. Therefore, it is recommended to
run the manual job frequently, ensuring the source snapshot growth is
limited.
− On a schedule: Provides a time-based schedule for the SyncIQ policy
execution. When selected the time schedule options change to match the
selected interval. An option is available to not run the policy if no changes to
the data have occurred since the last time the policy was run. This option
saves system resources when replication is not required. Administrators can
specify an RPO (recovery point objective) for a scheduled SyncIQ policy and
trigger an event to be sent if the RPO is exceeded. The RPO calculation is
the interval between the current time and the start of the last successful sync
job.
− When source is modified: The SyncIQ domain is checked every 10
seconds for changes. If a change is detected, the policy runs automatically.
Events that trigger replication include file additions, modifications and
deletions, directory path, and metadata changes. The option is only
recommended for datasets with a very low archival change rate. Constantly
forcing SyncIQ to run the policy on high change rate datasets could severely
impact both source and target cluster performance. An option to delay the
start of the replication is available to allow new writes to the source to
complete prior to running the job. Delaying the policy allows fewer, longer replication runs rather than many short ones. Content distribution and
Electronic Design Automation, or EDA, are the primary use cases.
− Whenever a snapshot of the source directory is taken: Replication initiates when a snapshot matching the specified pattern is taken. The option is useful in a one-
to-many solution where only one user generated snapshot is required for all
replications. The job has the option to replicate data based on historic
snapshots of the source SyncIQ domain the first time the policy is run. This
creates a mirrored image of the snapshots on the target from the source and
is particularly useful for snapshot protection against file deletions. The Enable capture of snapshots on the target cluster option must also be set to create a mirrored image of the snapshots.
Configuration
Rule - If both include and exclude directories are specified, any excluded
directories must be contained in one of the included directories. Otherwise, the
excluded directory setting has no effect.
• Included Directories:
− /ifs/div-gen/engineering/media/graphs
− /ifs/div-gen/engineering/media/sample_demo
• Excluded Directories:
− /ifs/div-gen/engineering/media/graphs/trails
− /ifs/div-gen/engineering/media
• Result: The exclusion of /ifs/div-gen/engineering/media/graphs/trails takes effect because it is contained in an included directory. The exclusion of /ifs/div-gen/engineering/media has no effect because it is not contained in any included directory.
• File Matching Criteria: The criteria are built from rules joined by condition links. If a file matches the rules, it is replicated; if it does not match, the file is not replicated.
• Restrict Source Nodes: By default, the cluster can use any of its external interfaces to synchronize data to the target. Selecting run on only
the nodes in the specified subnet and pool directs the policy to the specific pool
for replication. This option effectively selects a SmartConnect zone over which
the replication traffic transfers. SyncIQ only supports static IP address pools. If
using dynamically allocated IP address pools, SmartConnect might reassign the
address while a replication job is running. Reassigning the address will
disconnect the job and cause it to fail. You can specify a source pool globally on
the Settings tab.
• Target Host: Specify the target host using the target SmartConnect zone IP
address, the fully qualified domain name, or local host. Local host is used for
replication to the same cluster. You also specify the target SyncIQ domain root
path.
• Target Directory: As a best practice, ensure that the source cluster name and the access zone name are in the target directory path.
• Restrict Target Nodes: SyncIQ will use only the node connected within the
SmartConnect zone. Once a connection with the target cluster is established,
the target cluster replies with a set of target IP addresses assigned to nodes
restricted to that SmartConnect zone. SyncIQ on the source cluster will use this
list of target cluster IP addresses to connect local replication workers with
remote workers on the target cluster.
• Enable Target Snapshots: SyncIQ always retains one snapshot of the most
recently replicated delta set on the secondary cluster to facilitate failover,
regardless of this setting. Enabling capture snapshots retains snapshots beyond
the time period that is needed for SyncIQ. The snapshots provide more recovery points on the secondary cluster. The snapshot alias name is the default alias for
the most recently taken snapshot. The alias name pattern is
SIQ_%(SrcCluster)_%(PolicyName). For example, a cluster called cluster1
for a policy called policy2 would have the alias SIQ_cluster1_policy2. You can
specify the alias name as a Snapshot naming pattern. For example, the pattern
%{PolicyName}-on-%{SrcCluster}-latest produces names similar to
newPolicy-on-Cluster1-latest.
• Snapshot Expiration: The expire options are days, weeks, months, and years.
It is recommended to always select a snapshot expiration period.
• Due to high replication frequency, reports are set to be deleted 2 weeks after
creation.
• Cloud data is retrieved by the source and replicated to the target.
Running a DomainMark during the failback process can take a long time to
complete.
• Report Retention: Defines how long replication reports are retained in OneFS.
Once the defined time has exceeded, reports are deleted.
• Record Sync Deletions: Track file and directory deletions that are performed
during synchronization on the target.
• Deep Copy: Applies to those policies that have files in a CloudPools target.
Deny is the default. Deny enables only stub file replication. The source and
target clusters must be at least OneFS 8.0 to support Deny. The Allow option
enables the SyncIQ policy to determine if a deep copy should be performed.
Force automatically enforces a deep copy for all CloudPools data that are
contained within the SyncIQ domain. Allow or Force are required for target
clusters that are not CloudPools aware.
Assess Policy
SyncIQ can run an assessment on a policy without actually transferring file data
between the primary and secondary cluster.
• SyncIQ scans the dataset and reports443.
• Assessment for performance tuning444.
443SyncIQ provides a detailed report of how many files and directories were
scanned. This is useful if you want to preview the size of the dataset that is
transferred if you run the policy.
444Running a policy assessment is also useful for performance tuning, allowing you
to understand how changing worker loads affects the file scanning process so you
can reduce latency or control CPU resource consumption.
• Verify communication445.
• How much data446 to replicate.
Assess Sync
445The assessment also verifies that communication between the primary, and
secondary clusters is functioning properly.
446The assessment can tell you whether your policy works and how much data will
be transferred if you run the policy. This can be useful when the policy will initially
replicate a large amount of data.
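A hedged sketch of running an assessment from the CLI (the policy name is hypothetical; the --test option and report command are recalled from OneFS documentation and should be verified for your release):

isi sync jobs start eng-dr --test
isi sync reports view eng-dr 1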
Report
Description
• The RPO is the amount of time that has passed since the last completed
replication job started.
• The RPO is never greater than the time it takes for two consecutive replication
jobs to run and complete.
• RTO is the maximum amount of time required to make data on the target
available to clients after a disaster.
• The RTO is always less than or approximately equal to the RPO.
Scenario
In the example, a disaster occurs at 6:50 before the update is complete. The data on the target
cluster reverts to the state it was in when the last replication job completed, which is 4:00.
• The impact of the change is dependent upon how the policy is modified.
• When an already running policy is modified, SyncIQ will run either the initial
replication or a differential replication again.
• When a policy is deleted:
− Replication jobs are not created for the policy.
− Snapshots and reports associated with the policy are deleted.
− Target cluster breaks the policy association with the source cluster.
− The target directory allows writes.
• Rather than modifying or deleting a policy when a suspension is required, you
can disable a policy and re-enable when required.
Modifying the fields shown in the graphic can trigger a replication to run.
Use Case
Considerations
Failover
Failover is the process of changing the role of the target replication directories into
the role of the source directories for assuming client read, write, and modify data
activities.
Failback
Failover Revert
A failover revert undoes a failover job in process. Use revert before writes occur on
the target.
Failover: Failovers can happen when the primary cluster is unavailable for client
activities. The reason could be from any number of circumstances including natural
disasters, site communication outages, power outages or planned events such as
testing a disaster recovery plan or as a result of upgrade or other scheduled
maintenance activities. Failover changes the target directory from read-only to a
read/write status. Failover is managed per SyncIQ policy. Only policies that are
failed over are modified. SyncIQ only changes the directory status and does not
change other required operations for client access to the data. Network routing and
DNS must be redirected to the target cluster. Any authentication resources such as
AD or LDAP must be available to the target cluster. All shares and exports must be
available on the target cluster or be created as part of the failover process.
Failback: A failback can happen when the primary cluster is available once again
for client activities. The reason could be from any number of circumstances
including that natural disasters are no longer impacting operations, or site
communication or power outages have been restored to normal. Each SyncIQ
policy must be failed back. Like failover, failback must be selected for each policy.
The same network changes must be made to restore access to direct clients to the
source cluster.
Failover Revert: A failover revert undoes a failover job in process. Use revert if the
primary cluster once again becomes available before any writes happen to the
target. A temporary communications outage or if doing a failover test scenario are
typical use cases for a revert. Failover revert stops the failover job and restores the
cluster to a sync ready state. Failover revert enables replication to the target cluster
to once again continue without performing a failback. Revert may occur even if data
modifications have happened to the target directories. If data has been modified on
the original target cluster, perform a failback operation to preserve those changes.
Not doing a failback loses the changes made to the target cluster. Using revert will
cause all changes written to the source and target cluster since the last SyncIQ
snapshot to be permanently lost. Before a revert can take place, a failover of a
replication policy must have occurred. A revert is not supported for SmartLock
directories.
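As an illustrative sketch only (the policy name is hypothetical), failover and failover revert are driven per policy from the target cluster with the allow-write operation. The first command makes the failed-over SyncIQ domain writable; the second undoes the failover when no target changes need to be preserved:

isi sync recovery allow-write eng-dr
isi sync recovery allow-write eng-dr --revert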
Configuration Preparations
SyncIQ only manages the failover and failback for the file data. For a smooth
failover, the target must be configured similar to the source:
• The file system directory structure, shares and exports should be the same on
the source and target.
• Similar SmartConnect configurations such as DNS, access zones and SPNs.
• Quotas need to be considered447.
• The source and target must share the same authentication providers448.
• The UIDs, GIDs and SIDs must resolve to be the same user or group. 449
For SyncIQ Domain preparation at the target site, there are two possible scenarios
to prepare for:
1. The first scenario is the last sync job has completed successfully.
2. The second scenario is the last sync job did not complete successfully or failed
mid job.
447 Having a larger quota, or no quota, on the target can cause problems when failing back. For example, quotas placed on the source can be exceeded while the target is read/write, and failing back may then deny users or groups the ability to write.
449 The POSIX and AD mappings for a user need to map to avoid file permission
issues. During the setup process, you may be required to make ID mapping
corrections, especially if the clusters were not sharing LDAP or AD domains
originally. The best practice is to join to the authentication providers before the first
use.
The first part of the site preparation stage is to set the SyncIQ directories for the
sync job to no longer accept incoming sync requests. The system then takes a
snapshot of the directories for the sync job, labeled "-new." The system then
compares the “-new” snapshot to the “-latest” or last-known-good snapshot. If they
are the same and no differences are found, the sync directories have the read-only
bit removed and are placed into a read/write state and ready to accept write
activity.
In the case where a sync job has not completed, failed, or was interrupted in
progress, the “-new” snapshot is taken as before and compared to the “-latest” last-
known-good snapshot. The differences in directories, files, and blocks are then
reverted to the last-known-good state. This process is also called snapshot revert.
This restores the files to the last known consistent state. All synchronized data in the difference between the snapshots is deleted. Be aware that some data might be lost or unavailable on the target. After this has been accomplished, the sync directories have the read-only bit removed and are placed into a read/write state, ready to accept client write activity.
A failback consists of four distinct phases: the preparation, running the mirror policy, restoring the source, and restoring the SyncIQ policy. Failing back without executing each phase will discard the changes that occurred on the target while failed over. To begin the failover, synchronize the dataset from the source, shown as boston, to the target, shown as phoenix. Next, Allow Writes on the target to prepare for the full failover. The preparation phase runs the resync-prep on the source cluster. This
phase readies the source to receive the updates that occurred on the target cluster.
It creates a mirror policy on the target. The next phase runs the mirror policy to
failback the dataset. Next make the source active by enabling Allow Writes on the
source SyncIQ local targets mirror policy. The final phase runs the resync-prep on
the target mirror policy to set the target cluster to read-only and ensures that the
datasets are consistent. Run the replication report to verify the failback completed
and redirect clients to the source.
Failover Steps
Steps
Administration
When a policy runs for the first time, it creates an association between the source
and target.
You can make the target data writable by either using the Allow Writes option or by
breaking the association between source and target.
• Allow Writes: Does not result in a full or differential replication after the policy is active again, because the policy is not reset.
• Break Association: The policy must be reset before the policy can run again. A full or differential replication will occur the next time the policy runs. During this full resynchronization, SyncIQ creates a new association between the source and its specified target.
Failover and failback processes are initiated by the administrator using the CLI or
the web administration interface. Each SyncIQ policy must be initiated for failover
or failback separately on the target cluster. There is no global failover or failback
selection. Mirror policies and SyncIQ snapshots are baseline elements used by
SyncIQ in normal operations. Do not delete the mirror SyncIQ policies used for
failback. SyncIQ snapshots begin with SIQ- and should never be manually deleted.
450The cookie allows the association to persist, even if the target cluster’s name or
IP address is modified.
Steps
Administration
Reverting a failover operation does not synchronize data modified on the target
back to the source cluster. It undoes a failover job. The table shows the steps
following the failover. If testing a SyncIQ domain for disaster recovery and the
directory is only temporarily read/write on the target, the changes are discarded
with a failover revert. Discard the changes by clicking disallow writes in the web
administration interface for each policy.
Failback Steps
Steps
Resync-prep
Mirror Policy
Target
Source
Shown are the failback steps. The target cluster is failed over and accepting writes.
The target is not accepting updates from the source cluster. The source directory is
read-only.
• Prepare re-sync on the source cluster. For each SyncIQ policy on the source, a
mirror policy is created on the target. The source cluster is set to a read-only
state, is rolled back to the last known good state, and can accept updates from
the target mirror policies.
• Stop client access to the read/write directory on the target cluster to prevent
new data that may not be synchronized back to the source.
• From the target, resynchronize the target to the source using the mirror policies.
• Allow writes for each sync policy on the source cluster. This resets the sync
direction from the original source to the target. The target is not accepting
updates from the source and is still in a read/write state.
• Run the prepare resync on the target to reset the target sync relationship with
the source. The target is set to accept new syncs from the source and is
restored to a read-only status.
• Redirect clients to the source cluster from the target cluster at this point.
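The same flow maps roughly to the following CLI operations, shown as a sketch. The policy name eng-dr is hypothetical, and the mirror policy name assumes the usual <policy>_mirror naming; confirm the commands in the CLI Command Reference Guide.

isi sync recovery resync-prep eng-dr            (on the source: create the mirror policy and roll the source back)
isi sync jobs start eng-dr_mirror               (on the target: replicate the failed-over changes back to the source)
isi sync recovery allow-write eng-dr_mirror     (on the source: make the source writable again)
isi sync recovery resync-prep eng-dr_mirror     (on the target: return the target to read-only, ready for normal syncs)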
Scenario
Topology
Configuration
Advantages Disadvantages
Resource:
1) Superna Eyeglass with SyncIQ - PowerScale Info Hub
2) Eyeglass PowerScale Edition Quick Start Guide
components. When the task completes, the discovered inventory can be seen in the
inventory view. Once the inventory task completes, Eyeglass Jobs are
automatically created to replicate between the SyncIQ policy defined source and
target. Enabling the configuration replication can be done on a job by job basis
from the jobs tool.
Scenario
Overview
• When data is tiered to the cloud, a SmartLink file is created on the cluster,
containing the relevant metadata to retrieve the file at a later point.
• A file that is tiered to the cloud cannot be retrieved without the SmartLink file.
• During replication, the SmartLink files are also replicated to the target cluster.
• Both source and target can have read access, but only a single cluster can
have write access.
During normal operation, the source cluster has read-write access to the cloud provider, while the
target cluster is read-only.
When failover is performed, clients are only allowed read access to the cloud data
using the SmartLink file.
Changes to the cloud data are propagated only via the source after failback is
performed.
For extended or permanent failover, write access can be granted by using the isi
cloud access command.
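As a hedged illustration (the subcommands and the GUID value are assumptions to confirm in the CLI guide), granting the failed-over cluster write access to the cloud data might look like:

isi cloud access list
isi cloud access add <source-cluster-guid>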
Policy Configuration
To enable CloudPools integration on the target, Deep Copy must be either set to
Allow or Deny.
• Allow451
• Deny452
• Force453
451Replicates the SmartLinks from the source to the target cluster, but also checks
the SmartLinks versions on both clusters. If a mismatch is found between the
versions, the complete file is retrieved from the cloud on the source, and then
replicated to the target cluster.
452Deny is the default setting, allowing only the SmartLinks to be replicated from
the source to the target cluster, assuming the target cluster has the same
CloudPools configuration.
Target Configuration
• When CloudPools is configured prior to the SyncIQ policy: First SyncIQ job
checks for SmartLink files on the source directory.
• When CloudPools is configured after the SyncIQ policy: Next SyncIQ job after
CloudPool configuration checks for SmartLink files on the source directory.
• In both cases, if SmartLink files are found, the target cluster SyncIQ performs
the following:
− Configures the cloud storage account and CloudPools matching the source
cluster configuration.
− Configures the file pool policy matching the source cluster configuration.
• As a best practice, temporarily disable the associated SyncIQ policy prior to
configuring CloudPools.
You can manually run, view, assess, pause, resume, cancel, resolve, and reset
replication jobs that target other clusters.
No more than five running and paused replication jobs can exist on a cluster at a
time. However, an unlimited number of canceled replication jobs can exist on a
cluster.
453Requires CloudPools to retrieve the complete file from the cloud provider on to
the source cluster and replicates the complete file to the target cluster.
If a replication job remains paused for more than a week, SyncIQ automatically
cancels the job.
454You can manually start a replication job for a replication policy at any time. To
replicate data according to an existing snapshot, run the isi sync jobs start
command with the --source-snapshot option.
455You can pause a running replication job and then resume the job later. Pausing
a replication job temporarily stops data from being replicated, but does not free the
cluster resources replicating the data. A paused job reserves cluster resources
whether or not the resources are in use.
456You can cancel a running or paused replication job. Canceling a replication job
stops data from being replicated and frees the cluster resources that were
replicating data. You cannot resume a canceled replication job. To restart
replication, you must start the replication policy again.
When a replication job fails due to an error, SyncIQ might disable the
corresponding replication policy.
To fix the policy, you can either resolve457 or reset458 the policy.
It is recommended that you attempt to fix the issue rather than reset the policy.
457If SyncIQ disables a replication policy due to a replication error, and you fix the
issue that caused the error, you can resolve the replication policy. Resolving a
replication policy enables you to run the policy again. If you cannot resolve the
issue that caused the error, you can reset the replication policy.
458 If a replication job encounters an error that you cannot resolve, you can reset
the corresponding replication policy. Resetting a policy causes OneFS to perform a
full or differential replication the next time the policy is run. Resetting a replication
policy deletes the latest snapshot generated for the policy on the source cluster.
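As a quick, hedged reference (the policy name is hypothetical; confirm the options in the CLI Command Reference Guide), these operations map to commands such as:

isi sync jobs pause eng-dr
isi sync jobs resume eng-dr
isi sync jobs cancel eng-dr
isi sync policies resolve eng-dr
isi sync policies reset eng-dr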
Challenge
Lab Assignment:
1) Create and assess a SyncIQ policy.
2) Execute a failover to a remote cluster.
3) Execute a failback.
NDMP
NDMP
NDMP Overview
• Reduces complexity
• Provides interoperability
• Allows faster backups
459 The Data Management Application, or DMA, uses an industry-standard protocol that facilitates a network-based method of controlling data backup and recovery
between NAS devices and data management applications. The DMA is responsible
for initiating the NDMP connection, provides authentication information, and passes
backup and recovery parameters to the NAS. The DMA also maintains the NDMP
client and device configuration and manages the backup catalog and NDMP client
file index.
• Gen 6 only
• 2xFC ports and 2x10GbE
• Two-way NDMP connection simultaneous with client connections
• Sold and deployed with node pairs.
• OneFS has no inherent PowerScale tools for troubleshooting the card; common tools are camcontrol, mt, chio, and various ocs_fc ioctls and sysctls.
Cluster
Backup Application
Tape Library
The NDMP two-way backup is also known as the local or direct NDMP backup.
460 The DMA (such as NetWorker) controls the NDMP connection and manages the metadata. The NAS backs up the data directly, over Fibre Channel, to a locally attached NDMP tape device.
If a cluster detects tape devices, the cluster creates an entry for the path461 to each
detected device.
Backup Application
Cluster
Tape Library
The NDMP three-way backup is also known as the remote NDMP backup.
461 If you connect a device through a Fibre Channel switch, multiple paths can exist for a single device. For example, if you connect a tape device to a Fibre Channel switch and then connect the Fibre Channel switch to two Fibre Channel ports, OneFS creates two entries for the device, one for each path.
Two-way or Direct Backup: The accelerators connect over fiber channel to the
backup tape library or virtual tape library system and gain greater backup
efficiencies.
Backups to virtual tape library systems are recommended for greater performance.
If possible use applications such as Data Domain with inline deduplication
capabilities to improve remote backup bandwidth efficiencies and storage
efficiencies. Understanding the bandwidth, the data change rate, the number and
average size of files can help determine the backup window.
Large datasets require either longer backup windows, more bandwidth, or both to
meet the backup SLAs. Two-way backups are the most efficient model and result in the fastest transfer rates. The data management application uses NDMP over
the Ethernet to communicate with the Backup Accelerator node. The Backup
Accelerator node, which is also the NDMP tape server, backs up data to one or
more tape devices over fiber channel. File History, the information about files and
directories is transferred from the Backup Accelerator node to the data
management application, where it is maintained in a catalog.
You can use the NDMP multistream backup feature, with certain DMAs, to speed
up backups.
• Back up concurrently.
• Same backup context.
• Backup context is retained.
• Recover that data in multiple streams.
• Data is recovered one stream at a time using CommVault Simpana.
• Back up concurrently: With multistream backup, you can use your DMA to
specify multiple streams of data to back up concurrently.
• Same backup context: OneFS considers all streams in a specific multistream
backup operation to be part of the same backup context.
• Backup context is retained: A multistream backup context is retained for five
minutes after a backup operation completes.
• Recover that data in multiple streams: If you use the NDMP multistream
backup feature to back data up to tape drives, you can also recover that data in
multiple streams, depending on the DMA.
• CommVault Simpana: If you back up data using CommVault Simpana, a
multistream context is created, but data is recovered one stream at a time.
DMA Supported
Avamar Yes
DMA support is listed in the table. If you enable snapshot-based incremental backups, OneFS retains each
snapshot that is taken for NDMP backups until a new backup of the same or lower
level is performed. However, if you do not enable snapshot-based incremental
backups, OneFS automatically deletes each snapshot that is generated after the
corresponding backup is completed or canceled.
NDMP DeepCopy
Using DeepCopy, the recall backs up the files, not the SmartLinks.
NDMP ShallowCopy
The ShallowCopy restore option is available for SmartLink files that are backed up
with the ComboCopy option.
NDMP ComboCopy
When using ComboCopy, the file recall backs up both the file and the SmartLink.
Backup and restore options can be set through NDMP environment variables in
the DMA or with the NDMP settings variables CLI.
DeepCopy: The file is recalled from the cloud and then backed up. Files can only
be restored as regular files. The DeepCopy backup environment variable value is
0x100 for BACKUP_OPTIONS and RESTORE_OPTIONS. The DeepCopy restore
option is available for SmartLink files that are backed up with the ComboCopy
option. If the DeepCopy or ShallowCopy restore option is not specified during
restore, the default action is to restore using ShallowCopy, but switch to DeepCopy
if the version check fails.
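A minimal, hedged sketch of setting these options from the OneFS CLI follows. The
isi ndmp settings variables syntax and the /ifs/data path are assumptions to verify
against your OneFS CLI reference; the 0x100 value comes from the text above.
# List the NDMP environment variables that are currently defined (syntax assumed)
isi ndmp settings variables list
# Force DeepCopy behavior for backups under /ifs/data (path and syntax assumed)
isi ndmp settings variables create /ifs/data BACKUP_OPTIONS 0x100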
NDMP Redirector distributes NDMP loads automatically over nodes with the FC
combo card.
NDMP Throttler manages the CPU usage during NDMP two-way sessions on
cluster nodes.
1: The NDMP Redirector and Throttler features are enabled only using the CLI (a command sketch follows the bullet list below).
2: NDMP daemon checks each node CPU usage, number of NDMP operations
already running, and availability of tape target used for the operation. If no suitable
node is available, the operation runs on the node receiving the request.
When enabled, the NDMP throttle manages CPU usage of NDMP backup and
restore sessions running on the nodes.
3: An internal NDMP session runs between the agent node and the redirected
node. The DMA does not notice any difference when the session is redirected.
You can enable the NDMP Redirector to automatically distribute NDMP two-way
sessions to nodes with lower loads. NDMP Redirector checks the CPU usage,
number of NDMP operations already running, and the availability of tape devices
for the operation on each node before redirecting the NDMP operation. The load-
distribution capability results in improved cluster performance when multiple NDMP
operations are initiated. NDMP traffic can overwhelm the Gen 6 nodes that are
deployed with the FC combo card.
If no suitable node is available, the operation runs on the node receiving the
request. When enabled, the NDMP throttle manages CPU usage of NDMP backup
and restore sessions running on the nodes.
• Three-way NDMP operations are not supported - redirection only happens with
FC connectivity.
• A redirected session fails if DMA connection breaks or DMA changes the
session to run a 3-way operation.
• The default CPU threshold value is 50, which means that the throttler limits
NDMP to use less than 50% of node CPU resources.
• The throttler threshold value can be changed using the CLI - example: isi ndmp
settings global modify --throttler-cpu-threshold 80
• Throttler settings are global to all cluster nodes.
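A hedged command sketch for enabling both features follows. The flag names are
assumptions based on the isi ndmp settings global command shown above;
confirm them with isi ndmp settings global modify --help.
# Enable automatic redirection of two-way NDMP sessions (flag name assumed)
isi ndmp settings global modify --enable-redirector true
# Enable the throttler and raise its CPU threshold from the default of 50 to 80 percent (flag names assumed)
isi ndmp settings global modify --enable-throttler true --throttler-cpu-threshold 80
# Review the resulting global NDMP settings
isi ndmp settings global view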
NDMP Statistics
OneFS 8.2 and later includes some minor changes to backup and restore statistics.
In the example, the Stub Files field is new in OneFS 8.2 and later. Stub files are
SmartLink files that track the file data. The field appears in the backup and the
restore statistics output.
Example output:
Objects (scanned/included):
----------------------------
Regular Files(scan/incl(reg/worm/sparse)): (0/0(0/0/0))
Stub Files(scan/incl(stub/reg/combo)): (4/4(4/0/0))
Directories : (1/1)
ADS Entries : (0/0)
Soft Links(scan/incl(slink/worm)) : (0/0(0/0))
Hard Links : (2/2)
Block Devices : (0/0)
Char Devices : (0/0)
FIFO : (0/0)
Sockets : (0/0)
Whiteout : (0/0)
Unknown : (0/0)
NDMP backup and restore operations can be done on data that is archived to the
cloud.
The NDMP backup can back up PowerScale CloudPools stub files. Data that is
associated with the stub file such as account information, local cache state, and
unsynchronized cache is also backed up. Data that is archived in the cloud is not
backed up unless DeepCopy is used. When the data is restored, the stub file and its
attributes are restored. In the event the source cluster cannot be recovered,
restoring the data to the disaster recovery cluster maintains the stub files. Clients
accessing the new cluster can access the data that is stored in the cloud.
The NDMP Data Server (NAS) sends data to a locally attached tape device or library.
Avamar/PowerScale Integration
Dell EMC Avamar provides fast, reliable NAS system backup and recovery through
the Avamar NDMP Accelerator. Avamar reduces backup times and the impact on
NAS resources, allowing easier and faster recovery.
To back up and restore data residing on NAS systems, Avamar uses a device
called an Avamar NDMP Accelerator (accelerator). The accelerator is a dedicated
Avamar server node that functions as an Avamar client. The accelerator uses
NDMP to interface with and access NDMP-based NAS systems. Avamar runs daily
incremental backups, which can be combined with the initial full backup to create
daily synthetic full backups. Because Avamar uses a hashing strategy on the source
to track changes, incremental backups are fast.
Data from the NAS system is not stored on the accelerator. The accelerator
performs NDMP processing and real-time data deduplication and then sends the
data directly to the Avamar server. The accelerator can be connected to either a
Local Area Network (LAN) or Wide Area Network (WAN) with respect to the
Avamar server. However, to ensure acceptable performance, the accelerator must
be located on the same LAN as the NAS systems.
Backup Considerations
Link: See PowerScale OneFS Backup and Recovery Guide for more
information.
Although it is not exactly a backup, due to the high capacities and large file counts,
using snapshots and replicating to a DR site is a common “backup” strategy. One
drawback of the snap-and-replicate approach is the lack of a catalog: you must
know what you want to restore, or search through a snapshot. Using a snap-and-
replicate strategy on PowerScale with OneFS protects against accidental deletions
and data corruption, and provides a DR copy of the data and file system.
Backup Enhancement
OneFS 8.2 introduces support for sparse files in a CommVault backup solution.
Note: This functionality is only supported with OneFS 8.2 and later,
and CommVault v11 SP10 and higher.
Shown is a simple depiction of a file whose logical and physical size is three
blocks.
Sparse Punch
With Sparse Punch, blocks within the middle of a file can be punched out
(deallocated). The logical size remains the same whereas the physical capacity is
freed. Reading Block 1 results in zeros, and writing to Block 1 once again consumes
physical capacity.
Command Output
CloudPools offers the flexibility of another tier of storage that is off-premises and
off-cluster.
CloudPools provides a lower TCO463. CloudPools optimizes primary storage with
intelligent data placement.
CloudPools expands the SmartPools framework by treating a cloud repository
as an additional storage tier.
CloudPools eliminates management complexity and enables a flexible choice of
cloud providers for archival-type data.
CloudPools Concepts
CloudPools moves file data from PowerScale to the cloud. The files are easily
accessible to the users as needed. Below are key CloudPools concepts that affect
the end users:
Archive464
464The CloudPools process of moving file data to the cloud. This process extracts
the data from the file and places it in one or more cloud objects. CloudPools then
moves these objects to cloud storage, and leaves in place on the local cluster a
representative file. This file is called a SmartLink file.
SmartLink File465
Inline Access466
465 A SmartLink file contains metadata and map information that allows the data in
the cloud to be accessed or fully recalled. SmartLink files are also called stubs or
stub files. If the SmartLink file archiving policy permits it, users can automatically
retrieve and cache data from the cloud by accessing the SmartLink file.
467CloudPools requires you to set up one or more accounts with a cloud provider.
You use the account information from the cloud provider to configure cloud
accounts on the PowerScale cluster.
468A cloud storage account is a OneFS entity that defines access to a specific
cloud provider account. These accounts are used to enable and track local use of a
cloud provider account. The cloud storage account configuration includes the cloud
provider account credentials.
469File pool policies are the essential control mechanism for both SmartPools and
CloudPools. OneFS runs all file pool policies regularly. Each file pool policy
specifies the files to manage, actions to take on the files, protection levels, and I/O
optimization settings.
CloudPools Setup
Running CloudPools requires the activation of two software module licenses:
SmartPools and CloudPools. To activate the licenses, upload the signed license file
to OneFS.
− Once a signed license is received from Dell EMC Software Licensing Central
(SLC), upload the file to your cluster. To add the license, run:
isi license add --path <file-path-on-your-local-machine>
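After adding the license file, a brief hedged check confirms that both modules are
licensed (the exact output columns may differ by OneFS version):
# Confirm that SmartPools and CloudPools appear with an activated status
isi license list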
Configure CloudPools
• To view the top-level configuration settings for CloudPools, run the following CLI
command:
− isi cloud settings view
• To modify CloudPools settings, run the following CLI command:
− isi cloud settings modify
• If the primary encryption key is compromised, generate a new primary
encryption key by running the following command:
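A hedged sketch of the command; the regenerate-encryption-key subcommand
name is an assumption, so verify it in the isi cloud settings CLI help before use:
− isi cloud settings regenerate-encryption-key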
CloudPools supports the following cloud providers and associated storage types:
470A secondary PowerScale cluster provides a private cloud solution. The primary
cluster archives files to the secondary cluster. Both clusters are managed in your
corporate data center. The secondary cluster must be running a compatible version
of OneFS. To act as a cloud storage provider, the PowerScale cluster uses APIs
that configure CloudPools policies, define cloud storage accounts, and retrieve
cloud storage usage reports. These APIs are known collectively as the PowerScale
Platform API.
• Microsoft Azure474
• Google Cloud Platform475
• Alibaba Cloud476
• You can create and edit one or more cloud storage accounts in OneFS.
When configuring CloudPools to use Amazon C2S, you must specify the S3 Storage Region in the connection settings. When you first
establish an account with Amazon C2S, the cloud provider gives you an account ID
and allows you to choose a storage region. Amazon C2S offers multiple storage
regions in the U.S. and other regions of the world.
474You can configure CloudPools to store data on Microsoft Azure, a public cloud
provider. CloudPools supports Blob storage with the Hot access tier on Microsoft
Azure; Cold blobs are not supported. When you establish an account with Microsoft Azure,
you create a username. Microsoft provides you with a URI and a passkey. When
you configure CloudPools to use Azure, you must specify the same URI,
username, and passkey.
475 CloudPools can store data on Google Cloud Platform, a public cloud provider.
CloudPools supports Standard, Nearline, and Coldline storage types on Google
Cloud Platform. Google Cloud Platform must be set to interoperability mode. Once
interoperability mode is enabled, you can configure Google Cloud Platform as the
provider in OneFS CloudPools.
476 CloudPools can store data on Alibaba Cloud, a public cloud provider.
CloudPools supports Standard OSS storage on Alibaba Cloud. When configuring
Alibaba Cloud as the provider, you must provide the Alibaba URI, username, and
passkey. Alibaba offers multiple sites in the U.S. and other areas of the world. The
URI indicates your chosen connection site.
• Before creating a cloud storage account, establish an account with one of the
supported cloud providers.
• OneFS attempts to connect to the cloud provider using the credentials you
provide in the cloud storage account.
• To create an Amazon C2S S3 account, perform the following steps using the
OneFS CLI:
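As a hedged illustration only (the isi cloud accounts create argument order and the
placeholder values are assumptions; the options that an Amazon C2S account
requires may differ, so check isi cloud accounts create --help):
# Create a cloud storage account (all values are placeholders)
isi cloud accounts create <account-name> <account-type> <uri> <account-username> <key>
# Verify that the new account appears in the list
isi cloud accounts list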
Once the Create a Cloud Storage Account dialog box closes, a new cloud
account appears in the Cloud Storage Accounts list. The Name, Type, State,
Username, and URI associated with the account are displayed.
1: Enter the fully qualified URI for the account. The URI must use the HTTPS
protocol and match the URI used to set up the account with your cloud provider.
2: Enter the cloud provider account username. This username should have been
set up with the cloud provider.
3: Enter the password or secret key that is associated with the cloud provider
account username.
4: If you have defined one or more network proxies and want to use one for this
cloud account, select the proxy name from the drop-down list.
Cloud Policies
477A file pool policy can specify a local storage target, a cloud storage target, or
both.
[Graphic callouts: CloudPools relationship restored during failback; target access to CloudPools data]
Supporting the synchronization of stub files does not change SyncIQ capabilities,
including failover and failback for disaster recovery.
Both the source cluster and target cluster are CloudPools aware, meaning the
target cluster supports direct access to CloudPools data.
• You can create a SyncIQ policy that replicates full files rather than SmartLink
files when copying data from the primary source cluster to a secondary target
cluster.
• When you create a SyncIQ policy, you can modify the Deep Copy for
CloudPools setting.
• Connectivity to each ECS node in a VDC is established using the public node IP
addresses.
• In this configuration, the load balancers communicate with each other to
balance access across the VDCs in addition to between the ECS nodes.
• In a disaster recovery configuration for PowerScale CloudPools, the secondary
PowerScale cluster is also configured for CloudPools access to the ECS
namespace.
An ECS478 solution offers a single-site, two-site, or three-site configuration to
meet organizational requirements and growth. Combining ECS with CloudPools
has additional advantages and disadvantages.
• Obtain the GUID that is associated with the cloud data by running the following
command, on the cluster that originally archived to the cloud:
− isi cloud access list
• On the primary cluster, remove write access to the cloud data:
− isi cloud access remove <GUID>
• On the secondary cluster, give write access to the cloud data.
− isi cloud access add <GUID>
• Now the secondary cluster can write modifications to the cloud, rather than
storing the modifications in cache.
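Putting the steps together, a brief end-to-end sketch (the GUID value is a
placeholder reported by the first command):
# On the cluster that originally archived the data, find the GUID for the cloud data
isi cloud access list
# On the primary cluster, remove write access for that GUID
isi cloud access remove <GUID>
# On the secondary (DR) cluster, grant write access for the same GUID
isi cloud access add <GUID>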
Best Practices
• For better performance, use timestamps for cloud data archival and recall.
• You can gain the most benefit from CloudPools, in terms of freeing up storage
space on your cluster, by archiving larger files.
• Create exclusive accounts for CloudPools purposes; this prevents conflicts that
might lead to data corruption or loss.
• Use entirely separate accounts for other cloud applications with your cloud
provider.
• If you are performing a rolling upgrade to a OneFS 8.2.x version, and intend to
use CloudPools for the first time, wait until the upgrade and commit complete.
− By waiting for the upgrade to complete, you start using CloudPools with the
most recent CloudPools upgrades.
Challenge
Lab Assignment:
1) Configure a cloud storage account and create a CloudPool.
2) Demonstrate SyncIQ and CloudPools integration.
Module Objectives
Scenario
Solution
[Graphic callouts: Show by Tag; Project across multiple volumes; Show the size]
The graphic (DataIQ V1 interface) shows the report on a configured tag using different options.
• The manager can drill into the file and may discover the video rendering files
were moved to a lower tier of storage.
• The report enables the manager to quickly discover the potential issue.
• The chart shows the storage capacity that each project consumes and which
volume or volumes host the project.
• In a large environment, elements of a project are likely to be located across
several or many different volumes.
• In this example, auto-tagging is configured. You can configure auto-tagging on
the Settings, Data management configuration page.
DataIQ Overview
Dataset management functionality was released in DataIQ 1.0 and is called Data
Management479.
DataIQ 2.0 introduces new storage monitoring capabilities for PowerScale clusters.
480It provides tools to monitor and analyze a cluster’s performance and file
systems, and simplifies administration for tasks such as cluster health monitoring,
cluster diagnostics, and capacity planning.
Architecture
481 It also provides additional OneFS functionality such as quota analysis, tiering
analysis, and file-system analytics. It is also capable of providing a graphical output
for easy trend observation and analysis. It uses OneFS PAPI to collect data for
storage monitoring and does not use cluster resources beyond the data-collection
process.
• Data management482
• Storage monitoring483 (PowerScale only)
482 Data management uses the metadata scanner to scan and index unstructured
file or object metadata and stores it in an on-disk database (RocksDB) for data
management. RocksDB is used for data management. The default backup path is
/mnt/ssd/data/claritynow/ssd/backup.
483 After adding a PowerScale cluster to DataIQ and establishing a connection with
DataIQ, a collector is created automatically. It can collect monitoring data and store
it into TimescaleDB for storage monitoring. TimescaleDB is used for storage
monitoring. The default backup path is /opt/dataiq/backup/timescale.
Note: See Dell EMC DataIQ: Best Practices Guide for more
information.
Sizing Guidelines
Sizing Challenges
Disk-space requirements depend on various factors that are used for data
management or storage monitoring.
Data management:
484There is no one-size-fits-all set of hardware resources that can be planned for
DataIQ because every environment is different. The CPU, memory, and network
are shared resources for dataset management and storage monitoring.
485 This mount point (or subdirectory or folder) must exist before installation either
in the form of a simple folder or as a mounted partition.
Storage monitoring: The general considerations for storage monitoring disk space
are as follows:
The general rule for sizing is to add additional disk resources to the planned
capacity487.
486This can be in the form of a single large SSD VMDK for the entire operating
system and DataIQ application (such as the OVA by itself), or an SSD partition or
VMDK mounted to /mnt/ssd.
487 It is a best practice to assign designated disk partitions or VMDKs for the
separate functions of storage monitoring and dataset management. These can be
either static primary partitions (for solutions not expected to exceed assigned disk
resources) or logical-volume-managed partitions so that the solution can be
extended if needed.
Data management:
• When sizing a DataIQ solution, consider the complexity of the customer storage
environment and the total number of files, folders, and objects per volume.
• Workloads can be distributed across the external data mover nodes in a scale-
out fashion.
Storage monitoring: Consider the following general sizing rules for storage
monitoring:
• Always add additional disk resources to the planned capacity to allow for
monitoring-data growth.
• Never undersize the estimated size of the disk requirements for monitoring.
• The disk-space requirement depends on various factors488.
Note: DataIQ does not support IPv6. See the Dell EMC DataIQ: Best
Practices Guide for more information about sizing guidelines.
Challenge Example
The graphic shows a representation of the challenge that businesses face when
mining their data.
Example of a project: there could be thousands, millions, or even billions of paths for project x-ray.
• Here the business has three clusters that may be geographically dispersed.
• Each cluster hosts different business units489 and each business unit has work
that is associated with a product feature called x-ray.
489 Furthermore, each business unit has different structures, varying path names,
and path lengths.
The challenge is: how can a business find, analyze, and report on a project that
has elements in different locations, with different names, and potentially millions of
different paths at different depths?
Solution Example
The goal or solution is to extract the data of the organization490 using its business
knowledge.
Geography, department, project, and quarter are the categories of the custom data.
• DataIQ can address the business challenge by applying tags to key categories.
490You can begin identifying or discovering the custom data and then designating
distinguishable categories for the paths. The key is establishing the custom data.
Having the custom data enables you to mine the data.
• Once the key categories are identified, the auto-tagging configuration file is built
using the categories to create the tags and rules.
• You can then generate reports that are based on the tags, and then act on the
information.
DataIQ starts with predefined security groups and roles. You can create your own
groups or import groups from tools like Active Directory, and then grant those
groups access to security roles.
Note: See the Dell EMC DataIQ Admin Guide 2.0.0.0 for more
information about sizing guidelines.
Cluster Summary
The STORAGE MONITORING > Cluster Summary page shows a high-level view
of the monitored clusters. It provides the overall health of clusters at a glance.
There are four common elements at the top of the Cluster Summary page. These
four common elements also apply to other dashboards.
1 2 3 4
1: Clusters: You can choose which specific clusters are shown or filtered through
the Cluster dropdown list. The drop-down list of clusters is always sorted
alphabetically.
2: Reports: You can share reports by using the Share icon. This option helps you
create a link, and then you can share the link or use it to manually create a
bookmark. Click Copy to copy the URL to the clipboard. There are three options:
• Current time range: Choose whether the report shows the time range that is
selected, or whether the time shows relative to when the reader views the
report.
• Template variables: Choose whether the report shows using the system
default variables, or the variables you selected when generating the report.
• Theme: Choose whether the report displays using a light background, a dark
background, or the selected background.
3: Time range: You can change the time range on reports by selecting the Clock
drop-down list. The UTC time zone is used no matter which time zone the user is
in. The time zone is not configurable. Also, you can set a specific time range or a
future time when using the absolute time range. DataIQ expands the graph to cover
that specific time range in the analyze view.
4: Reports refresh: You can refresh reports to get updated data by clicking the
Refresh button, or change the refresh frequency by selecting the Refresh drop-down
list. Also, you can set the refresh frequency to off to disable report refreshes.
Note: See Dell EMC DataIQ: Storage Monitoring Solution Guide for
more information.
Dashboard Navigator
Capacity Dashboard
The Capacity Dashboard page provides a summary of capacity usage from the
tiers, node pools, and directories perspectives.
There are five new elements at the top of the Capacity Dashboard.
The Client and User Dashboard page provides detailed information about
protocol operations, protocol latency, and network throughput for clients and users
on the network.
Dedupe dashboard
The Dedupe Dashboard page provides the overview on the dedupe and
compression saved and storage efficiency on all clusters.
Filesystem dashboard
The Filesystem dashboard page provides shared directory details which pertain
to file access deferment rates (deadlocked, contended, locked, blocking, and so
forth).
Hardware dashboard
The Hardware dashboard page provides a view of the OneFS cluster state from a
hardware perspective. It includes details of operations and activities for each of the
nodes.
Network dashboard
The Network dashboard page provides a network view for the system
administrator from a protocols and network throughput perspective.
System dashboard
The System dashboard page provides a top N jobs list that may need attention,
and SRS connectivity status.
Graphs and charts are provided that compare physical usage and logical usage
for the top N directories showing the greatest growth rate. The top N directories
are sorted by quota delta to show which directories are pushing assigned quota
limits.
• Top N tiers and node pools (by time to full or growth rate).
− The Capacity details of tier page can be accessed for a specific tier using
the Capacity dashboard. It shows the capacity that is used for a specific tier
and details of the node pool for the tier.
− The Capacity details of node pool page can be accessed for a specific
node pool using the Capacity dashboard. It shows the capacity that is used
for a specific node pool. You can quickly view significant changes in the
node pool. This line graph helps you to verify when users are writing
significant amounts of data and track large deletes in the form of negative
capacity changes.
• Top N directories (by capacity, file count, and their growth rate).
− The Capacity details of directory page can be accessed for a specific
directory using the Capacity dashboard. It shows the capacity details of a
specific directory and capacity details by user on the directory.
• Top N directories (by quota proximity or overrun).
− The Capacity and Quota details of Directory page can be accessed for a
specific directory using the Capacity dashboard. It shows the capacity and
quota details of a specific directory and capacity details by user on the
directory.
Client and User Dashboard: The Client and User Dashboard includes the
following three sections:
• Top N client IPs and user IDs by protocol ops, protocol latency, and network
throughput.
− This section shows the protocol operations rate, protocol latencies, and
network throughputs on the client over time. You can hover anywhere over
the line graphs to view the point-in-time values of each metric. The line
graphs help you focus on the protocol performance at the client level so that
you can identify abnormal performance on the client.
• Top N user IDs by quota proximity or overrun (quotas include hard, soft, and
advisory).
− The Performance details of user and group page can be accessed for a
specific user or group using the Client and user dashboard. It shows the
performance details of a specific user or group, and performance by node
and performance by protocol on the user and group.
− The User/Group Quota page can be accessed for a specific user or group
using the Client and User Dashboard. It shows quota details of a specific
user or group.
• Top N user IDs by capacity, file count, and their growth rates.
− The Capacity details of user/group page can be accessed for a specific user
or group using the Client and user dashboard. It shows the capacity details
of a specific user or group.
Cluster performance report: The Cluster Performance Report includes the
following three sections:
• Protocol operations: This section provides the details that are needed to
monitor the protocol operations rate by cluster over time for front-end protocols.
It also provides the operations rate for all protocols by cluster.
• Protocol latency: This section provides the details that are needed to monitor
protocol latency by cluster over time for front-end protocols. It also provides the
latency for all protocols by cluster.
• Protocol throughput: This section provides the details that are needed to
monitor protocol throughput by cluster over time for front-end protocols. It also
provides the throughput for all protocols by cluster.
• Total capacity used: The average percent of capacity used for all selected
clusters.
• Total dedupe saved: The capacity saved for all selected clusters, including
SmartDedupe (offline deduplication) and inline-dedupe.
• Compression saved: Compression capacity saved for all selected clusters.
• Storage efficiency by clusters: This chart shows the storage efficiency for the
selected clusters.
• SyncIQ: This section provides the details that are needed to monitor all SyncIQ
policies that are associated with failed and Recovery point objective (RPO) jobs.
• Network Data Management Protocol (NDMP): This section provides details to
help you monitor NDMP halted- and invalid-session-state information and
session details, and to see whether any data is not protected.
• Internet Content Adaptation Protocol (ICAP): This section provides the
details that are needed to monitor all ICAP policies associated with failed jobs.
• Snapshots: A custom view of this section is defined by selecting values for the
following three elements. You can select different values to define a different
view of this section.
− The CRUD events and subpaths page can be accessed for a specific L3
path using the Filesystem Dashboard. It shows the Create, Read, Update,
Delete (CRUD) events and subpaths details of a specific L3 path. You can
choose one of predefined values in the Top N sub folders drop-down list
including 5, 10 and 15.
Hardware dashboard: The Hardware dashboard includes the following four
sections:
• Top N disks
• Top N nodes by performance
• Top N nodes by activity
• Node events
Network dashboard: The Network dashboard includes the following two sections.
There are two new elements at the top of the Network dashboard:
• Protocols sorted by: You can choose one of predefined values including
throughput and latency.
• Interfaces sorted by: You can choose one of predefined values including
throughput, packet, and error.
System dashboard: The System dashboard includes the following two sections:
• Top N jobs
• Cluster SRS connectivity issues
There is one new element at the top of the System dashboard.
Cluster Details
The Cluster details page is accessed for a specific cluster using the Cluster
summary. It provides capacity, event, and node details, and node pool and
CloudPools information about a specific cluster.
[Graphic callouts on the Cluster details page:]
• Choose which specific tiers are shown or filtered through the Tiers drop-down list.
• Choose one of the predefined values including All, critical, information, and warning.
• Choose one of the predefined values including All, unresolved, and resolved.
• Capacity details (cluster level): The percent of capacity usage for the
selected cluster is displayed in the left side of the chart. The right side of the
chart displays the capacity by datatype.
• Event details:
− Total critical events: The number of critical events on the cluster.
− Total warning events: The number of warning events on the cluster.
− Total information events: The number of information events on the cluster.
− Event details: This chart shows the detailed information of the events on
the cluster; it includes Issue, Message, Event Group ID, Last event,
Severity, and Status.
• Capacity details by tiers: The capacity details by tiers in the cluster include
Tier, Used, Total capacity, Capacity used %, Growth rate/week, and Time to
full. You can analyze the Tier details report by clicking a specific tier link.
The Protocol Cluster Details page is accessed for a specific cluster using the
cluster performance report. It provides the protocol operation, protocol latency, and
protocol throughput for a specific cluster in the selected time range.
491 This section shows the protocol operations rate, protocol latencies and protocol
throughputs for all protocols on the cluster over time. You can hover anywhere over
the line graphs to view the point-in-time values of each metric on the left side of
figures. Also, the maximum and average values of each metric are displayed on the
right side of figures. This chart helps you first focus on the protocol performance at
the cluster level so that you can identify abnormal performance on the cluster.
492This section shows the protocol operations rate, protocol latencies and protocol
throughputs for SMB, NFS, and S3 on the cluster over time.
The goal of the DataIQ Analyze function is to provide organizations the ability to
view volumes from a business context.
Show/hide options
Analyze page
The graphic (DataIQ V1 interface) shows the Data Management, Analyze page default view with
the options panel open.
The graphic shows the default options panel on the Analyze page.
493This section shows the protocol operations rate, protocol latencies, and protocol
throughputs for protocols (exclude SMB, NFS, and S3) on the cluster over time.
1: You can view All volumes, Volumes, or Tags on the vertical axis. When selecting
the Tags option, the Tag categories option appears.
3: The horizontal axis is where you can change most of the elements. You can
show Size or Cost.
4: You can break out the horizontal bar chart by different elements. Shown are the
Breakout options.
6: Refresh analyze data updates the Analyze window. Reset analyze options
puts all the options in the options panel back to the default setting and hides the
panel.
DataIQ Data Mover is a plug-in that enables the movement of files and folders.
Use Data Mover to transfer files and folders between file systems, to S3 storage, or
from S3 storage. The S3 endpoint must be added and mounted to DataIQ.
• UI plug-in494
• Service plug-in495
• Workers plug-in496
The Data Mover plug-in contrasts with other data migration tools such as Isilon
CloudPools in that it is not an automated time-based data-stubbing engine. All data
movement is out-of-band in relation to normal file and object access. Data Mover
enables administrators to choose a single file, or an entire folder, for a Copy-Move
operation from the folder tree UI.
Data Mover may be installed on the DataIQ server for test purposes. However, the
best practice is to install Data Mover on a separate VM or host worker node.
Installing Data Mover separately isolates data I/O traffic from the CPU-intensive
database work that the DataIQ server performs. It is still necessary to install the
Data Mover plug-in on the DataIQ server first so that the service is running and
available for external worker nodes to call in.
495The service plug-in is the job manager that is installed on the DataIQ host. This
service accepts requests from the Data Mover UI Plugin and services the work
assignment requests from Data Mover Workers.
496 The workers plug-in is the transfer agent service, responsible for transferring
files. Install the Data Mover Worker plug-in on dedicated transfer hosts or the
DataIQ host. Administrators can deploy multiple Data Mover Worker hosts. If
installed on a dedicated host, the worker requires RHEL/CentOS 7.6 or 7.7.
The graphic shows an overview of the end-to-end process for moving data from a
PowerScale cluster to target. The target can be an S3 object target, ECS,
PowerScale, or supported third-party platform.
1: The DataIQ server scans and indexes a mounted volume. The Data Mover UI
plug-in and the Data Mover service plug-in are installed on the DataIQ host.
2: The network share is also mounted to the DataIQ Data Mover worker. The
Data Mover service and workers must be logged in to the DataIQ server.
4: The Data Mover service receives requests from the Data Mover UI and sends
instructions to the Data Mover worker.
5: After validating the file or folder path, the Data Mover heavy worker begins the
copy job to the target. Only the data is sent, not ACL and DACL information. No
file stubs are used. The full file is either copied to the target or copied with the
source deleted. Light workers perform pre-transfer job validation and path
preparation. Heavy workers perform pre-allocations and data transfer.
Administrators can configure the number of workers in the
/usr/local/data_mover_workers/etc/workers.cfg file.
Allocating too many heavy workers may negatively impact the stability of the host.
6: If the target is S3, the Data Mover uses embedded credentials and the Data
Mover worker logs in to the cloud subscription service. For ECS, the service opens
DataIQ vs InsightIQ
[Comparison table categories: Performance by Client and User; Cluster
Performance by Protocol; Cluster Performance by Node; Cluster Performance]
Challenge
Lab Assignment:
1) Analyze cluster storage using Data Management page.
2) Gather metrics using Storage Monitoring page.
HealthCheck
HealthCheck
HealthCheck Overview
• The OneFS HealthCheck tool is a service that helps evaluate the cluster health
status and provides alerts to potential issues.
• Use HealthCheck to verify the cluster configuration and operation, proactively
manage risk, reduce support cycles and resolution times, and improve uptime.
• CLI command for HealthCheck: isi healthcheck
• CLI example to view the checklist items (a hedged sketch follows):
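A hedged sketch; isi healthcheck checklists list and isi healthcheck items view are
referenced later in this section, while the checklists view subcommand is an
assumption to confirm with isi healthcheck --help:
# List all available checklists
isi healthcheck checklists list
# View the items in the cluster_capacity checklist (subcommand name assumed)
isi healthcheck checklists view cluster_capacity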
The graphic shows the checklist items for the cluster_capacity check. The
HealthCheck terms and their definitions are:
• Checklist497
• Checklist item498
Running HealthCheck
The example shows selecting the Run option for the cluster_capacity checklist.
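A brief, hedged CLI equivalent of the WebUI Run option (isi healthcheck run is
shown later in the best practices; the evaluations subcommand name is an
assumption):
# Run a single checklist rather than all checks
isi healthcheck run cluster_capacity
# List completed evaluations and their pass or fail results (subcommand name assumed)
isi healthcheck evaluations list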
Viewing an Evaluation
Evaluation showing
failures
You can view the evaluation from the HealthChecks tab or the Evaluations tab. For
a failed evaluation, the evaluation file shows the checklist items that failed.
HealthCheck Results
Whenever an evaluation of a cluster takes place, the system provides a result:
either Pass or Fail.
Scenario: View the percentage of service life remaining for the boot flash drives in
each node of the cluster.
• Emergency499
• Critical500
• Warning501
• Ok502
499The SSD boot drive has reached its smartfail threshold (100% used or 0% left)
as defined by the manufacturer.
500
The SSD boot drive has reached its end-of-life threshold as defined by the
manufacturer. Contact PowerScale Technical Support for assistance.
501
The SSD boot drive is approaching its end-of-life threshold as defined by the
manufacturer. Contact PowerScale Technical Support for assistance.
• Check the SyncIQ checklist by listing all the available checklists. To do so, run: isi
healthcheck checklists list
502
The SSD boot drive has sufficient wear life remaining, as defined by the
manufacturer.
Run the isi healthcheck items view <name> command to view details about a particular
item, where <name> is the name of the item.
Email Notifications
• delivery_enabled
• delivery_email
• delivery_email_fail
Email Settings
If you update the delivery_email parameter, the results of all evaluations are sent to
the specified address(es), regardless of pass or fail status.
HealthCheck Delivery
If you update the delivery_email_fail parameter, the results of failed evaluations are
sent to the specified address(es). You can use either of these parameters
independently.
Verify Settings
Once the settings are configured, the results are sent to [email protected]
• To verify the settings run: isi email view and isi cluster contact
view
Resources
Best Practices
• Verify the cluster is running the latest Roll Up Patch for HCF (HealthCheck
Framework)
• Verify the latest version of IOCA (Isilon On-Cluster Analysis) has been updated
to HCF.
• A complete run of the HCF checks can be initiated by running: isi
healthcheck run all.
• For other run options, see the isi healthcheck command reference.
Challenge
Lab Assignment:
1) Run and view a HealthCheck evaluation.
2) Configure email notification for a HealthCheck evaluation.
Performance Foundation
Performance Foundation
• The F800 and F810 are all-flash solutions that cater to high-performance and
high-capacity needs.
• The F600 provides larger capacity with massive performance in a cost-effective
compact form factor to power the most demanding workloads.
• Target workflows for F800 and F810 are in the digital media503, electronic
design automation504, and life sciences areas505.
• Target workflows for the F600 are M&E studios, hospitals, and financials that need
performance and capacity for demanding workloads.
• The F800/F810 competes against the other all-flash vendor solutions for
workflows that depend on high performance.
H600 Performance
• The H600, high performance, SAS-based node is geared toward cost optimized
work environments, but it still produces high-performance numbers.
• It targets verticals such as digital media506 and life sciences507 that do not need
the extreme performance of the F800.
• It is a standard four-RU solution with predictable performance even as it
scales.
• The H600 provides high-density performance that supports 120 drives per
chassis.
• The H500 and H400 hybrid nodes are built for high performance and high
capacity, ideal for utility workflows such as enterprise file services508,
analytics509, ROBO510, and home directories.
• The H500 gives you predictable performance even as it scales.
• The H400 is a capacity optimized solution with an element of performance.
510 The ideal use cases for Gen 6.5 (F200 and F600) are ROBO (remote office/branch
office), factory floors, IoT, and retail. Gen 6.5 also targets smaller companies in the
core verticals, and partner solutions, including OEM. The key advantages are low
entry price points and the flexibility to add nodes individually, as opposed to adding
node pairs in Gen 6.
• The A200 is an active archive node that is optimized for a low cost per TB.
• The PowerScale A2000 is a deep archive solution with the lowest cost per TB.
• Typical workflows are large-scale archives511, disaster recovery targets512, and
general-purpose file archiving513.
511 For large-scale archiving, data storage that offers unmatched efficiency to lower
costs.
• Knowing the primary workflow that the cluster is meant to handle may be a
luxury in predicting the incoming requests.
• Understanding the workflows and aligning the workflows with known profiles
can help administrators prepare accordingly.
Complex Workflows
Once the baseline is established in terms of real metrics, start describing in greater
depth the actual situation on the cluster.
• Baseline Information514
− Example - Video Surveillance 515
• Multiple Purposes516
− Example - Home Directories and Video Streaming517
• Management518
514 If a cluster serves precisely one need, and nothing ever changes, your baseline
information correlates directly with your activities.
515A typical case may be a cluster which is used as storage for a security camera
installation. As long as the models and usage of the cameras remains unchanged,
the storage needs are likely to be highly predictable. As old information is deleted
or archived, and new information is brought in at a nearly constant rate, the net
cluster usage may be constant.
516PowerScale clusters are frequently used for multiple purposes in parallel. Every
case is different, and the flexibility of PowerScale clusters means that there are
often multiple functions that are implemented on a single cluster.
517 A single cluster might contain home directories for a wide variety of users. The
cluster may also host streaming videos of corporate events and object storage for
internal applications. The well-prepared storage administrator understands what
those load elements are, and differentiates them to understand what factors drive
changes on the cluster.
518 Often workflows are separated into access zones for easy management. Such
cases justify monitoring different access zones separately to establish what their
different baselines are. It can guide delivery of resources where they would be best
applied.
521Application stakeholders need data to: Plan for future growth of existing
applications, assertively query software vendors when there are workload changes
due to upgrades or replacements, provide quantitative requests, and set
performance expectations for storage administration and management.
What is the system doing? How much capacity is it using? What is the trending data
for the apps? What and when is app software upgraded? What storage workloads
exist? When do I need more workloads and why?
525Diagram the network topology completely. Leave nothing out. Pictures can
resolve many issues. For a LAN, itemize gear models, speeds, feeds, maximum
transmission units (MTUs) per link, Layer 2 and 3 routing, and expected latencies.
Perform a performance study using the iperf tool for network performance
measurement. Iperf is distributed with OneFS. For a WAN, itemize providers,
topologies, rate guarantees, direct versus indirect pathways, and perform an iperf
study.
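A minimal, hedged iperf sketch for a LAN study; the IP address is a placeholder and
the flags are standard iperf options, so confirm the installed version and option
syntax on your nodes:
# On a cluster node, start iperf in server mode
iperf -s
# From a client host, run a 30-second test with 4 parallel streams against the node (placeholder IP)
iperf -c 192.168.1.10 -t 30 -P 4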
526You can determine storage stack performance by learning over time what your
normal performance is, and how to recognize when it is not normal. All clusters
have a unique configuration with a unique dataset and workload, and therefore you
are observing a unique result from your ecosystem.
Determine How Users Interact with the Application: An example of inefficient
serialized requests is a CAD application needing to load 10,000 objects from
storage before rendering a drawing on the user's display.
Cluster data is made available through SNMP, through the InsightIQ application,
and through the isi statistics command line. In all these cases, information
is made available when it is produced. The cases allow for alerting based on
current thresholds, accumulation in monitoring infrastructure, or active monitoring in
real time. These tools can answer the question: “What is the situation right now?”
Events such as a drive going bad, a configuration element being changed, an
advisory quota being reached all produce such immediate information. In addition
to a stream of immediate information, OneFS also makes recorded information
available.
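As a hedged illustration of pulling cluster data over SNMP (the community string,
host name, and enterprise OID shown are assumptions; check your cluster's SNMP
monitoring settings for the real values):
# Walk the PowerScale/Isilon enterprise MIB subtree from a monitoring host (values are placeholders)
snmpwalk -v 2c -c public cluster.example.com .1.3.6.1.4.1.12124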
• Performance is the predictable data delivery within variance limits that avoids
lost production or increased production costs.
• Optimizations are actions that improve or restore data delivery.
• Performance and optimizations are building blocks to understanding the cluster
as a whole. Metrics include latency, throughput, and duration.
A trend exists because of changes between the past and the present. To measure
these changes, a starting point or baseline is needed. An easy way of establishing
a baseline of information for most metrics is with InsightIQ. Use InsightIQ with file
system analytics (FSA), and monitor activity closely to see the cluster’s behavior.
The graph shows the InsightIQ throughput rate with a spike in activity.
• Usage - how much space is taken? How busy are the CPUs? How much RAM
is used for caches?
• Performance - what are typical throughput and latency figures?
• Cycles - when is peak time? How long does it last? How much slack time is
there?
It can be difficult to precisely determine which functions are placing which loads on
a cluster, but fortunately the PowerScale monitoring tools offer sound options.
• When examining the environment, establish what the key properties of the
workloads are.
• There are tool options available in PowerScale to monitor the
workloads.
Report data as seen from InsightIQ (breakout by node pool over 6 hours).
Some loads originate from particular clients. Some operate at particular times, such
as end-of-year bookkeeping or scheduled virus scans every evening. Some loads
originate from particular applications and therefore have particular access patterns.
Some loads interact with the cluster through particular protocols, or use particular
features of those protocols, such as OpLocks, in particular ways. Some loads are
aggressive users of particular datasets.
The easiest part is to determine which are the largest data directories, or if the
cluster roles are split among pools, which pools are busiest. The InsightIQ capacity
estimation tool affords a good way of seeing how much capacity a cluster offers as
a whole. If quotas are enabled, another good source of information is quota
reporting. Quota reports are accessed through either the OneFS web
administration interface or the InsightIQ web interface. SmartConnect zones offer a
good way of differentiating separate data flows to and from the cluster. Client
activity measures can help you differentiate the quiet and intermittent work load of
a home directory scenario from the heavy activity of an active Hadoop installation.
Even without SmartConnect, client activity reports are available as long as
client addresses are differentiated in some respect. Client activity reports can
determine which functions are placing the greatest load on the cluster.
528Comparing short and long-term events allows for differentiating the loads that
are created by different applications, or usage profiles of different applications.
Comparisons can distinguish between a short-term blip in the numbers, and an
actual trend over the longer term.
Cluster capacity report as seen from InsightIQ, with a trend period of one month.
Trend Analysis
• Administrators cannot rely upon generic ideas about what data is important or
which trends are significant. Skilled administrators examine their environments.
• They then determine which data and trends are more important in their
particular context, based on an understanding of the workflows prevalent in their
environments.
• Trend analysis helps answer the question: “What will happen?”
It is easy to monitor immediate signals and see when they cross thresholds.
Monitoring is important, but insufficient for the best storage administration
practices. Good practice includes being able to predict future activities and
performance. An example is monitoring the usage level of a cluster to predict when
it needs upgrading to meet the user base’s needs. The storage administrator must
be able to see the trend to anticipate the future needs. Most SNMP management
systems accumulate and display data that is exported via SNMP to provide for
trend analysis, but InsightIQ offers substantial trending capabilities as well.
Safety Margins
Filling Cluster
• It is possible to fill a cluster to 100%. OneFS does not prevent this from
happening.
• A full cluster is a bad scenario. The cluster completely locks up and refuses
logins except from the console.
• A prevention measure to consider is having the right protection for the workload.
− Example530
• Another measure is to enable automatic deletion of snapshots.
• Features such as SmartQuotas, deduplication, and file filtering are a few areas that
help prevent trouble by limiting the damage done by abusive clients and
notifying you when thresholds are approached.
530 Do you want all your data protected at 5x, consuming much more
capacity than 2d:1n protection? Home directories may need less protection than a
vital repository where customer information is stored. You can survive the loss
of home directories, but losing vital customer information, such as in-progress
engagements, can affect the company's bottom line.
and purchase sufficient storage for those needs. Maintaining VHS for emergencies
related to data loss is a best practice, and generous headroom is better.
531 Manually deleting B-trees in the file system can temporarily alleviate the
situation. Deleting B-trees is not a safe or desirable option. Deleting B-trees
involves manually and directly editing the literal file system’s internal data
structures. This scenario involves data loss, Severity 1 support calls to get help in
identifying, editing, and deleting items in the data structures, conference calls with
executives and possible weeks of downtime. Such scenarios should be avoided
whenever possible.
532 Virtual Hot Spares (VHS) can be disabled as a temporary measure to get the
cluster to operate again. When disabling VHS, its reserved space can be
recovered. Disabling VHS also means that there is less capacity available to resist
any kind of hardware failure, so it is not a good situation either. Even before a cluster
is completely full, it starts to suffer when capacity usage is over 80%. The issue is a
consequence of many confounding factors, including hard drive physics and the
calculations that are required to maintain an optimized data layout. The fuller the
cluster, the greater the impact of a hardware failure, increasing the potential of
filling the cluster to capacity.
Performance Analysis
Performance Analysis
533 How does the application work? What are the user interactions with the
application? What is the network model? What are the workload-specific metrics for
networking protocols, disk I/O, and CPU usage?
534Data flow models are important to show the processes that are used for
transferring data.
535 There are fewer outliers that are spread around the chart. Most of the results
are contained at the bottom of the graph in a specific area. The results indicate that
the requests were responded to in an appropriate time period. The results that are
shown are generally under 25 milliseconds. The graph depicts a cluster with plenty
of resources to support its workflows.
Administrators can adapt a data flow model for their environment. Models can help
understand the flows of information within the system and the processes that act
upon the information. This example shows the simplest model of accessing a file
share. A more thorough model would include the networking processes. Knowing
the processes can not only help isolate a problem, but also identify the processes or
areas to analyze. If a user is denied access to a cluster's file share, a model
may show that the problem is likely in the authentication process. If the user cannot
open a file, then the issue could be permissions or ownership on the file. If the user
cannot save data to the file share, enforced quota limits may be reached.
Performance Baseline
Drive Metrics
Term Definition
Utilization/Busy Average of time the disk was busy over the sample
interval
Disk Percent Busy Average of time the disk was busy over an interval,
expressed as a percentage
Disk Latency Sum of seek time, rotation time, and transfer time
Service Time Time from the device controller request to the end of
transfer, including delays due to queuing and latency
Response Time Disk service time plus all other delays such as
network, until data is at the host
Cluster Metrics
The two metrics to highlight first are the PowerScale cluster portion of the end-to-
end model, the isi statistics drive and isi statistics heat. Many of
the metrics are guidelines only, not laws. The only definitive answer is, what values
work for the workload?
Disk
• To monitor drive performance, the best indicators are how long an I/O operation
is waiting—TimeInQ, and how many are waiting—Queued.
• To test the impact of different levels of disk I/O, see “How can I generate
different levels of disk load on a Linux system.”
• Different drives in same cluster
• TimeInQ should be < 40 ms (SATA < 7; SAS < 3; SSD < 1)
• Queued should be near zero; best performance when < 2
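A hedged sketch of collecting these metrics from the CLI; the sort, node, and repeat
flags are assumptions to confirm with isi statistics drive --help, while the base
commands are named in the text:
# Watch per-drive TimeInQ and Queued across all nodes (flag names assumed)
isi statistics drive --nodes=all --sort=TimeInQ --interval=5 --repeat=12
# Show the most active files and directories by operations per second
isi statistics heat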
Heat
With the heat map, storage administrators can tell what is causing so many IOPS to
reach the drives.
[Graphic: heat map output showing operations per second]
Disk: Key areas that trigger an investigation are protocol and operational latency.
Remember to compare against the baseline outputs when reviewing metrics.
Review how the drives are performing. If the cluster is unable to serve data to
clients fast enough, other performance parameters do not matter because we are
waiting on the disks. The graphic shows an example of using the output that is
shown and comparing it to the baseline data. These drives are overstressed by
more than 120%. A 3.5” SATA drive can provide 100 IOPS sequentially. A workflow
that is mostly reads or mostly writes reaches the 100 IOPS mark, or even exceeds
it under a best-case scenario. As shown, the drive IOPS are above 220, meaning
these drives are being pushed beyond their limits and may indicate a disk-bound cluster.
Heat: Use the heat map to review whether there are any applications or files that are resource hungry and consuming a great deal of IOPS. Review these applications to see if they need such a high level of performance, or if it is acceptable to throttle them. If the drives are stressed, then there is a need to find out where the load is coming from. Isolating the issue is best accomplished with the isi statistics heat command. This command provides a list of the most used files and paths from an IOPS perspective. For example, the command was run while the metadata intensive application was running. The graphic shows almost 10,000 locks occurring between the globalcache directory and the IssueCollector directory. In this scenario, it would be wise to review what is causing IOPS on those folders if they continue to be repeat offenders. Features such as snapshots and SyncIQ can leave extensive block remapping on the directories. Remapping can cause increased overhead in terms of latency when accessing those directories. If the directory itself is not under snapshots, or if it must remain under snapshots, the next step is to use the metadata write acceleration strategy. Change the strategy on the directory with isi set --strategy=metadata-write. The cluster then directs any incoming metadata writes to the SSDs.
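To isolate which paths are generating the load, the isi statistics heat output can be sampled directly from the CLI. The example below is a minimal sketch; the available sorting and filtering flags depend on the OneFS release, so only standard shell filtering is used here, and the IssueCollector path filter simply mirrors the example above.

# List the most active files and paths by operation type
isi statistics heat | head -20

# Narrow the view to a suspect directory, such as the IssueCollector path above
isi statistics heat | grep IssueCollector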
Latency Metrics
Latency in
millisecond
s
• Good:
<10 ms
• Normal:
Poor latency can impact data * applies to all nodes, all protocols, all
access and response times operations 10 ms -
20 ms
• Bad: 20
ms +
• Investig
ate: 50
ms +
The graphic shows the command output used to identify which class of operation is taking the longest.
And finally, client performance. After determining that only a handful of clients consume most of the IOPS, which is typical, focus on what exactly they are doing.
The output shows the top 20 users and their external protocols.
Use the isi statistics client command to determine whether any users are dominant or out of balance with other users of the same protocol. The isi statistics client output indicates whether a single user or group of users is dominating the cluster resources. The command outputs only the Ops (1), Proto (6), and UserName (8) columns. The Ops rate for each user can identify out-of-balance usage. This baseline can also be used to identify a user who has an abnormal use of the workflow. To determine the difference between a busy user and a non-busy user, increase the command output to the point where the user operation counts begin decreasing. PowerScale stores 1024 username records for each 15-second window. UNKNOWN usernames are not included in the 1024 records.
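A minimal sketch of sampling client activity from the CLI follows. Flags for sorting and limiting output differ between OneFS releases, so only the bare command from the text plus standard shell tools are shown, and the smb2 filter is purely illustrative.

# Show which clients and users are generating the most protocol operations
isi statistics client | head -25

# Focus on a single protocol when comparing users of that protocol against each other
isi statistics client | grep -i smb2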
Normal vs Abnormal
Listed are guidelines to use when faced with monitored data and trying to make
sense of it in the context of monitoring and alerting configurations:
• Understand the data's scale. What is the time frame? Does the metric reach zero? What is the maximum value? Is it dimensionless or are there units of measurement?
• What is the typical running level of the data? Does the running level reflect the normal level of activity?
• What is the absolute peak reached? Depending on the dataset, the lowest trough may also be a concern. Make a note of the troughs.
• What is the context around the peak or trough? Is it often approached? How closely is it approached? Do peaks or troughs cluster near each other? Do the peaks or troughs relate to a known workflow? Is there a run up to a peak or is it a sudden spike?
• Are the figures near the ultimate limits of the system as it is configured? Do these numbers require immediate action, or are they read in a report at some point?
These points of verification should tell administrators which levels would create an alert that requires their attention. By understanding how the levels relate to the workflows, administrators are positioned to determine abnormal levels and whether the cluster requires more resources.
The graph covers one week of total cluster throughput on a 0-10 GB/s scale and is annotated with the maximum reached, the second-highest peak, and the typical level.
The graph shows a maximum of under 9 GB/s total throughput for an entire cluster.
9 GB/s is well within the limits of operational performance, and should cause no
anxiety whatsoever. On the other hand, if this is the baseline, it may be prudent to
set an alert in case the number goes over 10 or 12 GB/s. The alert signals
abnormal activity levels.
Latency is the processing delay in network data. Average latency is not the same as network latency. Some latency is expected. Refer to the baseline latency metrics when addressing latency issues.
After acquiring a packet capture on the two storage solutions, one way to review the overall latency is by looking at the round-trip time (RTT).
• Round-trip time543
• Measuring with Wireshark544
• Unhealthy network545
542Protocol latency is added latency due to the speed of the protocol operations. If there is added latency at the protocol level, isolate it to individual protocol operations, such as READDIRPLUS for NFS.
543The round-trip time only measures the time that is taken for sending a packet
and for receiving the acknowledgment. Thus, round-trip time does not differentiate
network delays from computational delays.
544Wireshark can measure round-trip time by navigating to Statistics > TCP Stream Graph > Round-Trip Time Graph. Any added latency at the network level compounds into added latency at the protocol level.
545If the network is not healthy or the storage solution is unable to respond in a
timely fashion, the end-user experience suffers.
Overloaded Cluster
546 The graph is not consistent and contains a great number of outliers. The outliers indicate performance issues on the storage solution, the cluster in this case. The cluster is overloaded to such a degree that it is unable to reply to all calls in a timely fashion. As a rule of thumb, a 0.25 second response is an inconvenience for customers, whereas a 0.5 second response becomes a problem. It takes a uniform and highly optimized workload to produce no outliers whatsoever.
The graphic shows an example of what to expect when a cluster starts to get too much load for
consistent good performance.
• Here Hayden gets a normal Wireshark RTT graph547, painting a much cleaner
picture.
• Some outliers generally occur even in healthy environments because some
protocol operations are naturally more time consuming than others.
547There are fewer outliers, and they are spread around the chart. Most of the results are contained at the bottom of the graph in a specific area. The results indicate that the requests were responded to in an appropriate time period. The results that are shown are generally under 25 milliseconds. The graph depicts a cluster with plenty of resources to support its workflows.
The cluster can respond to other network calls in a timely fashion. There is no
added latency due to the network layer in this packet capture. Thus, we can
summarize that the application was resource hungry to the point that it affected all other applications. It saw available resources and consumed as much as it could without regard to other workflows.
• After eliminating the network as a source of latency via the RTT chart, the next step is to determine if the protocol latency is acceptable548.
• The network packets may have acceptable RTT, but if the new storage solution adds 200 milliseconds of latency to protocol responses, the clients feel the effects.
• The easiest way to review protocol latency is by using the Service Response Time (SRT) analytics of Wireshark.
• A packet capture should be taken on both solutions, the current storage and the cluster, and their SRT values compared.
• To view the SRT times, Hayden can go to Wireshark > Statistics > Service Response Time > <select the appropriate protocol>.
• Examples:
• Wireshark > Statistics > Service Response Time > ONC-RPC > NFS >
NFSv3
• Wireshark > Statistics > Service Response Time > SMB
• Wireshark > Statistics > Service Response Time > SMB2
Preworkflow
Before the NFS workflow is migrated to the cluster, Hayden ensures that the cluster
SRT values are similar or better when compared to the previous storage.
Postworkflow
• READDIRPLUS calls549
• Metadata reads550
• Performance of the client or application551
• Latency552
549 Here the READDIRPLUS calls are taking 0.126 seconds on average to return from client to cluster and back to client. This is an extreme delay, especially when considering that READDIRPLUS is a type of metadata call that is expected to be fast. Based on the SRT times, Hayden can infer that this workflow would highly benefit from metadata acceleration and a larger L1 cache on the nodes.
550 If the application requires fast metadata reads, Hayden can view the current live performance of the protocol response times for comparison. Viewing is done using the isi statistics protocol command. Using isi statistics protocol, Hayden views live statistics on protocol operations and compares them to what is needed. For example, Hayden consistently sees NFS READDIRPLUS response times of less than 30 ms on the cluster. The application works properly with 50 ms on the existing solution. With these metrics, Hayden knows that the application's response time needs will be met after migrating.
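A minimal sketch of checking live protocol response times from the CLI follows; column names and protocol filter flags vary by OneFS release, so only the bare command named above plus standard shell filtering is used, and the READDIRPLUS filter mirrors the example in this scenario.

# View live per-operation protocol statistics, including average response times
isi statistics protocol | head -25

# Check a specific metadata operation, for example NFS READDIRPLUS
isi statistics protocol | grep -i readdirplus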
The graphic calls out that a READDIRPLUS average SRT of 0.126 seconds indicates an extreme delay, and that this workflow would highly benefit from metadata acceleration and a larger L1 cache on the nodes.
The average response time is what dictates the overall performance of the client to
storage communication stream. In the example, the highest SRT is 0.015 seconds,
or 15 milliseconds. For most operations, we want it to be less than 50 milliseconds
for the best experience possible. SRT numbers break out by the type of communication with the cluster, not merely by port or protocol. The numbers allow for differentiation of the types of activities and how they affect the general performance profile.
552If the latency on the cluster continues to match or better the previous storage solution, Hayden knows that the application should work as expected. If not, there are tuning options to increase the performance of the application. For example, enabling metadata-write acceleration or pointing the workflow to A100 accelerator nodes are some of the options available.
Using the end-to-end model, the namespace, RTT spike, and long ls return
symptoms can indicate a problem in the network or storage. Disk time in queue
points to a probable storage issue. This case study uses InsightIQ to get a clear
picture of the problem.
The first step is to identify the busy times. The InsightIQ output shows that the environment is busy during a typical workday, about 8:00 am to 5:00 pm. This helps isolate the timeframe in which performance issues are seen.
Namespace reads are about 30 percent of the work on this cluster. It makes sense that any bottleneck or constraint affects such a dominant operation.
Client Distribution
• Are heavy hitters expected? If so, are they showing unusual load?
• If not, verify that the clients are doing what they are supposed to be doing and adjust resources as necessary.
Note: This page is for illustration – the example case study has a
fairly even client distribution.
Load by Node
Look for nodes that might be a bottleneck. PowerScale applies more cache and
CPU if access is distributed.
Isolating Nodes
The graphic shows the Protocol Operations rate. Three of the hardest hit nodes for
namespace_read are also among the busiest nodes.
CPU Utilization
• CPU was not an issue, so there is no need for more nodes. namespace_read was the dominating operation.
• Namespace reads can be dramatically accelerated using SSD for namespace
acceleration.
Note: Typically, CPU utilization should not run > 80% for sustained
periods, short spikes are ok.
Considerations
• Model workflows to understand the process used to read and write data to the
cluster.
• Establish baseline metrics or good picture of the model.
• When performing a workflow or workload analysis, the areas to look at are
storage metrics, network metrics, and client metrics.
• Monitor network and protocol latency before and after configuration changes or
added workflows.
Challenge
Lab Task:
1) Establish cluster baseline.
2) Simulate increased workload for the cluster.
Network Stack
Physical Connection
Check the bottom layer of the ISO 7498 (OSI) stack: is it physically healthy? Ensure that network plugs are fully inserted and not loose or askew.
It may seem like a trivial concern, but improperly seated or damaged physical
network interfaces and cables can result in problems ranging from a complete loss
of function, to intermittent reductions in performance. It is cheap and practical to
start by ensuring that the basic hardware is in good shape.
Transport
Transport-level checks include media errors, simplex/duplex settings, basic configurations, line speeds, and frame sizes.
• The next level above raw hardware is the transport protocol level553.
• Checking all these factors lies in the realm of network administration, and it should be a collaborative effort.
• Network administrators also have access to more detailed, low-level logs and
diagnostic tools554.
Protocol
IP
555Administrators can use packet captures to look for network problems such as
fragmented packets and symptoms of congestion.
556There are logging and reference tools that can help to identify a long list of
issues - Wireshark being one of the most useful tools.
• Assuming that the lower-level network systems are in good shape, most of the problems at this level reflect either a misconfiguration or a performance overload. Example557.
• On the other hand, retransmissions tend to happen as a result of timeouts558.
Name Resolution
The systems that are shown in the table are ways that names and addresses are translated to each other. Some, such as DNS, can get sophisticated. Some, such as host files, are trivial. Check as many systems in the chain as you can identify, so that you are sure that hostname resolution is really happening the way that you intend.
558 Timeouts in turn generally result from oversubscribed facilities. Sometimes the right answer is to buy more or faster hardware; sometimes it is user education. It does happen, and one user who does this may actually persuade other users to follow suit, resulting in a whole group of users whose storage operations are not being load balanced at all.
Routing
Another facet of networking that takes a lot of work is routing. This is addressed
later in a bit more detail, but briefly there are three topics559 in routing that deserve
our attention. Whenever dealing with a routing issue, the key command for examining the routing table is netstat -r.
However, routing is never a single device issue. There are routers, switches,
firewalls and virtual networks in practically any modern enterprise, whether
commercial or otherwise, and so it would be beneficial to engage the network
administrator, who is best placed to help navigate the network configuration of the
environment.
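As a small sketch of the starting point for any routing investigation, the routing table can be examined from a node's shell; the -n flag simply avoids reverse DNS lookups, which is useful when name resolution itself is suspect.

# Display the routing table with numeric addresses
netstat -rn

# Compare the default gateway and any static routes against the intended network design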
Firewalls
Firewalls are a peculiar hybrid of traffic router and traffic blocker. Their mission is to
increase information security, and they do this by letting approved forms of traffic
pass while blocking other forms of traffic.
Most firewalls also route between multiple network segments, including at least one
so-called demilitarized zone (DMZ), which is a network segment that is separated
from the internal network, but still is protected from the Internet at large. This
means that a misconfigured firewall can introduce, not merely traffic access issues,
but routing issues as well. In general, firewalls have limited application to storage
administrators, but if a storage system is in its own DMZ then every interface is
behind a firewall, and every change in the workflow will prompt a reexamination of
the firewall.
Network Troubleshooting
Basic network troubleshooting is fairly simple, conceptually. Example560. To understand more complex troubleshooting scenarios, look for more subtle signs of trouble. Example561. To see these problems clearly, perform an analysis on actual network transmissions; a good tool for doing that is Wireshark.
Signs of trouble
560Example - Data either flows, or it does not flow. Hostnames either resolve, or do
not. Packets are either routed or dropped. This is the easy part.
Setting - Description
TTL Lower Limit For Server Failures - Specifies the lower boundary on time-to-live for DNS server failures. The default value is 300 seconds.
TTL Upper Limit For Server Failures - Specifies the upper boundary on time-to-live for DNS server failures. The default value is 3600 seconds.
Test Ping Delta - Specifies the delta for checking the cbind cluster health. The default value is 30 seconds.
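These DNS cache values can also be inspected and tuned from the CLI. The sketch below assumes the OneFS 8.x isi network dnscache command and option names that correspond to the settings in the table; verify the exact option names on your release before modifying anything.

# View the current DNS cache settings, including the server-failure TTL bounds
isi network dnscache view

# Adjust the server-failure TTL bounds and the test ping delta (values shown are the listed defaults)
isi network dnscache modify --ttl-min-servfail=300 --ttl-max-servfail=3600 --testping-delta=30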
• Inode (LIN) scan - The most straightforward access method is via metadata,
using a Logical Inode (LIN) Scan. In addition to being simple to access in
parallel, LINs also provide a useful way of accurately determining the amount of
work required.
• Tree walk - A directory tree walk is the traditional access method since it works
similarly to common UNIX utilities, such as find - albeit in a far more distributed
way. For parallel execution, the various job tasks are each assigned a separate
subdirectory tree. Unlike LIN scans, tree walks may prove to be heavily
unbalanced, due to varying sub-directory depths and file counts.
• Drive scan - Disk drives provide excellent linear read access, so a drive scan
can deliver orders of magnitude better performance than a directory tree walk or
LIN scan for jobs that don’t require insight into file system structure. As such,
drive scans are ideal for jobs like MediaScan, which linearly traverses each
node’s disks looking for bad disk sectors.
• Changelist - Some Job Engine jobs utilize a changelist, rather than LIN-based
scanning. The changelist approach analyzes two snapshots to find the LINs
which changed (delta) between the snapshots, and then dives in to determine
the exact changes.
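For jobs that use the changelist approach, the changelists themselves can be inspected from a node's shell. The sketch below uses the isi_changelist_mod utility; the changelist name shown is purely illustrative, since real names are derived from the IDs of the snapshots being compared.

# List the changelists that currently exist on the cluster
isi_changelist_mod -l

# Display the entries (the per-LIN deltas) of a specific changelist (name is illustrative)
isi_changelist_mod -a 2_4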
If the flag is true, the job is a space saving job. The jobs with the flag set by default
are:
• MultiScan
• AutoBalance
• Collect
• AutoBalanceLin
• ShadowStoreDelete
• SnapshotDelete
• TreeDelete
The fundamental responsibility of the system maintenance jobs is to ensure that the data on /ifs is balanced, properly protected, and free of inconsistencies:
• AutoBalance - The goal of the AutoBalance job is to ensure that each node has
the same amount of data on it, in order to balance data evenly across the
cluster. AutoBalance, along with the Collect job, is run after any cluster group
change, unless there are any storage nodes in a “down” state. Upon visiting
each file, AutoBalance performs the following two operations:
− File level rebalancing - evenly spreads data across the cluster nodes in
order to achieve balance within a particular file.
− Full array rebalancing - moves data between nodes to achieve an overall
cluster balance within a 5% delta across nodes.
• AutoBalanceLin - There is also an AutoBalanceLin job available, which is
automatically run in place of AutoBalance when the cluster has a metadata copy
available on SSD. AutoBalanceLin provides an expedited job runtime.
• Collect - The Collect job is responsible for locating unused inodes and data
blocks across the file system. Collect runs by default after a cluster group
change, with AutoBalance, as part of the MultiScan job. In its first phase, Collect
performs a marking job, scanning all the inodes (LINs) and identifying their
associated blocks. Collect marks all the blocks which are currently allocated and
in use, and any unmarked blocks are identified as candidates to be freed for
reuse, so that the disk space they occupy can be reclaimed and reallocated. All
metadata must be read in this phase in order to mark every reference, and must
be done completely, to avoid sweeping in-use blocks and introducing allocation
corruption. Collect’s second phase scans all the cluster’s drives and performs
the freeing up, or sweeping, of any unmarked blocks so that they can be
reused.
• MultiScan - The MultiScan job, which combines the functionality of
AutoBalance and Collect, is automatically run after a group change which adds
a device to the cluster. AutoBalance(Lin) and Collect are only run manually if
MultiScan has been disabled. Multiscan is started when:
− Data is unbalanced within one or more disk pools, which triggers MultiScan
to start the AutoBalance phase only.
− When drives have been unavailable for long enough to warrant a Collect job,
which triggers MultiScan to start both its AutoBalance and Collect phases.
• FlexProtect - responsible for maintaining the appropriate protection level of
data across the cluster. For example, it ensures that a file which is supposed to
be protected at 2x, is protected at that level. Run automatically after a drive or
node removal or failure, FlexProtect locates any unprotected files on the cluster,
and repairs them as quickly as possible. The FlexProtect job includes the
following distinct phases:
− Drive Scan: FlexProtect scans the cluster’s drives, looking for files and
inodes in need of repair. When one is found, the job opens the LIN and
repairs it and the corresponding data blocks using the restripe process.
− LIN Verification: Once the drive scan is complete, the LIN verification phase
scans the inode (LIN) tree and verifies, reverifies and resolves any
outstanding reprotection tasks.
− Device Removal: In this final phase, FlexProtect removes the successfully
repaired drives or nodes from the cluster.
In OneFS 8.2 and later, FlexProtect does not pause when there is only one
temporarily unavailable device in a disk pool, when a device is smartfailed, or
for dead devices.
• FlexProtectLin - is run by default when there is a copy of file system metadata
available on SSD storage. FlexProtectLin typically offers significant runtime
improvements over its conventional disk-based counterpart.
• IntegrityScan - The IntegrityScan job is responsible for examining the entire
live file system for inconsistencies. It does this by systematically reading every
block and verifying its associated checksum. Unlike traditional ‘fsck’ style file
system integrity checking tools, IntegrityScan is designed to run while the
cluster is fully operational, thereby removing the need for any downtime. When
IntegrityScan detects a checksum mismatch, it generates an alert, logs the
error to the IDI logs and provides a full report upon job completion. IntegrityScan
is typically run manually if the integrity of the file system is ever in doubt.
Although the job itself may take several days or more to complete, the file
system is online and available during this time. Also, like all phases of the
OneFS job engine, IntegrityScan can be prioritized, paused, or stopped,
depending on the impact to cluster operations.
• MediaScan - The role of MediaScan within the file system protection framework
is to periodically check for and resolve drive bit errors across the cluster. This
proactive data integrity approach helps guard against a phenomenon known as
‘bit rot’, and the resulting specter of hardware induced silent data corruption.
MediaScan is run as a low-impact, low-priority background process, based on a
predefined schedule (monthly, by default). First, MediaScan’s search and repair
phase checks the disk sectors across all the drives in a cluster and, where
necessary, uses OneFS’ dynamic sector repair (DSR) process to resolve any
ECC sector errors that it encounters. For any ECC errors which cannot
immediately be repaired, MediaScan will first try to read the disk sector again
several times in the hopes that the issue is transient, and the drive can recover.
Failing that, MediaScan attempts to restripe files away from irreparable ECCs.
Finally, the MediaScan summary phase generates a report of the ECC errors
found and corrected.
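To see which of these maintenance jobs are running or have run recently, the Job Engine can be queried from the CLI. A minimal sketch follows, assuming OneFS 8.x command syntax.

# Show the overall Job Engine status, including running and paused jobs
isi job status

# List active jobs and review recent job events
isi job jobs list
isi job events list | head -20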
The graphic shows mirrored small files that are packed into the shadow store with a more efficient
protection.
Small files are packed into larger containers, or shadow stores. Shadow stores are parity protected using erasure coding, and typically provide storage efficiency of 80% or greater.
• Infrequently modified, archive-type datasets: Use SFSE to archive static
small file workloads, or workloads with only moderate overwrites and deletes.
• Free space when unpacking: Ensure that the cluster has sufficient free space
before unpacking any containerized data.
Quotas Types
A SmartQuota can have one of four enforcement types:
Enforcement States
There are three SmartQuotas enforcement states:
• If a domain has an accounting only quota, enforcements for the domain are not
applied.
• Any administrator action may push a domain over quota. Examples include
changing protection, taking a snapshot, removing a snapshot, etc. The
administrator may write into any domain without obeying enforcements.
• Any system action may push a domain over quota, including repairs. OneFS maintenance processes are as powerful as the administrator.
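As a sketch of how an enforcement quota differs from an accounting-only quota on the CLI, the commands below create a directory quota with a hard threshold and one without thresholds, then list the configured quotas. The paths and threshold value are illustrative, and option names should be verified against your OneFS release.

# Create an enforcement quota with a hard limit on a directory (path and size are illustrative)
isi quota quotas create /ifs/core/data/media directory --hard-threshold=1T

# Create an accounting-only quota by omitting thresholds, then review all quotas
isi quota quotas create /ifs/core/data/archive directory
isi quota quotas list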
Configuration data such as SMB shares, NFS exports, and quota settings are not
replicated with SyncIQ. In a failover, the cluster configuration information must be
configured manually on the remote cluster.
562 Professional services can create and install a script that ensures configuration
data on the source cluster is maintained on the target cluster. Without the
professional services script, the best practice is to make configuration changes on
both clusters simultaneously. Use the exact same names for SMB shares and
same aliases for NFS exports. The same naming allows users to connect
seamlessly to the same shares or exports on their system during a failover. Quotas
should be managed on both clusters simultaneously. Best practices would have
quotas on both clusters so there are no potential over quota situations when the
failback occurs.
• Customer created563
• Another option is to use a third-party solution such as Superna Eyeglass.
563
Customers can create custom scripts using ZSH, Bash, or the Platform API. The
downside is the level of complication and limited support.
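As a sketch of the Platform API approach mentioned above, configuration such as SMB shares can be read programmatically and compared between the source and target clusters. The cluster names and credentials are illustrative, and the API version number in the URL may differ by OneFS release.

# List SMB share configuration on a cluster through the Platform API
curl -k -u admin https://source-cluster.example.com:8080/platform/1/protocols/smb/shares

# Run the same query against the target cluster and compare the results to spot configuration drift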
1: Preliminaries:
2: Meeting Planning:
Meet after the first scan completes. Could be same day as the install, or, for large
clusters, a day or two later. Meeting must be in front of a DataIQ client browser,
preferably with a large screen. If there are multiple data managers, consider
meeting with one initially. The results of this first meeting help guide the others in
their discussions with you.
3: Rules Guidelines:
If you are experienced with auto-tagging, you can write the rules during the
meeting, in the DataIQ auto-tagging configuration file. Rules can be tested in real
time, and then applied to the entire file system in minutes. Auto-tagging is
reversible. If the results do not meet expectations, tune the rules and retest. The
old tags are automatically removed. Alternately, you can write rules offline, test
against the path examples, and then apply in a follow-up meeting with the
customer.
4: Rules Investigation:
Determine the key customer file system structures. Key structures follow business
rules and represent value to the business. Make a note of file systems where these
rules are followed, the depth of rules, and the exceptions to the rules. For example,
an object at a depth of eight levels is likely a copy and as such, should not be in the
rule. Get path examples and make notes of how the path applies. Use the DataIQ
flagging feature to aggregate paths that need attention.
Ask about key file system policies such as naming conventions. Ask about common
file system errors or violations such as obsolete naming conventions, common
typos, and so on. Be on the lookout for junk names like "old backup," "delete me,"
and "landfill," that could represent unused data. You can create a tag to identify the
junk data.
After the rules are written, familiarize the managers with the DataIQ Analyze functionality. Guide them through an analysis to emphasize the simplicity of generating the reports that they want. They should know how to act on the results of an analysis and recognize when additional refinement is needed.
Customers frequently pick up on the rules patterns and write their own. Regardless,
a scheduled routine to check or update the rules is a good opportunity to ensure
that reports continue to meet the business needs. Also, check-ups provide the
opportunity to engage with the customer and better understand their pain points
and future needs.
Multi-factor Authentication
Definition
Multi-factor authentication (MFA) is a method of computer access control in which the user is only granted access after
successfully presenting several separate pieces of evidence to an authentication mechanism.
564
Increasing the security of privileged account access (for example,
administrators) to a cluster is the best way to prevent unauthorized access.
The Duo security platform handles MFA support for SSH with PowerScale.
The Duo service offers flexibility by including support for the Duo App565, SMS566,
voice567, bypass codes568 and USB keys569.
565
Approve login requests via smartphone and smartwatch using the Duo Mobile
app.
567 Receive a call via cell phone, landline, or car phone to quickly authenticate.
• Support for MFA with the Duo Service in conjunction with passwords, public
keys, or both.
• The public keys for users are stored in the LDAP server.
• SSH is configured by using the OneFS CLI.
570 Duo can be disabled and re-enabled without reentering the host, ikey, and skey.
• Specific users or groups can bypass571 MFA if specified on the Duo server.
• Duo uses a simple name match572 and is not AD aware.
• Duo has two failmodes that specify what to do if the Duo service is unavailable: Safe573 and Secure574.
Click the steps below to know more about the process of SSH Multi-factor
authentication with Duo.
571A bypass key does not work if auto push is set to true as no prompt option is
shown to the user.
572 The AD user 'DOMAIN\john' and the LDAP user 'john' are the same user to Duo.
573 In safe mode SSH will allow normal authentication if Duo cannot be reached.
574In secure mode SSH will fail if Duo cannot be reached. This includes ‘bypass’
users, since the bypass state is determined by the Duo service.
1:
• Configure on Duo.
• Go to Dashboard > Application > Protect an Application.
• PowerScale cluster is represented as a UNIX application.
• Three components that are generated are: Integration Key, Secret Key, and API
Hostname.
2:
3:
Specify group option for use with the Duo service or for exclusion from the Duo
service. One or more groups can be associated. You can configure three types of
groups.
• Local groups (local authentication provider)
• Remote authentication provider groups (for example LDAP) - can add users
without a Duo account to the group.
• Duo groups that are created and managed through the Duo Service.
• Add users to the group and specify as Bypass - users of this group can SSH
in without MFA.
• The Duo service must be contacted to determine if the user is in the bypass
group or not.
Administrators can create a local or remote provider group as an exclusion group
using the CLI575. Users in this group are not prompted for a Duo key.
575Zsh may require escaping the '!'. If using such an exclusion group, precede it with an asterisk to ensure that all other groups require the Duo One Time Key (--groups="*!").
SSH has CLI support to view and configure exposed settings using the isi ssh
settings view and isi ssh settings modify commands.
576This option ensures that the correct sets of settings are placed in the required
configuration files. The settings are password, public key, both or any.
577 Match blocks usually span multiple lines. If the option starts with --match=", it allows line returns and spaces until reaching the end quote (").
The isi auth duo modify command is used to configure the Duo provider
settings.
579 The upgrade includes settings that are exposed and not exposed by the CLI.
The isi auth duo view command is used to view the configured settings.
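A minimal sketch of the CLI configuration flow described in this section follows. The option names shown for the authentication template, Duo host, autopush, and failmode are assumptions based on the settings discussed above and may vary by OneFS release, so review isi ssh settings view and isi auth duo view on your cluster first.

# Review the current SSH authentication settings
isi ssh settings view

# Require password authentication for SSH (template value is an assumption)
isi ssh settings modify --auth-settings-template=password

# Point OneFS at the Duo application and define its behavior when Duo is unreachable
isi auth duo modify --host=api-xxxxxxxx.duosecurity.com --autopush=true --failmode=safe
isi auth duo view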
1: The administrator sets the user authentication method to either public key,
password or both using the isi ssh settings modify command.
2:
• When the user authentication method is set to public key or both, the private key of the user is provided at the start of the session. This is verified against the user's public key (from the home directory or LDAP).
• When the user authentication method is set to password or both, the SSH
server requests the user’s password, which is sent to PAM and verified against
the password file or LSASS.
• If autopush is set to yes, a one-time key is sent to the user on the configured device.
• If autopush is set to no, the user chooses from the list of devices that are linked to the account, and a one-time key is sent to that device.
• The user enters the key at the prompt, and the key is sent to Duo to verify that it is correct.
5: If all of the above steps succeed, the user is granted SSH access.
OneFS 8.2 and later enables the use of public SSH keys from LDAP rather than from a user's home directory on the cluster.
Antivirus
Antivirus Overview
OneFS allows:
• File system scanning for viruses, malware, and other security threats on a
PowerScale cluster
• Integration with third-party scanning services through the Internet Content
Adaptation Protocol (ICAP)
• Sending files through ICAP to a server running third-party antivirus scanning
software
These ICAP servers run the antivirus software and files are scanned for threats on
ICAP servers, not the cluster itself.
Antivirus Process
When an ICAP server scans a file, it informs OneFS of whether the file is a threat. If
a threat is detected:
Before OneFS sends a file to be scanned, it ensures that the scan is not redundant.
Scanned files, unmodified from a previous scan, are sent for rescanning only if the
virus database on the ICAP server was updated since the last scan.
2: OneFS verifies the file scan requirements, for example, whether the file was modified since the last scan or the antivirus definition file was recently updated. In such cases, the file is placed on the scan queue.
3: The requested file is assigned a worker thread, which sends it to an ICAP server.
If the requested file is excluded from scanning by path or glob filter, it is skipped.
Scan Types
On-Access Scanning
You can configure OneFS to send files for scanning581 before they are opened, after they are closed, or both.
Using the OneFS Job Engine, you can create antivirus scanning policies that send the files from a specified directory to be scanned.
581
Scanning can be done through file access protocols such as SMB, NFS, and
SSH.
• Antivirus policies can run manually at any time, or be configured to run according to a schedule.
• Antivirus policies target a specific directory on the cluster.
Individual file scanning sends a specific file to an ICAP server for scanning at any
time.
• If a virus is detected in a file but the ICAP server is unable to repair it, OneFS
can send the file to ICAP server582.
• To perform an individual file scan, run the following CLI command: isi
antivirus scan <file path>
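For example, a single suspect file can be submitted for an on-demand scan directly from the CLI; the file path shown is illustrative.

# Send one file to an ICAP server for scanning
isi antivirus scan /ifs/core/data/media/project1/video.mp4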
On-Access Scanning: If configuring OneFS to scan files before they are opened, also configure OneFS to scan files after they are closed. Scanning files as they are both opened and closed does not necessarily improve security, but it usually improves data availability. When a file is scanned after it is closed and a user later wants to access that file, it does not need to be scanned again unless the antivirus server database has been updated. This means that, most of the time, a scan before open is not needed if the file is accessed multiple times.
Antivirus Policy Scanning: OneFS can prevent an antivirus policy from sending
certain files for scanning within the specified root directory based on the size,
name, or extension of the file.
582This file is sent after the virus database has been updated. The ICAP server
might then be able to repair the file. Scanning files individually tests the connection
between the cluster and ICAP servers.
Antivirus software can scan and quarantine the Write-Once, Read-Many (WORM)
files, but cannot repair or delete WORM files until their retention period expires.
ICAP Servers
The number of ICAP servers that are required to support a PowerScale cluster
depends on how you configure583 virus scanning.
583The amount of data a cluster processes, and the processing power of the ICAP
servers.
If the CPU utilization of the ICAP servers is over 95%, it is recommended to add more CPU to the ICAP servers or add more ICAP servers to the OneFS antivirus solution.
OneFS provides two metrics to indicate whether the ICAP service can sustain the
workload584.
• too_busy status
• fail to scan ratio
Caution: When files are sent from the cluster to an ICAP server, they
are sent across the network in cleartext. Ensure that the path from the
cluster to the ICAP server is on a trusted network. Authentication is
not supported. If authentication is required between an ICAP client
and ICAP server, hop-by-hop Proxy Authentication must be used.
584If either of these errors occur, it usually indicates that there are not enough ICAP
servers to catch up with the speed of the workload. Add more ICAP servers until
the workload is manageable.
• If you scan files on-access, it is recommended to have at least one ICAP server for each node in the cluster.
• If you configure more than one ICAP server for a cluster, ensure that the processing power of each ICAP server is relatively equal. OneFS distributes files to the ICAP servers on a rotating basis, regardless of the processing power of the ICAP servers.
too_busy status: PowerScale internally keeps a list of the status of ICAP servers
that are connected to isi_avscan_d. All state information of isi_avscan_d
including the status for the ICAP servers is recorded in the file
/var/log/isi_avscan_d.log.
If the too_busy state is set to true, this state means that an ICAP server is busy and
not able to respond with the expected reply. The too_busy state indicates that there
are not enough ICAP servers for the workload. Add more ICAP servers until the
too_busy state is false for all ICAP servers.
Failed to scan ratio = (Failed numbers / Scanned numbers) × 100%
A high failed-to-scan ratio can occur for various reasons, such as an ICAP socket timeout or poor network conditions. A higher ratio means that there are not enough ICAP servers to keep up with the speed of the workload, especially when using scan on close. In this case, add more ICAP servers and check whether the fail-to-scan ratio is reduced.
Performance Factors
The performance of ICAP servers and their performance impact on the PowerScale cluster depend on many factors. Click the factors to learn more.
The OneFS policy scan does not work with a cluster node that is not on the network (NANON). A node not connected to the network cannot connect to the ICAP server, which causes the antivirus job engine to fail. See the KB article for more information.
Most antivirus vendors update their virus definition file at least once a day to
eliminate the potential threats of a new virus.
• Through a scheduled job or a manual update, the updated virus definition files
are pushed to all the ICAP servers.
• At the same time, the ICAP service tag (ISTag) is updated at the ICAP server
level.
• OneFS maintains a timer job to synchronize the ISTags from ICAP servers
every hour.
• The timer job can result in a maximum wait of one hour for ISTags to be
updated on the cluster.
Dell EMC recommends setting an interval for updating the virus definition file for ICAP servers that aligns with your scan policy. This alignment avoids unnecessary scans that could negatively impact overall performance. For the details of the ISTag, refer to IETF RFC 3507.
File Size
File size is a key performance factor for ICAP servers integrated with PowerScale.
• Less than 1 MB file size: This typically results in a very steady and gentle trend
of the value for the scanned files per second.
• Greater than 1MB file size: This typically results in the value of the scanned files
per second decreasing quickly.
Note: Scanned files per second could vary depending on the PowerScale node
type, node number, ICAP server vendors, ICAP server number, network bandwidth,
and other factors. This number should remain steady for small files less than 1 MB.
Network Bandwidth
For optimal performance, Dell EMC recommends the following best practices for
the network bandwidth of ICAP servers, depending on the average file size.
The number of ICAP server threads is one of the most important configurations
regarding the ICAP server. Vendors have different recommendations for numbers
for threads, and within the same vendor there can be different versions with
different thread recommendations.
• McAfee: 50 to 100
• Symantec: ~20
The CPU utilization on the PowerScale node can be high when there is a large
number of files in a directory to be scanned. At the same time, the overall scanning
performance is degraded.
For a detailed explanation, refer to the KB article isi_avscan_d process utilizes a lot
of processor when accessing directories with a high number of files.
Antivirus Administration
To go to the antivirus options in the WebUI, click the Data protection tab and then click Antivirus.
1 2 3 4 5
1: Antivirus policies are created that cause specific files to be scanned for viruses
each time the policy is run. Users can modify and delete antivirus policies. Antivirus
policies can be temporarily disabled should users want to retain the policy but do
not want to scan the files.
Multiple files can be scanned for viruses by manually running an antivirus policy, or an individual file can be scanned without an antivirus policy. These scans can also be stopped.
2: Antivirus reports can be viewed through the web administration interface. Events
that are related to anti-virus activity are also viewable.
4: Before the user can send files to be scanned on an ICAP server, they must
configure OneFS to connect to the server. The user can test, modify, and remove an ICAP server connection. The user can temporarily disconnect and reconnect to an ICAP server.
A user can add and connect to an ICAP server. After a server is added, OneFS can
send files to the server for virus scanning. If the user prevents OneFS from sending
files to an ICAP server, yet wants to retain the ICAP server connection settings, the
user can temporarily disconnect from the ICAP server.
5: A user can configure global antivirus settings that are applied to all antivirus scans by default on the Settings tab.
Before you can send files to be scanned on an ICAP server, you must configure
OneFS to connect to the server. You can test, modify, and remove an ICAP server
connection. You can temporarily disconnect and reconnect to an ICAP server.
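The same ICAP server connection tasks can be handled from the CLI. The sketch below assumes the isi antivirus servers command set; the ICAP URL is illustrative, and option names should be confirmed on your OneFS release.

# Register an ICAP server so OneFS can send files to it for scanning (URL is illustrative)
isi antivirus servers create icap://icap1.example.com --enabled=true

# Review the configured ICAP servers and their state
isi antivirus servers list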
Users can create an antivirus policy that causes specific files to be scanned for
viruses each time the policy is run.
Click to create an
antivirus policy
OneFS generates reports about antivirus scans. Each time that an antivirus policy
is run, OneFS generates a report for that policy. OneFS also generates a report
every 24 hours that includes all on-access scans that occurred during the day.
You can configure OneFS to alert, repair, quarantine, or truncate any file in which the ICAP server detects a virus.
Tip: For more information about antivirus threat responses, see the
OneFS Web Administration Guide.
Users can configure global antivirus settings that are applied to all antivirus scans
by default.
Example 1
To view the status of the recent scans, enter the following command: isi
antivirus reports scans list
Example 2
For more details about a scan, use the following command: isi antivirus
reports scans view <id>
Other Commands
To discover its scope of work, one class of Job Engine jobs utilizes a 'changelist' rather than a full LIN-based scan.
• The changelist approach analyzes two snapshots to find the LINs which
changed (delta) between the snapshots, and from there determines the exact
changes.
• SyncIQ replication and the File System Analyze (FSAnalyze) cluster analytics are good examples of jobs that leverage snapshot deltas and the ChangelistCreate mechanism.
• The FilePolicy and FSAnalyze jobs in OneFS 8.2 and later automatically share
the same snapshots and index, created and managed by the IndexUpdate job.
− The new index stores considerably more file and snapshot attributes than
the old FSA index. Until the IndexUpdate job effects this change, FSA keeps
running on the old index and snapshots.
OneFS uses the SmartPools jobs to apply its file pool policies. To accomplish this,
the SmartPools job visits every file, and the SmartPoolsTree job visits a tree of
files. However, the scanning portion of these jobs can result in significant random
impact to the cluster and lengthy execution times, particularly in the case of the SmartPools job.
• To address this, the FilePolicy job, included with OneFS 8.2, and later provides
a faster, lower impact method for applying file pool policies than the full-blown
SmartPools job.
• In conjunction with the IndexUpdate job, FilePolicy improves job scan
performance by using a file system index or changelist, to find files needing
policy changes, rather than a full tree scan.
• This dramatically decreases the amount of locking and metadata scanning work
the job is required to perform, reducing impact on CPU and disk - albeit at the
expense of not doing everything that SmartPools does.
• The FilePolicy job enforces just the SmartPools file pool policies, as opposed
to the storage pool settings.
However, the vast majority of the time SmartPools and FilePolicy perform the same
work. Disabled by default, FilePolicy supports the full range of file pool policy
features, reports the same information, and provides the same configuration
options as the SmartPools job.
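Because FilePolicy is disabled by default and depends on the index maintained by IndexUpdate, a minimal sketch of enabling and running it from the CLI follows; the --enabled option name is an assumption and should be verified on your OneFS release.

# Enable the FilePolicy job type, then build or refresh the index it relies on
isi job types modify FilePolicy --enabled=true
isi job jobs start IndexUpdate

# Apply file pool policies using the index instead of a full SmartPools tree scan
isi job jobs start FilePolicy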
• In-line data reduction, introduced in OneFS 8.1.3 for the F810 platform, will not
affect the data stored in a snapshot.
• However, snapshots can be created on compressed data. If a compression tier
is added to a cluster that already has a significant amount of data stored in
snapshots, it will take time before the snapshot data is affected by compression.
• Newly created snapshots will contain compressed data, but older snapshots will
not.
The snapshot storage target setting is applied to each file version by SmartPools. When a snapshot is taken, the storage pool setting is simply preserved, which means that the snapshot is initially written to the default data pool. The SmartPools job subsequently finds the snapshot version and moves it to the intended pool during the next scheduled SmartPools job run.
When using SmartPools, snapshots can be stored on a different disk tier than the one the original data
resides on.
SyncIQ also leverages snapshots for the consistency points required to facilitate
replication, failover, and failback between PowerScale clusters. This means that
only the changes between the source and target datasets need to be replicated between the two clusters. This enables efficient replication and granular recovery objectives. The snapshots generated by SyncIQ can also be used for archival purposes on the target cluster.
SyncIQ creates snapshots on the source cluster to ensure that a consistent point-
in-time image is replicated, and that unaltered data is not sent to the target cluster.
• SyncIQ replicates data according to the snapshot rather than the current state
of the cluster, allowing users to modify source-directory files while ensuring that
an exact point-in-time image of the source directory is replicated.
• SyncIQ can also replicate data according to either an on-demand or scheduled
snapshot generated directly by SnapshotIQ. If data is replicated using a
SnapshotIQ snapshot, SyncIQ does not generate another snapshot of the
source directory.
• SyncIQ generates source snapshots to ensure that replication jobs do not
transfer unmodified data. When a job is created for a replication policy, SyncIQ
checks whether it is the first job created for the policy.
− If not, SyncIQ compares the snapshot generated for the earlier job with the
snapshot generated for the new job.
• SyncIQ replicates only data that has changed since the last time a snapshot
was generated for the replication policy. When a replication job is completed,
SyncIQ deletes the previous source-cluster snapshot and retains the most
recent snapshot until the next job is run.
When a replication job is run, SyncIQ generates a snapshot on the target cluster to
facilitate failover operations. When the next replication job is created for the
replication policy, the job creates a new snapshot and deletes the old one.
• If a SnapshotIQ license has been activated on the target cluster, you can
configure a replication policy to generate additional snapshots that remain on
the target cluster even as subsequent replication jobs run.
• SyncIQ generates target snapshots to enable failover on the target cluster
regardless of whether a SnapshotIQ license has been configured on the target
cluster.
• Failover snapshots are generated when a replication job completes. SyncIQ
retains only one failover snapshot per replication policy and deletes the old
snapshot after the new snapshot is created.
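As a sketch of how a replication policy that generates these snapshots is defined from the CLI, the commands below create and run a policy. The policy name, target host, and paths are illustrative, and the argument order and options should be verified against your OneFS release.

# Create a sync policy from a source directory to a target cluster and path (names are illustrative)
isi sync policies create media-dr sync /ifs/core/data/media target-cluster.example.com /ifs/core/data/media

# Run the policy now and review the resulting replication report
isi sync jobs start media-dr
isi sync reports list | head -10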
The disaster recovery plan documents the information needed for an organization
to react and act during a disaster scenario.
There is no single plan, as each organization has its own priorities and recovery objectives.
• DFS
• Client impact
2: What workflows are critical for the business and how long can they be offline?
The plan design ranks the organization’s workflows and applications, determining
recovery objectives for each.
3: Each workflow should have a runbook that outlines a step-by-step process for
recovery.
4: The plan is holistic, meaning all the organization’s functional areas in all affected
facilities are considered.
6: Test and validate the strategy. Do not just design and set it up; teams must test it and ensure that it works. Include call trees, response actions, and what-if scenarios in the test plan. Testing can be done as a walkthrough or a full interruption. Failback can take an unexpectedly long time to complete, exceeding testing windows. Whereas a failover takes 30 seconds, a failback could take a week, depending on the size and the amount of changed data. The test plan should be dynamic, and it should be reviewed and updated periodically.
In my organization, what
scenarios apply to my
workflows? Why?
Several factors can help determine what type of scenario applies to a workflow.
First is analyzing the criticality of the workflow.
Asynchronous Replication:
Synchronous Replication:
• No downtime workflows
• Every minute of downtime costs the business money
Media Recovery:
What kind of data protection must the organization maintain to meet an acceptable
level of disruption? Is the data or access to the data important enough to warrant
the cost of a hot site? Is any amount of data loss detrimental to the business? How
important is it to business continuity if the workflow is down? Will the business lose
clients, jobs, money, or reputation if the data is inaccessible for extended periods?
SyncIQ replicates asynchronously and is not a high availability disaster recovery
strategy.
Scenario
The scenario highlights a lightweight disaster recovery plan for a media and
entertainment organization.
Workflow analysis:
• Workflow - media directory for file sharing and protection
• Business continuity risk is low
• Disaster recovery solution in line with business continuity
• ROI appropriate for workflow
• Dependencies
• Administrator makes recommendation, management decides
Solution
The remote disaster recovery solution shows the workflow that is replicated to a
remote office.
Plan to Support
The disaster recovery plan for the lightweight workflow should include or link to
disaster recovery runbook for the workflow.
Workflow SOP585:
• Disaster declared
• PowerScale administrator follows procedures to fail over
• Assessment made for non-critical data
Scenario: The users edit and update media files, audio and video editing and then
save their work on the /ifs/core/data/media directory on the PowerScale
cluster. Analysis on the workflow concludes that the downtime for the share has a
low risk to business continuity. The workgroup can do their work during extended
downtime of the core cluster or facility. If the core cluster is lost, the users can still
do their job, but may have to recreate some of the work. The PowerScale
administrator makes the recommendation for the workflow’s disaster recovery plan,
management decides. Keep in mind that decisions are made holistically,
considering other areas such as client, network, and critical workflows.
Solution: When a failure happens and access is switched to the remote office, users can access the previous day's work. Losing some work is painful for the users, but does not impact the business. Users may need to re-create some of the work. A diagram like this may be found in the disaster recovery plan, but with granular details noted, such as switches, operating systems, and versions.
The disaster recovery plan for the lightweight workflow should include or link to
disaster recovery runbook for the workflow. The RTO for the media workflow in the
disaster recovery plan is 24 hours. The runbook details the communications,
actions, and checks that the PowerScale administrator must perform. Prolonged
downtime to the source cluster may warrant implementing systems and restoring
from tape at the target site.
Restricted Downtime
Scenario
The restricted downtime scenario features a hospital that has a central repository
for patient records stored on the PowerScale cluster.
Workflow analysis:
• Workflow – patient data records
• Business continuity – delayed access is acceptable
• Disaster recovery solution in line with business continuity
Solution
The remote disaster recovery solution shows the workflow that is replicated to a
remote office.
Plan to Support
The graphic shows how the high-level milestones for the plan may look.
Workflow SOP:
• The disaster recovery teams for the hosting IT company are notified and
mobilized.
• The workflow is failed over, verified and access is confirmed.
• PowerScale administrator ensures that the target directory is accessible.
• The runbook details the communications, actions, and checks that the
PowerScale administrator must perform.
• Because archive-type data is tiered to the cloud, restores from tape at the target are not needed.
Scenario: The restricted downtime scenario features a hospital that has a central
repository for patient records that are stored on the PowerScale cluster. The cluster
is shared with physicians and their staff. The staff accesses the patients records to
get information and medical history. The patient record data is kept in the
/ifs/source/data/records directory. In a disaster, the inability to immediately access
records is not an emergency that hurts the business. The staff and physicians can
still see the patient and meet their needs.
Scenario
Financial trading:
• Workflow - ticker data analysis
• Delayed access to the data places the business at risk
Solution
Plan to Support
Workflow SOP:
• The disaster recovery teams are notified and mobilized.
• The failover is automated, with HA servers across sites.
• The recovery teams verify and test for access.
• The organization’s critical systems such as the analytic tick servers that process
real-time data have automated failover with no downtime.
The graphic shows how the high-level milestones for the tick data workflow may look.
The organization collects, stores, and analyzes the data of all their current
investments to quickly react to changing markets and to maximize profits. The tick
data analytics act on real-time data, near real-time data, and historical data.
Ticker servers process and analyze the real-time data and then write data to the
PowerScale cluster on a schedule. Near real-time and historical data is accessed
on the PowerScale cluster. In a disaster situation, downtime on the ticker servers
can mean lost opportunities and lost clients for the business. For disaster
protection, the ticker servers form a cluster from both sites, providing no downtime
for real-time processing. Near real-time and historical data is replicated from the
source PowerScale cluster to the cluster at the target site.
Listed are the teams and roles that should be part of a disaster recovery plan.
The PowerScale administrator on the disaster recovery team performs the recovery that gives users and applications access to data on the cluster. The PowerScale administrator follows the steps that are defined in the SOP for each workflow. A key to meeting the recovery milestones is communication and understanding the dependencies between the technologies. For example, the PowerScale administrator cannot verify user access to a failed-over SMB share if the network team has missed its milestone. If the organization is small, a single individual can handle multiple roles.
Cold Site
Warm Site
This scenario required a tighter RTO solution. Thus, the organization opted for a warm recovery site.
Cold Site: The plan details the actions and personnel that are needed from the
moment the incident is detected to the time the incident is resolved. In this
scenario, the secondary site is a cold site and given the 24-hour RTO, it is likely the
secondary site also stores the tape backups. The one hour between incident
detection and team activation may be due to the need to analyze the extent of the
problem. The incident may be a pervasive virus that could not be quarantined or
isolated. A disaster is declared at the incident plus 3 hours. This time may be built
into the plan, stating that a decision must be made at or before this milestone.
Once the disaster is declared, the respective teams are mobilized and the data
recovery begins. After access is restored, the teams provide reports on what
worked, what did not work, and how to improve.
Warm Site: Like the previous scenario, the plan is activated at about three hours
from the time the incident is detected. A four-hour RTO is not enough time to
recover data from tape at the secondary site. Here the teams switch over the functions to the warm site, and the warm site then hosts the data for the business.
1 2 3
1:
• Earthquake - likely
• Heat wave - likely
• Drought - likely
2: Diverse Genomics
3:
Resilience: One of the deciding factors for Diverse Genomics implementing the
PowerScale cluster is the rich set of high availability features. PowerScale can also hot swap components, keeping data available while servicing the cluster. The
SyncIQ feature replicates data to the secondary Diverse Genomics data center
located in Tennessee. If a disk or node fails, the components can be replaced with
no disruption. If a node’s network card fails, the card can be replaced with no
downtime. Data replicates to the secondary site, and a failover is tested every 60
days. When data reaches a specific age, it is automatically moved to cloud storage.
Recovery: The Diverse Genomics recovery plan is to bring down the primary site if
possible and fail over to the secondary site. The key individuals are identified in the
disaster plan. The risk at the recovery site in Tennessee is severe storms and
though not likely, tornados. To prevent a perfect storm scenario, the secondary site
has generators and fuel for backup power. Most management and monitoring can
be handled remotely, but if a disaster strikes, key responsibilities are shifted to
designated personnel at the secondary site.
InsightIQ
InsightIQ Overview
587 It integrates with the PowerScale OneFS operating system to collect and store performance and file system analytics data.
588 It is compatible with multiple versions of OneFS and can be configured to monitor one or more PowerScale clusters at a time.
589 It uses a web browser interface. Some command-line interface commands are used for InsightIQ configuration changes and troubleshooting.
InsightIQ provides tools to monitor and analyze historical data from PowerScale
clusters. Using the InsightIQ web application, you can view standard or customized
Performance reports and File System reports, to monitor and analyze Isilon cluster
activity. You can create customized reports to view information about storage
cluster hardware, software, and protocol operations. You can publish, schedule,
and share reports, and you can export the data to a third-party application.
InsightIQ versions: 3.0x, 3.1x, 3.2x, 4.0x, 4.1x
Older OneFS versions may not contain some functionality that is required for all
current InsightIQ features. A significant improvement was made in file system
analytics (FSA) capabilities in OneFS 8.0. The FSA statistics require OneFS 8.0 or
higher, and InsightIQ 4.0 or higher. The InsightIQ and OneFS compatibility matrix is
available in the PowerScale Supportability and Compatibility Reference Guide.
• The configured NFS export must grant the root user write access590 for the
specified InsightIQ virtual machine IP address.
590This configuration enables InsightIQ to mount the cluster or server and create
the necessary directories and files on the cluster or server. InsightIQ connects to
the NFS host as the root user.
591Verify that the parent directory of the datastore is configured with a permission
setting of 755 (read/write/execute for root or the owner and read/execute for group
and others) or higher.
592All the files in the datastore directory must be configured with a permission
setting of 744 or higher. If the issue persists, verify that the directory's owner and
group settings are correctly configured. For an NFS datastore, the owner:group
setting must be nobody:nobody. For a local datastore, the owner:group setting
must be root:root.
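As a minimal sketch (the datastore path is hypothetical; run these on the NFS host that exports the datastore), the permission and ownership requirements above could be applied as follows:
# parent directory of the datastore: 755 or higher
chmod 755 /ifs/insightiq_datastore
# files in the datastore directory: 744 or higher
chmod 744 /ifs/insightiq_datastore/datastore/*
# for an NFS datastore, the owner and group must be nobody:nobody
chown -R nobody:nobody /ifs/insightiq_datastore/datastore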
Dashboard
The user interface is separated into four major sections, the Dashboard,
Performance Reporting, File System Reporting, and Settings.
Cluster Overview
The Dashboard is an at-a-glance view of real-time cluster health and vital cluster
statistics.
You can quickly view capacity and performance for all connected clusters.
The InsightIQ Dashboard, available through the InsightIQ web application, shows an overview of the status of all the monitored clusters. The Cluster Status summary includes information about cluster
capacity, clients, throughput, and CPU usage. Current information about the
monitored clusters appears alongside graphs that show the relative changes in the
statistics over the past 12 hours.
The Aggregated Cluster Overview section displays the total or average values of
the status information for the monitored clusters. An aggregate view is available for multiple clusters, and a cluster-by-cluster view for each individual cluster. Each view
displays a capacity snapshot, key trends for connected and active clients, network
and file system throughput, and CPU usage.
This information can help you decide what to include in a Performance report. For
example, if the total amount of network traffic for all the monitored clusters is higher
than anticipated, a customized Performance report can show you the data about
network traffic. The report can show you the network throughput by direction, by
using breakouts, to help you determine whether one direction of throughput is
contributing to the total more than the other.
CPU usage provides the most interesting performance statistic relative to cluster
activity. When an anomaly is identified on the Dashboard, use the performance
reporting, or file system reporting to analyze the specific details.
Performance reporting enables live activity for viewing with graphic representations,
using on-demand live generation of reports. Data continues to plot while viewing
the output.
Link: See Isilon InsightIQ 4.1.3 User Guide for more information.
From the Report Type drop-down list, you can select the desired report template. Live reporting uses any saved standard or custom report template to display
performance data. The Date Range option provides a mechanism to generate
scheduled reports. The report can be based on the current data, or data from a
specified time period. The Zoom Level of the report determines the granularity of
the data displayed.
The Network Performance report displays the health and status of the cluster's network.
Scroll down in the report to display the available data series generated by the selected report. Then select a breakout category to view the desired metrics.
Breakout categories593
593 It includes client, protocol, op class, direction, interface, node, node pool, or tier.
In the example, data is spread across the nodes and is displayed below the chart. Individual lines are displayed for each node. The higher the activity, the darker the time segment displayed.
Chart data displays the averages, and the breakout areas display the details by the
breakout that is selected from high to low.
Saving as CSV
594 The "+" enables you to break out one more level down or to regroup data by an additional breakout category. An element of the breakout can be broken down into another breakout category.
Video: See Viewing protocol operations latency demo video for more
information.
Breakouts provide heat maps that display variations of color to represent each
component's contribution to overall performance. The darker the color on a heat
map, the greater the activity for that component. Heat maps help you to visualize
performance trends and to identify periods of constrained performance.
If you hover the mouse pointer over any location on a heat map, InsightIQ shows
data for the specified component at that moment in time. Breakouts are sorted by
components that are based on level of activity, with the most active elements at the
top of the list.
Performance Reports
Create performance report templates if a standard template does not meet the
desired requirements.
Many organizations create specific reports for groups monitoring specific functions.
Custom reports are created either from a blank template or by modifying an existing template to meet the organization's requirements.
Scheduled Reports
• Live595
• Scheduled596
Configure up to ten email recipients to receive the report as a PDF file, or access
the reports online from the Manage Performance Reporting page.
595 Live reports are then displayed and available in the Live Reporting window.
596Scheduled reports are generated at a specific time and these reports can be
sent as a PDF by email on a scheduled cadence. If starting with a blank template,
choose the modules to be included.
Scheduled Reports
FILE SYSTEM REPORTING tab is used to examine the data capacity, data
distribution, deduplication, and quotas on the cluster.
Capacity: Provides an overview of cluster usage by storage location.
Trend
File system analytics597 requires the FSAnalyze job to be run on the cluster.
• The FSA Report field is updated each time the job is run.
• The job is a regularly scheduled job in OneFS.
Quotas
If SmartQuotas are licensed on the cluster, FSA captures the quota status when
the FSAnalyze job is run. Quota reports can be viewed with InsightIQ.
597 The File System Analytics feature allows you to view File System reports. When
File System Analytics (FSA) is enabled on a monitored cluster, a File System
Analytics job runs on the cluster and collects data that InsightIQ uses to populate
file system reports. You can modify how much information is collected by the FSA
job through OneFS. You can also configure the level of detail displayed in file
system reports through InsightIQ.
Video: See Capacity usage through FSA demo video for more
information.
Capacity - File Size Details: The file counts by file size provides details for
examining physical file sizes and logical file sizes on the cluster. It can be useful, when performing a data protection level analysis, to download the CSV files.
Trend Default Graph: The date range, time, and zoom level can be tailored to
isolate a particular period. The default graph displays existing trends in total usage.
Select other data to plot, such as total capacity, provisioned capacity, and writable
capacity. Note that writable capacity is calculated based on existing file size
distribution and the cluster's data protection level for the node pools.
Select the report that best meets the requirements similar to live or scheduled
performance reporting. Next select the FSA Report to use for the analysis. FSA
reports can be compared to one another.
Trend - Forecast Data Usage: InsightIQ 4.0 includes the capability to forecast data usage to help plan for cluster expansion or data cleanup. The projection uses algorithm-driven estimations to forecast future capacity utilization. The time period to use is selectable, and FSA uses the selected range in the forecast calculations. To assist with charting normalization, select to eliminate outlier data points, and select to show the standard deviation as part of the plot.
598 If you limit the size of the InsightIQ datastore through a quota, InsightIQ cannot detect the available space. The datastore might become full before InsightIQ can delete older data to make space available for newer data.
Quota Reports: Quota reports display information about quotas created through
the SmartQuotas software module. Quota reports can be useful if you want to
compare the data usage of a directory to the quota limits for that directory over
time. This information can help you predict when a directory is likely to reach its
quota limit.
Local Users
LDAP
LDAP server configuration can be found under: Settings -> Users -> Configure LDAP tab
Once LDAP is enabled, InsightIQ checks the configured LDAP servers' users and groups for authentication.
599 Once a connection is made between InsightIQ and the LDAP server, you can add LDAP groups and users to InsightIQ.
Datastore move operations can fail if there is not enough free space on the target datastore or if an NFS connection gets interrupted.
• If the connection is permanently severed, you can recover the data only if you have created a backup of the datastore by exporting it to a .tar file.
• Back up the datastore600 before moving an InsightIQ datastore.
• Set a quota on the target location601.
600If you have created a backup, you can import the backup datastore to a new
instance of InsightIQ to recover the data.
601If a quota is applied to the target location and the quota is configured to report
the size of the entire file system instead of the quota limit, there is less space on
the target location than InsightIQ requires. The migration might fail. If this failure
occurs, InsightIQ automatically transfers the datastore back to the original
datastore.
Performance Tuning
Protocol Balance
• The isi statistics command is used to get the balance of the protocol traffic.
• Notice that the largest number of operations are SMB.
• A cluster with a predominant external protocol such as SMB or NFS can influence cluster settings such as on-disk identity, SMB server settings, or UNIX settings.
• A later output showing the balance shifting significantly could indicate inefficiencies if the cluster was tuned for a given protocol.
Connection Distribution
• The first isi statistics command queries the current NFS statistics and
the second the current Windows statistics.
• This example shows a nearly idle environment.
• As workflows are added and hosts are accessing the cluster, the output trends
accordingly. This output may also show unbalanced connections across nodes.
• If slow access becomes an issue and the output shows zero connections on node 3, it can help isolate an issue with node 3 or the network coming into node 3.
The graphic shows establishing the baseline by documenting the distribution of connections between the nodes: how many connected and active clients there are by protocol, with keys for the current connected SMB clients and the current active SMB clients.
• The protocol traffic is predominantly SMB, so use smb2 with isi statistics pstat to approximate the mix of read, write, and metadata components.
• If slow access becomes an issue, this output indicates whether the read, write, or metadata ratio has shifted, which could be a possible reason for the issue.
• If an application migrated to the cluster shifts the read/write ratio from 70:30 to 40:60, it may cause unforeseen latencies.
The graphic shows the top section of the output, indicating the protocol command
rates.
1:
Calculate:
• Total (1255.99) - Write (333.22) - Read (829.23) = Metadata (93.54)
• Read (829/1256) * 100 = 66%
• Write (333/1256) * 100 = 26.5%
• Metadata (94/1256) * 100 = 7.5%
Time in Queue
602 By examining the max, min, and average values for the disk time in queue, administrators can get a disk drive activity baseline.
Number in Queue
• $9 is the “Queued” field in the output.
• Metrics above 5 need attention.
• The command gets the average across all disks.
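A minimal sketch of that calculation, assuming the Queued value is the ninth whitespace-separated field of the isi statistics drive output on your OneFS release:
isi statistics drive --nodes all | awk '$9 ~ /^[0-9.]+$/ { sum += $9; n++ } END { if (n) printf "average Queued across %d disks: %.2f\n", n, sum/n }'
Per the callout above, averages that remain above 5 warrant attention.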
Percent Busy
• View and document the balance of disk operations across the nodes.
• In the example output, nodes 5, 6, 7, 8 are doing most of the work while the
other nodes are nearly idle. After investigation, an application is only using
nodes 5 through 8.
• Though the metrics are small and have no impact on production, continued
trending in this unbalance may need to be addressed.
• As a follow-up, use the isi statistics drive --nodes all --sort OpsIn,Drive command to analyze the node with the most disk operations.
Busiest Files
• Identify the top 15 files in use and their use rate using isi statistics.
• Each entry for the same path is a different event.
• For example, a read, getattr, lookup, or other operation.
• Multiple instances of the same path aggregate to indicate the total operation
rate for that path.
• Instead of using --nodes all to show output for all nodes, display the busiest files per node using the switch --nodes 1, which is useful when isolating a node with issues.
The output shows entries for identical paths with different operation rates.
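A hedged example combining the options above (verify the exact flags on your OneFS release):
# busiest files on node 1 only, limited to the top 15 entries
isi statistics heat --nodes 1 --limit 15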
Misaligned Writes
• Misalignment603 is at the file level, not at the file system level. Use the
isi_for_array command to view the misaligned writes.
• The output is shown in the upper box. After one minute, execute the command
as seen in the output of the lower box.
• Then calculate the difference between the misaligned write request counts from
each command.
• Divide this difference by the time between samples, 60 seconds.
• The result604 is the rate-per-second of misaligned write requests.
1:
604The 0.17 rate for node one is insignificant to performance. For the 8.2 rate, the
administrator may monitor the cluster and if misaligned writes begin to impact
performance, take action.
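A minimal sketch of the rate calculation, where counter_command stands in for the isi_for_array misaligned-write counter command shown in the course output (not reproduced here):
# sample the counter, wait 60 seconds, sample again, then divide the delta by 60
c1=$(counter_command)
sleep 60
c2=$(counter_command)
echo "scale=2; ($c2 - $c1) / 60" | bc   # misaligned write requests per second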
Locking Events
• In a clustered array, you can expect some resource sharing and locking events.
• Use the isi statistics heat --totalby event,lin,path --limit 50 command to record the most recent 50 locking event counts605 when no performance issues occur to establish the baseline workload.
• Locking events are classified as Blocked, Contended606, and Deadlocked607.
• Excessive locking events can degrade performance.
606Blocked and Contended events tend to be correlated together. The new lock
requester is blocked, and the current lock holder gets the contended callback.
Blocked and contended locking events are expected and a storage administrator
may see hundreds or more depending on how busy the cluster is.
607 Deadlock events are different, with no timeout, and deadlock events should be
infrequent.
• Administrators must understand if hops exist between the cluster and a host
because each network hop adds latency.
• To display route and transit delays for packets over IP, run the traceroute or
tracert command.
• Excessive hops may indicate a network issue.
• Increased hops may indicate a breakdown somewhere in the network model.
The graphic shows traceroute output examples using 5 probes and 3 probes.
• Administrators can record the baseline of packet latency and loss to the target
using the ping command from either Windows or Linux.
• The ping statistics provide a solid baseline to use.
• If encountering problems, compare the statistics to a later execution.
• Check for dropped packets and significant increases in network metrics.
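For example (the hostname is hypothetical), a fixed-count ping from either platform produces a summary of packet loss and round-trip times to record as the baseline:
# Linux client: send 100 echo requests
ping -c 100 cluster.smartconnect.example.com
# Windows client: equivalent sample
ping -n 100 cluster.smartconnect.example.com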
Bandwidth
Jitter
• Jitter is the difference in packet delay between the target and the source.
• Excessive jitter can be the result of network congestion, improper queuing, or
configuration errors.
• To measure Jitter, use iperf. iperf sends the UDP packets between two
hosts running iperf.
The graphic shows that the delay between packets can vary instead of remaining constant, indicating an issue on the host.
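A minimal sketch, assuming iperf version 2 is installed on both hosts (flags differ slightly in iperf3):
# on the target host, start an iperf UDP server
iperf -s -u
# on the source host, send UDP traffic for 30 seconds; the report includes jitter and packet loss
iperf -c <target-host> -u -b 10M -t 30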
Retransmission Rate
608 The output shows that the retransmission rate is less than 0.1%, meaning this
retransmission is not significant and not an issue. Retransmission rates above 3%
negatively affect user experience.
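On a Linux client, one hedged way to approximate the retransmission rate is to compare the TCP segment counters before and after a test window (counter wording varies by operating system):
# total segments sent and segments retransmitted since boot
netstat -s | egrep -i 'segments sent out|segments retransmit'
# retransmission rate (%) = retransmitted segments / segments sent out * 100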
1:
Use the isi status command and observe the “Out” column of throughput to
assess throughput balance across the nodes in relation to the IP connections.
List of cluster and node commands that are used for the baseline.
• Lists of all the IP addresses that are bound to external interfaces.
• isi network interfaces list
• To view the SMB open files list, run the following two commands. Having many open sessions can impact resources, especially RAM.
• isi smb session list
• isi_for_array -X 'isi smb openfiles list -v --format=csv
--no-header --no-footer'
• To get information about NFS locks on the array: this applies to NFSv3 only and displays a list of NFS Network Lock Manager (NLM) advisory locks. If users are unable to access files, the command can help determine or isolate locking issues.
List of commands for viewing the baseline for cluster and nodes.
• Capacity commands display the free capacity on the cluster and the storage
pool. Reaching 100% capacity locks up the cluster; take proactive measures to prevent reaching 100% capacity.
• isi status --quiet
• isi storagepool list
• isi status -p
• Lists the 10 processes using the most CPU on each node
• isi_for_array -s -X 'top -n -S 10'
• The two memory commands show the status of memory for each node.
Monitoring node CPU and memory can identify a resource unbalance between
nodes.
Performance Benchmarking
Benchmark Overview
• Protocol Latency609 - measures the time from when a request is issued to the
time when the response is received.
• Data Throughput610 - measures the data transfer rate to and from the storage
system in megabytes/gigabytes per second.
• OPS611 - measures the number of operations performed at a protocol level per
second.
609 Latency can be measured at various points and where you measure can help
identify performance issues. The latency measured at the client side provides a
holistic view which encompasses latency in the client, network and storage.
Latency when measured by the storage system normally includes only the latency
of the storage system and excludes the network and client.
610 Throughput can be measured either at the client or the storage side. These values should be identical or very close to each other, unlike latency measurements.
611 One OPS is not the same as one IOPS. Traditional SAN storage systems measure performance with IOPS. For NAS systems using SMB or NFS, performance is normally measured in OPS. The distinction is important because a single SMB or NFS operation can cause many actual disk I/Os to occur. For example, a single request for a directory listing operation like READDIRPLUS in NFS generates a large amount of disk I/O.
The important criterion is that the tool or method that you use is repeatable and produces consistent results.
A good benchmark must help determine the suitability of a storage system to run the application under consideration.
SPEC SFS 2008 - For file-based storage; monitors server throughput and response time.
SPEC SFS 2014 - For file-based storage; updated from SPEC SFS 2008 and used to measure an end-to-end storage solution for specific applications.
mdtest and IOR - mdtest is used for testing metadata operations, while IOR is designed to test streaming data.
Why Benchmarking?
Qualifying Questions
Vdbench Overview
• Vdbench can specify metadata operations as part of workflow and not just data.
Many workflows have more than 50% metadata operations.
• Vdbench is capable of easily modeling complex datasets and I/O patterns. You can model several dataset and I/O pattern sets and then combine them into a larger test.
− Example: You could have one test that does large file sequential access and
another test that does small file random I/O. Then, you can combine them
into a single test that runs both.
• For both SMB and NFS, verify that there is an SMB share or NFS export
configured for /ifs on PowerScale with the appropriate read and write
permissions. Starting with OneFS 9.0, there are no default shares or exports
configured.
• For performance benchmarks, enabling the run-as-root option for SMB and map-root-to-root for NFS simplifies configuration by bypassing security. If this security bypass is not acceptable, such as in a production environment, ensure that the benchmark directories are properly configured for read and write access by the clients.
• For best performance, a client should be connected to one node in the
PowerScale cluster. The client connections should ideally be balanced across
the number of available nodes. Each client should mount PowerScale using the
same path. This will simplify the vdbench profile configuration.
Anatomy of a Profile
• A profile or workload parameter file defines how the benchmark tool will run.
• It is defined by combining four sections: Host Definition (HD), File System
Definition (FSD), File System Workload Definition (FWD), and Run
Definition (RD).
• There are also general sections which contain parameters related to
deduplication, journaling, data validation and so on.
• The sections need to be in a specific order as later sections reference prior
sections. The required order is General -> HD -> FSD -> FWD -> RD.
1: Specifies the hosts that will participate in generating load. The HD section
defines which clients will run the benchmark.
2: Specifies the file system structure that the test will run across. This section
defines the dataset that the benchmark will operate over.
4: Specifies the actual test to execute including parameters like duration and target
OPS.
Each section follows the pattern of key-value pairs. There is a key that defines the section to which the parameter belongs, followed by a label and then additional options. There is a special label called default that can be used in each section to
explicitly set the default values so they do not need to be repeated for each
following entry.
You will normally want one HD entry for each client on which you want to run the benchmark.
The same physical or virtual client can have more than one host entry, which lets you assign specific work to that client and potentially do more work.
1:
• Host Label - each host has a host label. The label uniquely identifies a client.
Special labels such as default and localhost can also be used.
• System - when running Vdbench in a multi-host environment, you can specify
the FQDN or IP address of the host.
2: The example shows the use of the default label as well as the individual entries.
The default values defined above will be applied to each HD entry below. There are
2 hosts defined for running the benchmark and each host will use SSH for the
access method using the root user. Both hosts have Vdbench installed in the
same directory.
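A sketch of what that HD section might look like (the IP addresses and Vdbench install path are hypothetical):
* defaults applied to every HD entry that follows
hd=default,vdbench=/opt/vdbench,user=root,shell=ssh
* two load-generating clients identified by IP address
hd=host1,system=192.168.10.11
hd=host2,system=192.168.10.12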
You can define one or more FSD to customize the size and distribution of the test
data612. Having multiple FSDs is especially useful in replicating a complete
directory structure for your testing. When an FSD is not used by any test, Vdbench
will not create the dataset. For many benchmarks, the file structure usually ends up
being very simple because defining something complicated is difficult.
612A single test can operate across a single or multiple FSDs. You can model a
shallow and wide directory structure and a deep and narrow directory structure at
the same time. You can define the different file counts and size distribution as well.
3
4
1:
• Label: each FSD requires a unique label to identify this FSD for use.
• Anchor: specifies the parent directory where the test files are created.
• Shared: determines if the FSD being defined is shared between all the clients or if each client will have its own isolated dataset.
• Width: specifies the number of directories to create in each parent directory.
• Depth: specifies the number of levels deep to create the directory tree.
• Files: specifies the number of test files to be created in the lowest level
directory.
• Sizes: specifies the size of each test file. Sizes can be a single number
(Example: 50M) or it can be a size distribution which consists of a size and
percentage pair (Example: 1k,30,8k,40,32k,20,1024k,10). The percentages
need to add up to 100.
• Distribution: specifies whether to create files only in the lowest level directories
or in all directories.
2: The example shows creating an FSD named fsd_1. Using a numeric suffix enables the use of globbing such as * when referencing FSDs later in the file.
The FSD will create 2000 files, each 1 megabyte at the lowest level directories in
the directory tree shown in the graphic.
In the example, shared is set to yes which means that all hosts will perform I/O in
the same directory structure. This does not mean that 2 hosts will read and write to
the same file. Instead, for performance and simplicity reasons, every host will work
on a subset of the files in the shared file structure. For example, if you have 2 hosts
and 100 files, one host will get all the odd-numbered files while the second host will
get all the even-numbered files. The set of files that a host gets is predetermined
using an algorithm. When shared is set to no, each host running against this FSD
will create their own working directory and file set.
3: By default, the distribution parameter is set to bottom. The test files are only created at the bottom-level directories, otherwise known as leaf directories. For the above example, the size of the dataset is 2² * 2000 * 1 MB = 8000 megabytes (four leaf directories, each holding 2000 files of 1 MB).
4: With distribution set to all, the specified number of test files is created in each directory of the directory tree.
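A sketch of an FSD matching the example above (the anchor path is hypothetical; a width and depth of 2 produce the four leaf directories):
* shared dataset: 2 levels deep, 2 directories wide, 2000 files of 1 MB in each leaf directory
fsd=fsd_1,anchor=/mnt/powerscale/bench,depth=2,width=2,files=2000,sizes=1m,shared=yes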
FWD defines the type of test that will be performed. The FWD ties together the
hosts (HD), the file system definition (FSD) and the type of operations you want to
perform. The FWD is also where you tie together a workload with the data that the
workload acts upon.
When doing a test run, you can have a single line FWD, but that only allows very
limited flexibility. Normally, you will have multiple FWD lines to perform different
operations each at a percentage of the total operations.
1:
• Label: specifies a unique name for the FWD entry. The default label can be
used to serve as default values for all following FWDs.
• Host: specifies the hosts on which the workload will run.
• File System: specifies the names of the FSDs to use to run the workload.
• Operation: specifies a single file system operation that must be done for this
workload. Example: read, write, access, GETATTR, SETATTR, open, close.
You can also create a sequence of operations.
• Transfer Size: data transfer size used for read or write operations. Example:
1M,64k,(8k,50,128k,50)
• File I/O: specifies the type of I/O that needs to be done on each file, either random or sequential.
• Thread: number of concurrent threads to run for this workload. Ensure that there is at least one file for each thread.
• Skew: specifies the percentage of the total work for a particular FWD. The total
skew percentage must add up to 100.
You can define small transfer size I/O on a small file dataset and large block
sequential I/O for a large file dataset. The skew number can be used to easily give
an operation a certain percentage of the total workload. If it is not specified,
Vdbench will try to evenly distribute I/O operations across all the FWDs. For most
workloads, the file I/O pattern should be sequential. Files are normally written or
read as a whole. Random I/O is used only when you have a workflow that reads or
writes random segments within a single file, like a virtual disk image.
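A sketch of FWD entries that split the work between sequential reads and writes (all values are illustrative only):
* defaults shared by the workload definitions below
fwd=default,fsd=fsd_1,host=*,xfersize=64k,fileio=sequential,threads=8
* 70% of operations are reads, 30% are writes
fwd=fwd_1,operation=read,skew=70
fwd=fwd_2,operation=write,skew=30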
RD defines which FWD to run for a single test run, how much I/O to generate and
how long to run the workload. When you have more than one RD entry, Vdbench
will run those tests in sequence. This allows you to set up a very long test cycle with
multiple tests and then have it automatically run through each one without user
interaction.
1:
• Workload: the set of FWDs to execute in the test. To define multiple FWDs, you
can use globbing such as fwd*, or by specifying them individually like
(fwd1,fwd2,fwd8).
• Elapsed Time: the amount of time that this test will run in seconds. The default
value is 30 seconds.
• Interval: the number of seconds between each status report.
• Forward Rate: this is the total number of OPS that the system will attempt to
generate across all the FWD. It can be specified as a single value, a set of
values, or a range with increment. There is a special label max that requests
Vdbench to run as fast as it can.
• Format: specifies how Vdbench will work with the FSD. For large file systems,
you do not want to recreate the entire file system for every run.
2: The example defines an RD to run the workload for fwd_1 for 20 minutes. The
number of operations per second specified is 5000. With format set to restart,
Vdbench will create only files that have not been created and will also expand files
that have not reached their proper size.
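A sketch of the RD just described: run fwd_1 for 20 minutes at 5000 operations per second, with the reporting interval assumed to be 30 seconds:
rd=rd_1,fwd=fwd_1,fwdrate=5000,format=restart,elapsed=1200,interval=30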
• As Vdbench is a Java program, it can run across multiple platforms, even at the
same time.
• The Vdbench primary node can connect to the load generator hosts through 3
different methods: RSH, SSH and vdbench internal RSH.
• Each host can be connected through a different method.
• RSH and SSH can be configured to not require a password. Also, no manual startup is required.
• Basic syntax to run Vdbench: ./vdbench -f profile_name.txt
1: The command runs the profile.txt file. The -f option is used to specify one
or more profile files to be executed. The profile file contains all the workload
parameters.
2: The -t option is used to run a demo I/O workload. A small temporary file is
created and a 50/50 read/write test is executed for just five seconds. This is a way
to test that Vdbench has been correctly installed and works for the current
operating system platform without the need to first create a profile file.
3: The -tf is used to run a demo file system workload without the need to create a
profile file.
4: The -o option is used to specify the output directory. When you do not specify
this, a default directory called output will be used for every run. You generally want
to use the -o option so you do not overwrite previous benchmark runs. In the
example, the command executes the test1.txt profile and creates the output
directory named test.
5: Adding the + symbol at the end of the output directory names creates directories
with increasing numbers appended starting from 001. In this example, if the output
directory test does not exist, the command will create the output directory named
test. If test already exists, the output directory created is named test001.
6: Adding .tod at the end of the directory name creates output directories with the
timestamp (yyyymmdd.HHMMss) appended. This is useful when you want to
know when a particular run was performed. In the example, the output directory
name would be something like test20200909.031534
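Combining the options above, a run might be launched as follows (the profile name is hypothetical):
# run the test1.txt profile and write results to a timestamped output directory
./vdbench -f test1.txt -o results.tod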
Vdbench Output
• Once a run is complete, a large number of files are created in the specified
output directory.
• The tool outputs HTML that can be loaded into a standard web browser for
easier navigation.
• The detailed output of a run is in a column format. All throughput values are in
MB/s.
• The three most important files to analyze after a run include: summary.html,
totals.html and histogram.html.
summary.html - Has hyperlinks to each individual host, each of the individual FWDs defined in the profile, each RD in the profile, and so on.
totals.html - Shows a summarized output of all the test runs without the intermediate reporting. It provides just the totals and is a good file to look at when getting data for creating a graph.
histogram.html - Shows the distribution of latency for reads and writes combined. It can be directly read into Excel as a tab-delimited file for further analysis.
Benchmarking Considerations
The benchmarking process can be easily manipulated because of the large number
of variables that influence performance results. To level the playing field, test
results need to be categorized by product type, configuration standards need to be
defined for each category and vendors must strictly adhere to the configurations.
Some of the considerations include:
• Use the correct benchmark for your workload613. For example, do not use a block-based benchmark to test a NAS system.
• Your dataset should reflect your real workload. A benchmark with 10 large files
is not good if your real dataset has millions or billions of files.
• Your dataset must be large enough so that it does not fit into local client cache
or fully in the storage system cache. As a rule of thumb, the dataset must be
double the memory of the client and storage system cache combined.
• Do not use a single corner case in a real-world workflow as the only metric for a
benchmark.
• Simple benchmarks generally do not provide good results. Example: using a drag-and-drop file copy as a throughput test.
• When running a benchmark multiple times, take the average value instead of
the highest value.
613 You need to understand how each of the benchmarks works. Often, a customer
will use a benchmark because they have been using it for a very long time. That
does not mean that the benchmark is still relevant to their workloads today, and often it is not. There is a lot of inertia to reuse the same benchmark for every
storage system. You want to use a benchmark that models the customer’s workflow
as closely as possible.
Event Group
Event groups are collections of individual events that are related symptoms of a
single situation on your cluster.
Impact Policy
The relationship between the running jobs and the system resources is complex. A
job running with a high impact policy can use a significant percentage of cluster
resources, resulting in a noticeable reduction in cluster performance. Because jobs
are used to perform cluster maintenance activities and are often running, most jobs
are assigned a low impact policy. Do not assign high impact policies without
understanding the potential risk of generating errors and impacting cluster
performance. Several dependencies exist between the category of the different
jobs and the amount of system resources that are consumed before resource
throttling begins. The default job settings, job priorities, and impact policies are
designed to balance the job requirements to optimize resources. The
recommendation is to not change the default impact policies or job priorities without
consulting qualified Dell Technologies engineers.
Job Engine
The OneFS Job Engine is an execution, scheduling and reporting framework for
cluster-wide management of tasks.
Job Priority
A job can have a priority from 1 (highest) to 10 (lowest). If a low-priority job is running when a high-priority job is called, the low-priority job pauses, and the high-priority job runs. The job progress is periodically saved by creating checkpoints. When the higher-priority job completes, the checkpoint is used to restart the lower-priority job at the point where it paused.
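As a hedged illustration (verify the exact syntax for your OneFS release), the current priorities and impact policies can be reviewed from the CLI, and changed only with guidance from qualified Dell Technologies engineers:
# list job types with their priority and impact policy
isi job types list
# inspect a single job type in detail (the job name is an example)
isi job types view SmartPools
# change only when advised; shown commented out deliberately
# isi job types modify SmartPools --priority 6 --policy LOW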