02 - PowerScale Hardware Maintenance-SSP - Participant Guide
PowerScale Hardware Maintenance-SSP
Maintenance Considerations
Hardware Maintenance Basics
Electrostatic Discharge
Minimize Tool Use
PowerScale Nodes Compatibility
Drives and Drive Sleds
Knowledge Check
Hardware Monitoring
Indicator Lights
Knowledge Check
Chassis Replacement Procedure
Recommended Tools
Safety Precautions and Considerations
Knowledge Check
Assessment Introduction
Assessment Questions
Course Completion
You Have Completed This Content
Maintenance Considerations
The graphic shows a few basic reminders that are common to all
hardware maintenance procedures. If you encounter any difficulties while
performing this task, immediately contact Dell Technical Support.
2: Before disconnecting any cables, ensure that the Do Not Remove LED
on the compute module is off. When the LED is lit (white), the journal
of the node is still active. The Do Not Remove LED is on the right side
of the compute module and looks like a hand with a line through it. Do
not disconnect any cables until this LED is off.
4:
5: Save the packaging from the replacement part. Use this packaging to
return the failed part to Dell Technologies. A return label is included with
the replacement part.
To learn more, see the article "What are the required fields and data
items entered into internal FA reports that are written per CS request?"
7: After the work is complete, all service personnel, including
partners, must update the Install Database. Go to the SolVe Online >
Tools and Forms section and select Install Base Ticket. Complete the
form and submit the case. If you encounter any difficulties, contact
Dell Technical Support.
Electrostatic Discharge
Anti-static packaging
ESD kit
• Before touching a component, put one hand firmly on the bare metal
surface.
• After removing components from the anti-static bag, do not move
around the room or touch furnishings, personnel, or surfaces.
• If moving or touching something is necessary, first put the component
back in the anti-static bag.
Avoid movement
down. Blue handles indicate that the node should be shut down for the
maintenance procedure.
important for a few reasons1. Nodes in the same node pool may have
different size SSDs2 and different size RAM.
At the back of the chassis, the compute modules are labeled from right to
left, one to four, as shown. Because compute modules are installed in
pairs, called node-pairs, the minimum cluster size is four nodes, and
more nodes must be added in node-pairs.
SSDs with new, larger SSDs to allow more L3 cache space. This lets
customers better utilize storage resources. Every node in the pool must be
the same model, or of the same series or family. If the OneFS version is
earlier than OneFS 8.0, every node in the pool must also have the same
number of SSDs.
The graphic shows that the node pairs are either the left half or right half of the chassis.
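The slot labeling and pairing described above can be expressed as a small
lookup. The following Python sketch is illustrative only: the function
names and the "left half" and "right half" strings are hypothetical, and it
assumes that the chassis is viewed from the back, with slot 1 at the far
right.

```python
def node_pair_for_slot(slot: int) -> str:
    """Return the chassis half (viewed from the back) that holds a slot."""
    if slot not in (1, 2, 3, 4):
        raise ValueError("compute module slots are numbered 1 through 4")
    # Slots are labeled right to left, so 1 and 2 sit in the right half
    # and 3 and 4 sit in the left half.
    return "right half" if slot <= 2 else "left half"


def pair_partner(slot: int) -> int:
    """Return the other slot in the same node pair."""
    node_pair_for_slot(slot)  # validates the slot number
    return {1: 2, 2: 1, 3: 4, 4: 3}[slot]
```

For example, pair_partner(3) returns 4, because slots 3 and 4 share the
left half of the chassis.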
Drives
Internal to the 2.5" sled, there are individual fault lights for each drive. The
yellow LED associated with each drive is visible through holes in the top
cover. A supercapacitor can keep one light on for around 10 minutes while
the sled is out of the chassis. If more than one light is on (showing multiple
drive failures), the LEDs remain lit for a correspondingly shorter time.
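As a rough illustration of how the LED time shrinks with multiple drive
failures, the Python sketch below assumes that the supercapacitor charge is
shared evenly among the lit LEDs. That even split is an assumption made for
this example; the guide states only the figure of about 10 minutes for one
LED. The constant and function names are hypothetical.

```python
# Approximate time one fault LED stays lit, as stated in this guide.
SINGLE_LED_MINUTES = 10.0


def estimated_led_minutes(lit_leds: int) -> float:
    """Estimate how long the fault LEDs stay lit with the sled removed.

    Assumes (for illustration only) that the available time divides
    evenly among the lit LEDs.
    """
    if lit_leds < 1:
        raise ValueError("at least one LED must be lit")
    return SINGLE_LED_MINUTES / lit_leds
```

Under this assumption, two lit LEDs would stay on for about 5 minutes.
Treat the result as a planning estimate, not a specification.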
In the 3.5" drive sleds, the yellow drive fault LEDs are on the paddle cards.
They are visible through the cover of the drive sled and identify which
drive, if any, needs replacement. The graphic shows the 3.5" short drive
sled; the 3.5" long drive sled has four LED viewing locations.
Drive Sled
All twenty sleds can be individually serviced. Only remove one sled per
node at a time on running nodes. The typical procedure is:
3 The service request button tells the node that the sled will be removed.
The node prepares for the sled removal by moving key boot information
off the drives on the sled. The node suspends the drives in the sled from
the cluster file system and then spins them down. This maximizes
survivability during any further failures and prevents cluster file system
issues that are caused by multiple drives becoming temporarily
unavailable.
The graphic shows the lights and their information for the drive sleds.
Knowledge Check
Hardware Monitoring
isi_hwmon
OneFS has integrated support for the PowerTools Agent (PTA) and the
iDRAC Service Module (iSM) to support hardware monitoring in the
F200/F600.
CELOG
The OneFS Cluster Event Log (or CELOG) provides a single source for
the logging of events that occur in an Isilon cluster. Placing CELOG in
maintenance mode helps avoid receiving alerts or triggering dial-home
service requests while tests or planned activities are performed on the
PowerScale cluster.
4 The hardware components that isi_hwmon monitors on the F200 and F600
are: iDRAC Services, IDSDM, NVDIMM battery, NVDIMM persistence, chassis
fans, DIMM health, chassis intrusion, system thermals, system power
supplies, and system sensors.
Events are used to communicate a picture of cluster health for various
components. CELOG also provides a single point from which notifications
about the events are generated, including alert emails and SNMP traps.
Cluster events can be easily viewed from the WebUI5 or the CLI, using the
isi event events view command.
Placing the CELOG in maintenance mode does not affect client activity or
performance. The maintenance or test activity may affect client activity or
performance, depending on the type of activity. Upon the expiration of the
maintenance window specified, the CELOG is automatically removed from
maintenance mode.
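The maintenance window behavior described above can be modeled in a few
lines. The Python sketch below is a conceptual illustration only and is not
how OneFS implements CELOG. The function names are hypothetical, and the
sketch assumes that alerts are held back only while the current time falls
inside the window.

```python
from datetime import datetime, timedelta


def in_maintenance(start: datetime, duration: timedelta, now: datetime) -> bool:
    """Return True while `now` falls inside the maintenance window."""
    return start <= now < start + duration


def should_send_alert(start: datetime, duration: timedelta, now: datetime) -> bool:
    """Alerts resume automatically once the window has expired."""
    return not in_maintenance(start, duration, now)
```

With a two-hour window that starts at 09:00, an event at 10:00 would not
raise an alert, while an event at 11:00 would, because the window has
already expired.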
HealthCheck
6 Depending on the nature of the item, when the item is evaluated, either
each node in the cluster is checked or the cluster as a whole is checked.
Indicator Lights
Disk Drive
The hard drive carrier has an activity LED indicator and a status LED
indicator, which provide information about the hard drive status. The
activity LED indicator shows whether the drive is in use or not. The
status LED indicator shows the power condition of the drive.
Indicator: Blinking green and turns off
Condition:
• Hot plugging - blinks green five times at a rate of 4 Hz and turns off.
• PSU mismatch of efficiency, feature set, health status, or supported
voltage.
Deep Dive: See the SolVe Online Guide and search for the
proper node. Select any related HDD or SSD, or PSU
replacement procedures.
Knowledge Check
nodes side by side. To avoid losing the data stored on the node, replacing
a chassis involves moving the node’s internal components from the failed
chassis to a new chassis. The process requires following the steps exactly
and in order. Perform this procedure only when directed by Dell
Technologies PowerScale Technical Support.
The procedure is generated through SolVe Online in four steps (Step 1
through Step 4), ending with clicking GENERATE.
Recommended Tools
Do not operate the system without the cover for longer than five
minutes. Operating the system without the system cover can
result in component damage.
Knowledge Check
7 Identify your system by pulling out the information tag in front of the
system to view the Express Service Code and Service Tag. Alternatively,
the information may be on a sticker on the chassis of the system. The
mini–Enterprise Service Tag (EST) is found on the back of the system.
This information is used by Dell to route support calls to the proper
personnel.
3. Gather logs.9
4. Perform the component replacement or maintenance.
5. Update node firmware.10
6. Gather logs post-maintenance.11
7. Update the install database.12
If additional components, such as rack ears, are included when the system
board is received onsite, replace only the system board unless instructed
otherwise by Dell Technical Support.
9 Collect cluster logs before all maintenance procedures. Cluster logs
provide snapshots of the cluster, which you can review to ensure that
maintenance is successful. The isi_gather_info command collects
configuration and log information from a cluster and automatically uploads
it to Dell for processing. If the environment has a firewall in place that
blocks the uploading of the log information, contact a PowerScale Support
Engineer for a temporary FTP link to upload the logs.
10 It is recommended to update the firmware on the replacement nodes
with the latest node firmware package. Node firmware updates reboot only
one node at a time. If the customer cannot tolerate node reboots at the
time, schedule a time with the customer to update the whole cluster.
11 After completing maintenance on a cluster, gather the cluster logs.
12 After all work is complete, update the install database through the
The tabs show the high-level steps for node part replacements. Always
reference the appropriate PowerScale replacement guide for detailed
procedures.
Pre-Replacement Steps
SmartFail node, gather logs, and shut down the node are three steps that
must be performed before proceeding with the replacement procedure.
Replacement Steps
There are three steps that must be performed while replacing a node
component: cabling, extend node, and replace component.
• Cabling: Ensure that the node is powered off, and then label the
cables. Disconnect the power and I/O cables from the node.
• Extend node: Extend the node using the two slam latches on the left
and right sides. Once you extend the node, you can remove the cover
to get access to the internal FRU components.
• Replace component: The first step is to remove the cover. See the
replacement guide and the demonstration videos for detailed
procedures.
Post-Replacement Steps
There are four steps that must be performed after replacing a node
component: slide node into rack, cabling, HealthCheck and gather logs,
and update database.
• Slide node into rack: Once you replace the component and install the
cover, push the node inward until it locks into place. If necessary,
install the PSUs.
• Cabling: Reconnect the I/O cables and then the power cables
according to the labeling.
• HealthCheck and gather logs: As a best practice to ensure that the
system is healthy before declaring the node or cluster ready for
production, run health checks. Collect cluster logs after all
maintenance. Cluster logs provide snapshots of the cluster that you
can review to ensure that maintenance is successful.
• Update database: After all work is complete, update the install
database using the Business Services portal.
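The three phases above can be tracked as a simple ordered checklist. The
Python sketch below is one possible way to do that. The phase and step
names come from this guide, but the PHASES structure and the helper
functions are hypothetical. Always follow the appropriate PowerScale
replacement guide for the detailed procedure.

```python
# Phase and step names are taken from this guide; the structure and the
# helper functions are illustrative only.
PHASES = [
    ("Pre-replacement", ["SmartFail node", "Gather logs", "Shut down node"]),
    ("Replacement", ["Cabling (label and disconnect)", "Extend node",
                     "Replace component"]),
    ("Post-replacement", ["Slide node into rack", "Cabling (reconnect)",
                          "HealthCheck and gather logs", "Update database"]),
]


def flatten(phases):
    """Return every (phase, step) pair in the order it must be performed."""
    return [(phase, step) for phase, steps in phases for step in steps]


def next_step(completed_count: int):
    """Return the next (phase, step) pair, or None when all are done."""
    if completed_count < 0:
        raise ValueError("completed_count cannot be negative")
    steps = flatten(PHASES)
    if completed_count >= len(steps):
        return None
    return steps[completed_count]
```

For example, next_step(0) returns the SmartFail step, and next_step(10)
returns None because all ten steps are complete.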
Select each video to learn how to replace PowerScale F200 and F600
FRU components. The demos have no narration, so there are no
downloadable scripts.
F900 Nodes
Knowledge Check
Customers with systems that are configured for dial-home (SRS or email)
have a service request (SR) opened automatically. A customer without
dial-home capability can request a hardware replacement part by opening
an SR with Dell Technologies Customer Service.
The customer has the option to sign up for AutoCRU. Enrollment in the
AutoCRU program indicates that the customer can always replace a
customer-replaceable part, so no communication is required before a part
work order is created.
Select each video to learn how to replace PowerScale F200, F600, and
Accelerator Node CRU components. The demos have no narration, so there
are no downloadable scripts.
F900
Knowledge Check
Assessment Introduction
Assessment Questions
Question
1. Which of the following statements are true? Select all that apply.
a. Always review the replacement guide on SolVe Online for full
instructions.
b. Follow ESD procedures when handling components.
c. When performing replacements on multiple nodes, you should
perform each step on every node before moving to the next step.
d. To power off a node, press and hold the power button.
Question
Question
Question
a. Fans
b. Batteries
c. DIMMs
d. Power Supply
e. Drives
Question
Question
Question
7. About how many minutes can the drive fault LED in a Gen6 drive sled
stay illuminated when the sled is removed from the chassis?
a. 1
b. 5
c. 10
d. 15
Question
Question
9. What does it mean when the status indicator on a disk drive is off?
a. Drive is ready for removal.
b. Node is powering off.
c. Drive rebuild is complete.
d. Impending drive failure
Question
10. What does it mean when the status indicator on a PSU is flashing
amber?
a. Problem with the PSU
b. PSU is updating firmware.
c. Hot spare feature activated
d. Peer PSU failure pending
Question
d. Any individual familiar with the F900 node can perform a FRU.
Question