Cluster Admin Guide
Introduction
This document summarizes the HPC solution, including the architectural diagram, configuration, and management of the AMOGH HPC Cluster implemented at ADA Bangalore.
The supercomputer AMOGH is based on a cluster of Dell servers from Dell India Private Ltd. The master and login nodes are Dell PowerEdge R640 servers, and the compute nodes are Dell PowerEdge C6420 servers with Intel Xeon Gold 6138 2.0 GHz processors. The system was implemented by Locuz Enterprise Solutions Ltd. together with the partner companies Dell and DDN.
The total setup comprises 256 nodes of Dell PowerEdge C6420 and PowerEdge R640 hardware, covering the master and login nodes as well as high-memory and normal-memory compute nodes. Two login nodes are provided for users to log in and submit jobs. Two master nodes are configured in high-availability active/passive mode. The DDN Lustre parallel file system is configured on DDN storage across 6 servers. The network consists of 7 Dell GigE switches for OS communication and hardware management, and 16 Mellanox EDR InfiniBand switches configured for MPI job communication and to serve the Lustre file system to all compute nodes.
The compute nodes differ in their architecture; the compute nodes are listed below according to their architecture.
A total of 7 Gigabit Ethernet switches are used for OS communication and hardware management. Gigabit Ethernet requires no additional modules or libraries. All Ethernet switches are interconnected over 1G Ethernet.
A total of 16 InfiniBand switches are configured in a 2:1 fat-tree topology. EDR InfiniBand is a high-performance switched fabric characterized by high throughput and low latency.
Logical Connectivity
4 Spine Switches
12 Leaf Switches
2:1 Fat-Tree Topology
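The fabric topology can be cross-checked from any node with an active InfiniBand port; a minimal sketch, assuming the standard infiniband-diags tools are installed (part of the OFED/MLNX_OFED stack):
$ ibswitches -> list all switch nodes discovered on the fabric
$ iblinkinfo -> show per-port link state, width and speed
$ ibnetdiscover > topology.out -> dump the full subnet topology for comparison against the spine/leaf design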
Management IP 10.1.129.3/23
InfiniBand IP 10.3.2.3/23
Hostname ada002
Management IP 10.1.129.4/23
InfiniBand IP 10.3.2.4/23
Partition name   Size
/boot            1 GB
/                1.3 TB
swap             64 GB
/var             600 GB
/tmp             300 GB
/shared          4.4 TB (iSCSI storage in active/passive mode for HA)
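The layout in the table above can be verified on the node itself; a minimal sketch using standard Linux tools (device names in the output are site-specific):
$ lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT -> show block devices, sizes and mount points
$ df -hT / /boot /var /tmp -> confirm sizes and file system types of the listed partitions
$ swapon --show -> confirm the 64 GB swap area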
Management IP 10.1.129.5/16
InfiniBand IP 10.3.2.5/16
Hostname ada004
InfiniBand IP 10.3.2.6/16
Partition name   Size
/boot            1 GB
/                1.3 TB
swap             64 GB
/var             600 GB
/tmp             300 GB
/shared          4.4 TB (iSCSI storage in active/passive mode for HA)
(Note: the management IP is configured in shared mode in the BIOS for all compute nodes.)
IB HA (mgmt0) 10.1.129.40
IB Management 1 10.1.129.41
IB Management 2 10.1.129.42
Name Management IP
Gig switch1 (Gsw01) 10.1.130.1
Gig switch2 (Gsw02) 10.1.130.2
Gig switch3 (Gsw03) 10.1.130.3
Gig switch4 (Gsw04) 10.1.130.4
Gig switch5 (Gsw05) 10.1.130.5
Gig switch6 (Gsw06) 10.1.130.6
Gig switch7 (Gsw07) 10.1.130.7
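Reachability of the switch management interfaces can be checked from a master node with a simple ping sweep over the addresses listed above; a minimal sketch:
$ for i in 1 2 3 4 5 6 7; do ping -c1 -W1 10.1.130.$i > /dev/null && echo "Gsw0$i up" || echo "Gsw0$i DOWN"; done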
Ganana HPC Cluster Manager makes it easier for administrators to build Linux-based HPC clusters and to manage them on any x64 hardware. Its web-based portal gives administrators a flexible, feature-rich way to interact with their HPC cluster or grid. By reducing the building and management of compute node images, management and monitoring packages, middleware software, and post-installation activities to the click of a button, it saves a large amount of time on repeated tasks across all kinds of HPC environments.
The following key services for cluster operations are always running on both head nodes:
ganana: the cluster manager daemon service, which should be running in active/passive mode on both master nodes.
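The daemon and the surrounding HA stack can be checked on either master; a minimal sketch, assuming the daemon is exposed as a systemd unit and a PCS-managed resource named ganana (the exact unit and resource names may differ on this cluster):
$ systemctl status ganana -> daemon state on the local master (assumed unit name)
$ pcs status -> overall cluster, node and resource state
$ pcs resource status -> which master currently holds the active resources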
Figure: OS images
Figure: iSCSI storage with multipath, HA-LVM, and PCS
We use iSCSI block storage for sharing data in active/passive mode between the two master nodes.
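The iSCSI and multipath layers underneath the HA storage can be inspected on the active master; a minimal sketch (session, device and volume names are site-specific):
$ iscsiadm -m session -> list active iSCSI sessions to the storage
$ multipath -ll -> show multipath devices and path health
$ lvs -> list the HA-LVM logical volumes
$ df -h /shared -> confirm /shared is mounted from the iSCSI LUN on the active master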
Web Interface
The /shared directory is the common directory containing the ganana cluster configuration required on both nodes for HA.
Authentication
The NIS server is configured with primary and secondary server roles to meet the high-availability requirement.
NIS Server:
10.1.2.20 Floating IP Address
10.1.2.3 Primary Server
10.1.2.4 Secondary Server
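NIS binding on a login or compute node can be verified with the standard ypbind client tools; a minimal sketch:
$ nisdomainname -> show the configured NIS domain
$ ypwhich -> show which NIS server the node is currently bound to
$ ypcat passwd | head -> confirm that user maps are being served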
InfiniBand is a special type of networking fabric with very low latency compared to standard Ethernet-based networks. It enables larger-scale MPI jobs that can spread over several nodes.
The Mellanox EDR 100 Gb/s InfiniBand switch provides the highest-performing fabric solution in a 1U form factor, delivering up to 7.2 Tb/s of non-blocking bandwidth with 90 ns port-to-port latency.
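On each node, the HCA link state and the EDR 100 Gb/s rate can be confirmed as follows; a minimal sketch, assuming the Mellanox OFED utilities are installed:
$ ibstat -> port State should be Active and Rate should be 100 for an EDR link
$ ibv_devinfo -v | grep -E 'state|active_width|active_speed' -> same information from the verbs layer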
A total of 7 Dell X1052 Ethernet switches are set up and configured in a bus topology.
Specification:
48 GbE ports per switch
4 x 10Gb SFP+ ports per switch
Gsw01 10.1.130.1
Gsw02 10.1.130.2
Gsw03 10.1.130.3
Gsw04 10.1.130.4
Gsw05 10.1.130.5
Gsw06 10.1.130.6
Gsw07 10.1.130.7
Document: https://www.dell.com/support/home/in/en/inbsd1/product-support/product/networking-x1000-series/docs
Intel MKL
Intel IPP
Intel VTune Analyzer
$ module load <modulename>
$ icc -v -> check icc version
$ which icc -> check icc path
$ which mpirun -> check mpirun path
Intel MPI
module
  (no arguments)        print usage instructions
  avail or av           list available software modules
  whatis                as above, with brief descriptions
  load <modulename>     add a module to your environment
  unload <modulename>   remove a module
  purge                 remove all modules
The modules loaded into the user's environment can be seen with:
$ module list
To check available modules:
$ module avail
To use the MPICH implementation built with GCC:
$ module add mpich/ge/gcc
To use an application, the correct module needs to be loaded in the current working shell or in the PBS Pro job submission script.
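A minimal PBS Pro submission script illustrating this is sketched below; the job name, queue name, and resource request are placeholders and must be adapted to the actual cluster configuration:
#!/bin/bash
# Example PBS Pro job script (queue and resource request are placeholders)
#PBS -N mpich_test
#PBS -q workq
#PBS -l select=2:ncpus=40:mpiprocs=40
# Load the required module inside the job's shell
module load mpich/ge/gcc
# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
mpirun ./a.out
Submit the script with:
$ qsub job.sh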
Cluster Power ON Procedure
1. All PDUs, switches, and DDN storage should be powered on without any errors before powering on the master node.
2. Power on master node ada001 manually; wait 5 minutes for it to power on properly and 2 additional minutes to allow Lustre and NFS to mount automatically.
4. Check that all the license servers are up and running (lmgrd and flexlm).
5. Power on the secondary master node ada002; wait 5 minutes for it to power on properly and 2 additional minutes to allow Lustre and NFS to mount automatically.
6. Check pcs status on node ada001; both nodes should be in standby mode and resources should be in the disabled state.
7. Un-standby ada001 and ada002 with a time difference of 2 minutes: execute "pcs cluster unstandby ada001-eth2; sleep 120; pcs cluster unstandby ada002-eth2" on ada001.
9. Power on all login nodes and compute nodes with racadm or manually, and wait until all nodes have booted properly.
10. Verify the Lustre and NFS mounts on the compute and login nodes using the commands below.
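The verification referred to in step 10 can be done with standard mount and df checks; a minimal sketch (the exact Lustre and NFS mount points are site-specific):
$ mount -t lustre -> list Lustre mounts on the node
$ mount -t nfs,nfs4 -> list NFS mounts on the node
$ df -hT | grep -E 'lustre|nfs' -> confirm the file systems are mounted with the expected sizes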
Cluster Power OFF Procedure
6. Power off all login nodes, except the ada001 and ada002 nodes.
7. Perform the standby action for ada002 and ada001 with a gap of 1 minute.
8. Stop the PBS service on ada002 and ada001 with a gap of 1 minute.
9. Sync all the I/O operations on the system.
10. Drop the server caches by issuing "echo 3 > /proc/sys/vm/drop_caches".
11. Kill all the PIDs linked to the mounted file systems.
12. Unmount the NFS and Lustre file systems.
13. Remove the Lustre modules with lustre_rmmod.
14. Power off ada002 and ada001.
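Steps 9 to 13 can be carried out on each master with a command sequence along the following lines; this is a hedged sketch, and the /home and /lustre mount points are placeholders for the actual NFS and Lustre mount points:
# flush dirty pages and drop caches (steps 9 and 10)
sync
echo 3 > /proc/sys/vm/drop_caches
# kill any processes still using the mounted file systems (step 11); paths are placeholders
fuser -km /home /lustre
# unmount the NFS and Lustre file systems (step 12)
umount /home
umount /lustre
# unload the Lustre kernel modules (step 13)
lustre_rmmod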