Cluster Stack Basics

This document provides an overview of cluster stack basics, including:
- A cluster approach uses shared filesystems, job management, dedicated compute nodes, and a consistent environment across nodes interconnected with a low-latency network.
- Key components of a cluster include basic network services like NTP and DNS, shared storage like NFS, logging, licensing, databases, and specialized components like a job manager and parallel storage.
- The document discusses various aspects of cluster networking, interconnect technologies, parallel filesystems, and cluster management software.

Linux Clusters Institute:
Cluster Stack Basics
Brett Zimmerman, University of Oklahoma
Senior Systems Analyst, OU Supercomputing Center for Education and Research (OSCER)

A Bunch of Computers
Users can login to any node
Filesystems aren't shared between nodes
Work is run wherever you can find space
Nodes maintained individually

4-8 August 2014

What's wrong with a bunch of nodes?

Competition for resources
Size and type of problem is limited
Nodes get out of sync
Problems for users
Difficulty in management


Cluster Approach
Shared filesystems
Job management
Nodes dedicated to compute
Consistent environment
Interconnect


What's right about the cluster approach?

Easier to use
Maximize efficiency
Can do bigger and better problems
Nodes can be used cooperatively


The Types of Nodes


Login

Users login here


Compiling
Editing
Submitting and monitoring jobs

Compute

Users might login here


Run jobs as directed by the scheduler

Support

Users don't login here

Do all the other stuff


What a cluster needs -- the mundane

Network services -- NTP, DNS, DHCP
Shared Storage -- NFS
Logging -- Consolidated Syslog as a starting point
Licensing -- FlexLM and the like
Database -- User and Administrative Data
Boot/Provisioning -- PXE, build system
Authentication -- LDAP


What a cluster needs -- Specialized


Interconnect -- An ideally low-latency network
Job manager -- Resource manager/scheduler
Parallel Storage -- Get around the limitations of NFS


Network Services

NTP -- Network Time Protocol, provides clock synchronization across all nodes in the cluster
DHCP -- Dynamic Host Configuration Protocol, allows central configuration of host networking
DNS -- Provides name to address translation for the cluster
NFS -- Basic UNIX network filesystem
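
As a small illustration of name to address translation, the sketch below generates hosts-file style entries for a run of compute nodes. The node prefix "c", the 10.1.0.0/24 subnet, and the cluster.example domain are hypothetical values picked for the example. A real cluster would normally serve this data from DNS and DHCP and not from a generated file.

```python
import ipaddress
import itertools


def hosts_entries(prefix, count, network, domain):
    """Return hosts-file lines mapping sequential addresses to node names."""
    net = ipaddress.ip_network(network)
    # Take only as many usable host addresses as we need.
    addresses = list(itertools.islice(net.hosts(), count))
    if len(addresses) < count:
        raise ValueError("network too small for requested node count")
    lines = []
    for index, address in enumerate(addresses, start=1):
        name = f"{prefix}{index:03d}"  # hypothetical naming scheme: c001, c002, ...
        lines.append(f"{address}\t{name}.{domain} {name}")
    return lines


if __name__ == "__main__":
    for line in hosts_entries("c", 3, "10.1.0.0/24", "cluster.example"):
        print(line)
```

Generating the name and address table from one function keeps the two in step, which is the same consistency that central DNS and DHCP are meant to provide.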


Logging
Syslog

The classic system for UNIX logging
Application has to opt to emit messages

Monitoring

Active monitoring to catch conditions elective monitoring doesn't catch
Resource manager
Nagios/Cacti/Zabbix/Ganglia

IDS

Intrusion detection
Monitoring targeting misuse/attacks on the cluster
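
To make the syslog side more concrete, the sketch below computes the priority value that prefixes a syslog message: the facility code multiplied by 8, plus the severity code. The message layout is deliberately simplified (no timestamp), and the host name c001 and tag jobd in the usage example are hypothetical.

```python
# A subset of the standard syslog facility and severity codes.
FACILITIES = {"kern": 0, "user": 1, "daemon": 3, "auth": 4, "local0": 16}
SEVERITIES = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}


def syslog_priority(facility, severity):
    """Combine facility and severity into the syslog PRI value."""
    return FACILITIES[facility] * 8 + SEVERITIES[severity]


def format_message(facility, severity, host, tag, text):
    """Build a simplified syslog line (timestamp omitted for brevity)."""
    return f"<{syslog_priority(facility, severity)}>{host} {tag}: {text}"


if __name__ == "__main__":
    # Hypothetical host and tag names.
    print(format_message("daemon", "err", "c001", "jobd", "node down"))
```

A consolidated syslog server can filter on this value, for example forwarding only messages at severity err or worse to an alerting system.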


Basic services, continued

Licensing -- FlexNet/FlexLM or equivalent, mediates access to a pool of shared licenses.
Database -- Administrative use for logging/monitoring, dynamic configuration. Requirements of user software.
Boot/Provisioning -- For example PXE/Cobbler, PXE/Image or part of a cluster management suite
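
The idea of mediating access to a pool of shared licenses can be pictured with a toy counter. This is only an illustrative model: the class and method names are hypothetical, and it does not reflect the FlexNet/FlexLM protocol or API.

```python
class LicensePool:
    """Toy model of a fixed pool of floating licenses."""

    def __init__(self, total):
        self.total = total
        self.holders = {}  # job id -> number of licenses held

    def available(self):
        return self.total - sum(self.holders.values())

    def checkout(self, job_id, count=1):
        """Grant licenses if enough are free; return True on success."""
        if count > self.available():
            return False
        self.holders[job_id] = self.holders.get(job_id, 0) + count
        return True

    def checkin(self, job_id):
        """Return all licenses held by a job; report how many were freed."""
        return self.holders.pop(job_id, 0)


if __name__ == "__main__":
    pool = LicensePool(total=2)
    print(pool.checkout("job-a"), pool.checkout("job-b"), pool.checkout("job-c"))
```

In practice the job manager often needs to know about license availability as well, so that jobs are not started only to fail at checkout.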


Authentication
Flat files -- passwd, group, shadow entries
NIS -- network access to central flat files
LDAP -- Read/Write access to a dynamic tree structure of account and other information
Host equivalency
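
For the flat-file case, a passwd entry is a single line of seven colon-separated fields. The sketch below parses one such line; the account alice is a made-up example, and the function name is hypothetical.

```python
from typing import NamedTuple


class PasswdEntry(NamedTuple):
    name: str
    password: str  # usually "x" when the hash lives in the shadow file
    uid: int
    gid: int
    gecos: str
    home: str
    shell: str


def parse_passwd_line(line):
    """Split one passwd line into its seven fields."""
    fields = line.rstrip("\n").split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    name, password, uid, gid, gecos, home, shell = fields
    return PasswdEntry(name, password, int(uid), int(gid), gecos, home, shell)


if __name__ == "__main__":
    print(parse_passwd_line("alice:x:1001:1001:Alice Example:/home/alice:/bin/bash"))
```

NIS and LDAP serve the same kind of account record over the network, which avoids having to copy these files to every node.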


Cluster Networking
Hardware Management -- Lights out management
External -- Public interfaces to the cluster
Internal -- General node to node communication
Storage -- Access to network filesystems
Interconnect -- high-speed, low-latency for multi-node jobs
Some of these can share a medium


Interconnect
In the most recent Top 500 list (http://top500.org)
there were 224 installations relying on Infiniband,
100 using Gigabit Ethernet, and 88 using 10 Gigabit
Ethernet

Ethernet -- Latency of 50-125 µs (GbE), 5-50 µs (10GbE), ~5 µs RoCEE
Infiniband -- Latency of 1.3 µs (QDR), 0.7 µs (FDR-10/FDR), 0.5 µs (EDR)
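
To see why these latencies matter for multi-node jobs, the sketch below estimates the time spent on latency alone for a given number of small, sequential message exchanges, reading the figures above as microseconds. It is a back-of-the-envelope model that ignores bandwidth, message size, and any overlap of communication with computation. The message count is an arbitrary example value.

```python
# Latency figures from the slide, in microseconds (low end of each range).
LATENCY_US = {
    "GbE": 50.0,
    "10GbE": 5.0,
    "Infiniband QDR": 1.3,
    "Infiniband FDR": 0.7,
    "Infiniband EDR": 0.5,
}


def latency_cost_seconds(messages, latency_us):
    """Total seconds spent waiting on latency for sequential messages."""
    return messages * latency_us / 1_000_000


if __name__ == "__main__":
    messages = 1_000_000  # hypothetical number of small exchanges in a job
    for fabric, latency in LATENCY_US.items():
        print(f"{fabric}: {latency_cost_seconds(messages, latency):.2f} s")
```

Under this model, a job that exchanges many small messages pays for latency on every exchange, so the gap between GbE and Infiniband grows in proportion to the message count.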


Parallel Filesystem
Lustre - http://lustre.org/
PanFS - http://www.panasas.com/
GPFS - http://www-03.ibm.com/software/products/en/software

Parallel filesystems take the general approach of
separating filesystem metadata from the storage.
Lustre and PanFS have dedicated nodes for metadata
(MDS or director blades). GPFS distributes metadata
throughout the cluster
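
The split between metadata and storage can be illustrated with a toy striping function. The layout it returns stands in for what a metadata service would record, while the data itself would live on the named storage targets. This is a simplified model with hypothetical target names, not the actual layout logic of Lustre, PanFS, or GPFS.

```python
def stripe_layout(file_size, stripe_size, targets):
    """Return (offset, length, target) extents, round-robin across targets."""
    if stripe_size <= 0:
        raise ValueError("stripe_size must be positive")
    if not targets:
        raise ValueError("at least one storage target is required")
    layout = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(stripe_size, file_size - offset)
        layout.append((offset, length, targets[index % len(targets)]))
        offset += length
        index += 1
    return layout


if __name__ == "__main__":
    # Hypothetical 10-byte file striped in 4-byte chunks over two targets.
    for extent in stripe_layout(10, 4, ["target0", "target1"]):
        print(extent)
```

Because consecutive stripes land on different targets, several clients (or one client reading a large file) can pull data from multiple storage servers at once, which is the limitation of a single NFS server that parallel filesystems work around.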

Cluster Management
Automates the building of a cluster
Some way to easily maintain cluster system
consistency
The ability to automate cluster maintenance tasks
Offer some way to monitor cluster health and
performance


Cluster Management Software

The resource manager knows the state of the various resources on the
cluster and maintains a list of the jobs that are requesting resources

The scheduler, using the information from the resource manager,
selects jobs from the queue for execution

Rocks (http://www.rocksclusters.org/wordpress/)
Bright Cluster Manager (http://www.brightcomputing.com/Bright-Cluster-Manager)
xCAT (Extreme Cluster/Cloud Administration Toolkit) (http://sourceforge.net/p/xcat/wiki/Main_Page/)


Configuration Management
While it is true that booting with a central boot server can make it
easier to make sure the OS on each compute node (or, at least, each
type of compute node) has an identical setup/install, there are still
files which wind up being more dynamic. Some such files are
password/group/shadow and hosts files.

Rsync
Cfengine
Chef
Puppet
Salt
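
One way to picture the consistency problem these tools address is to compare the contents of a dynamic file across nodes and flag the ones that differ from the majority. The sketch below does this with SHA-256 digests. It assumes the file contents have already been collected from each node, and the node names are hypothetical. The tools listed above do considerably more, including correcting the drift.

```python
import hashlib
from collections import Counter


def digest(content):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(content).hexdigest()


def find_drift(contents_by_node):
    """Return the sorted names of nodes whose file differs from the majority."""
    digests = {node: digest(data) for node, data in contents_by_node.items()}
    if not digests:
        return []
    # Treat the most common digest as the reference copy.
    reference, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(node for node, value in digests.items() if value != reference)


if __name__ == "__main__":
    collected = {
        "c001": b"10.1.0.1 c001\n",
        "c002": b"10.1.0.1 c001\n",
        "c003": b"10.1.0.9 c001\n",  # stale copy
    }
    print(find_drift(collected))
```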


Software Installation and Management

All Linux distros have some sort of package management tool. For
Redhat/CentOS/Scientific based clusters, this is rpm and yum. Debian
has dpkg and apt.

In any case, pre-packaged software tends to assume that it is going to
be installed in a specific place on the machine and that it will be the
only version of that software on the machine.

On a cluster, it may be necessary to look at software installation
differently from a standard Linux machine

Install to global filesystem
Keep boot image as small as possible
Maintain multiple versions


Software installation and management

There are a couple of tools useful for navigating the difficulties of
maintaining user environments when dealing with multiple versions of
software or software in non-standard locations.
SoftEnv (http://www.lcrc.anl.gov/info/Software/Softenv)
Useful for packaging static user environment required by packages
Modules (http://modules.sourceforge.net/)
Can be used to make dynamic changes to a user's environment.
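
At its core, a dynamic environment change of this kind amounts to editing variables such as PATH so that a chosen version of a package is found first. The sketch below shows that single step on a copy of an environment dictionary. It is a simplified illustration, not the Modules implementation, and the install path /opt/apps/gcc/4.9/bin is hypothetical.

```python
import os


def prepend_path(env, variable, directory):
    """Return a copy of env with directory moved to the front of variable."""
    updated = dict(env)
    current = updated.get(variable, "")
    # Drop empty entries and any earlier occurrence of the same directory.
    parts = [p for p in current.split(os.pathsep) if p and p != directory]
    updated[variable] = os.pathsep.join([directory] + parts)
    return updated


if __name__ == "__main__":
    base = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
    loaded = prepend_path(base, "PATH", "/opt/apps/gcc/4.9/bin")
    print(loaded["PATH"])
```

Working on a copy makes the change easy to undo, which matters when users switch between multiple versions of the same software.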


Resource Manager/Scheduler
Accepts job submissions, maintains a queue of jobs
Allocates nodes/resources and starts jobs on
compute nodes
Schedules waiting jobs
Available options
SGE (Sun Grid Engine)
LSF / Openlava (Load Sharing Facility)
PBS (Portable Batch System)
OpenPBS
Torque

SLURM
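
The division of labor between a resource manager and a scheduler can be sketched with a toy model: the resource manager tracks free nodes and the job queue, and the scheduler picks jobs from the queue to start. The class and function names here are hypothetical, the policy is strict first-in-first-out, and none of this reflects the internals of the products listed above.

```python
from collections import deque


class ResourceManager:
    """Tracks free nodes, queued jobs, and running allocations."""

    def __init__(self, nodes):
        self.free_nodes = set(nodes)
        self.queue = deque()  # (job_id, nodes_needed) in submission order
        self.running = {}     # job_id -> set of allocated nodes

    def submit(self, job_id, nodes_needed):
        self.queue.append((job_id, nodes_needed))

    def release(self, job_id):
        """Return a finished job's nodes to the free pool."""
        self.free_nodes |= self.running.pop(job_id)


def schedule_fifo(manager):
    """Start queued jobs in order until the head of the queue does not fit."""
    started = []
    while manager.queue and manager.queue[0][1] <= len(manager.free_nodes):
        job_id, needed = manager.queue.popleft()
        allocation = set(sorted(manager.free_nodes)[:needed])
        manager.free_nodes -= allocation
        manager.running[job_id] = allocation
        started.append(job_id)
    return started


if __name__ == "__main__":
    rm = ResourceManager(["c001", "c002", "c003"])
    rm.submit("j1", 2)
    rm.submit("j2", 2)
    rm.submit("j3", 1)
    print(schedule_fifo(rm))
```

With strict FIFO, a small job can sit behind a larger one even when a node is idle. Production schedulers typically add priorities and backfill to reduce that waste.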

Best Practices
Here is a quick overview of the general functions to
secure a cluster
Risk Avoidance
Deterrence
Prevention
Detection
Recovery

The priority of these will depend on your security
approach

Risk Avoidance
Provide the minimum of services necessary
Grant the least privileges necessary
Install the minimum software necessary
The simpler the environment, the fewer the vectors
available for attack.


Deterrence
Limit the discoverability of the cluster
Publish acceptable use policies

Prevention
Fix known issues (patching)
Configure services for minimal functionality
Restrict user access and authority
Document actions and changes


Detection
Monitor the cluster
Integrate feedback from the users
Set alerts and automated response

Recovery
Backups
Documentation
Define acceptable loss
