Author(s)
Bauer, Gerry (MIT) ; Behrens, Ulf (DESY) ; Bouffet, Olivier (CERN) ; Bowen, Matthew (CERN) ; Branson, James (UC, San Diego) ; Bukowiec, Sebastian Czeslaw (CERN) ; Ciganek, Marek (CERN) ; Cittolin, Sergio (UC, San Diego) ; Coarasa Perez, Jose Antonio (CERN) ; Deldicque, Christian (CERN) ; Dobson, Marc (CERN) ; Dupont, Aymeric (CERN) ; Erhan, Samim (UCLA) ; Flossdorf, Alexander (DESY) ; Gigi, Dominique (CERN) ; Glege, Frank (CERN) ; Gomez-Reino Garrido, Robert (CERN) ; Hartl, Christian (CERN) ; Hegeman, Jeroen Guido (Princeton U.) ; Holzner, Andre Georg (UC, San Diego) ; Hwong, Yi Ling (CERN) ; Masetti, Lorenzo (CERN) ; Meijers, Franciscus (CERN) ; Meschi, Emilio (CERN) ; Mommsen, Remigius (Fermilab) ; O'Dell, Vivian (Fermilab) ; Orsini, Luciano (CERN) ; Paus, Christoph Maria Ernst (MIT) ; Petrucci, Andrea (CERN) ; Pieri, Marco (UC, San Diego) ; Polese, Giovanni (CERN) ; Racz, Attila (CERN) ; Raginel, Olivier (MIT) ; Sakulin, Hannes (CERN) ; Sani, Matteo (UC, San Diego) ; Schwick, Christoph (CERN) ; Shpakov, Denis (Fermilab) ; Simon, Michal (CERN) ; Spataru, Andrei Cristian (CERN) ; Sumorok, Konstanty (MIT)
Abstract
The CMS experiment's online cluster consists of 2300 computers and 170 switches or routers operating on a 24-hour basis. This huge infrastructure must be monitored in such a way that the administrators are proactively warned of any failure or degradation in the system, in order to avoid or minimize downtime that can lead to a loss of data taking. The number of metrics monitored per host varies from 20 to 40 and ranges from basic host checks (disk, network, load) to application-specific checks (service running), in addition to hardware monitoring. The sheer number of hosts and of checks per host stretches the limits of many monitoring tools and requires careful use of various configuration optimizations in order to work reliably. The initial monitoring system used in the CMS online cluster was based on Nagios, but it suffered from various drawbacks and did not work reliably in the expanded cluster. The CMS cluster administrators investigated the different open-source tools available and chose Icinga, a fork of Nagios, together with several plugin modules that enhance its scalability. The Gearman module provides a queuing system for all checks and their results, allowing easy load balancing across worker nodes. Supported modules allow several checks to be grouped into a single request, thereby significantly reducing the network overhead of running a set of checks on a group of nodes. The PNP4Nagios module provides the graphing capability of Icinga, storing the time series in round-robin database (RRD) files. Additional software (rrdcached) optimizes access to the RRD files and is vital to support the required number of operations. Furthermore, to make the best use of the monitoring information and to notify the appropriate communities of any issues with their systems, much work was put into grouping the checks according to, for example, the function of the machine, the services running on it, the sub-detector to which it belongs, and the criticality of the computer. An automated system to generate the configuration of the monitoring system has been produced to facilitate its evolution and maintenance. The use of these performance-enhancing modules and the work on grouping the checks have yielded impressive performance improvements over the previous Nagios infrastructure, allowing many more metrics per second to be monitored. Furthermore, the design allows the infrastructure to grow easily without the need to rethink the monitoring system as a whole.
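
The Gearman-based queuing described in the abstract decouples the scheduling of checks from their execution: the monitoring core enqueues checks, worker nodes pull and run them, and the results come back through a result queue. Below is a minimal sketch of that pattern using the python-gearman client library rather than mod_gearman itself; the job-server address, the queue name check_queue, and the plugin command are assumptions, and mod_gearman's actual payload format and dedicated host/service/result queues are not reproduced here.

    # Sketch of queue-based check distribution in the spirit of mod_gearman.
    # Assumptions: a gearmand job server on localhost:4730, the python-gearman
    # library on the worker nodes, and an illustrative queue name 'check_queue'.
    import subprocess

    import gearman  # python-gearman client library

    GEARMAND = ['localhost:4730']  # assumed gearmand job-server address


    def run_check(worker, job):
        # Worker callback: the job payload is assumed to be a plugin command line.
        proc = subprocess.Popen(job.data, shell=True,
                                stdout=subprocess.PIPE,
                                universal_newlines=True)
        output = proc.communicate()[0]
        # Nagios/Icinga plugin convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
        return '%d %s' % (proc.returncode, output.strip())


    def start_worker():
        worker = gearman.GearmanWorker(GEARMAND)
        worker.register_task('check_queue', run_check)  # queue name is illustrative
        worker.work()  # blocks, executing checks as they are queued


    def submit_check(command):
        client = gearman.GearmanClient(GEARMAND)
        return client.submit_job('check_queue', command).result

Any node running start_worker() becomes an additional consumer of the same queue, which is where the easy load balancing comes from.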
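
The grouping of several checks into a single request can be pictured as an aggregating plugin: one invocation runs a set of local checks and reports a single combined status, so the per-request overhead is paid once per host rather than once per check. The sketch below illustrates the idea only; the child plugin paths and thresholds are illustrative assumptions, not the module configuration used in the cluster.

    #!/usr/bin/env python
    # Sketch of an aggregated "multi-check": run several local plugins in one
    # invocation and report a single combined Nagios/Icinga status.
    import subprocess
    import sys

    # Illustrative child checks; a real deployment would generate this list.
    CHILD_CHECKS = {
        'load': '/usr/lib64/nagios/plugins/check_load -w 5,4,3 -c 10,8,6',
        'disk': '/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% /',
        'swap': '/usr/lib64/nagios/plugins/check_swap -w 50% -c 25%',
    }

    STATE_NAMES = {0: 'OK', 1: 'WARNING', 2: 'CRITICAL', 3: 'UNKNOWN'}


    def main():
        worst = 0
        lines = []
        for name, command in sorted(CHILD_CHECKS.items()):
            proc = subprocess.Popen(command, shell=True,
                                    stdout=subprocess.PIPE,
                                    universal_newlines=True)
            output = proc.communicate()[0].strip()
            state = proc.returncode if proc.returncode in STATE_NAMES else 3
            worst = max(worst, state)
            lines.append('%s: %s - %s' % (name, STATE_NAMES[state], output))
        # First line is the summary; the remaining lines are the long output.
        print('MULTI %s - %d sub-checks' % (STATE_NAMES[worst], len(CHILD_CHECKS)))
        for line in lines:
            print(line)
        sys.exit(worst)


    if __name__ == '__main__':
        main()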
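
The automated generation of the monitoring configuration can likewise be sketched: a host inventory annotated with function, sub-detector and other groupings is rendered into Icinga object definitions, so the configuration can be regenerated whenever the cluster evolves. The inventory entries, hostgroup names and the generic-host template below are hypothetical placeholders for whatever the real generator reads from the cluster database.

    # Sketch of automated Icinga configuration generation from a host inventory.
    # The inventory, hostgroup names and host template are hypothetical.

    # Hypothetical inventory: host name -> (IP address, function, sub-detector)
    INVENTORY = {
        'bu-01': ('10.176.0.11', 'builder-unit', 'daq'),
        'fu-01': ('10.176.0.21', 'filter-unit', 'daq'),
        'ecal-ctrl-01': ('10.176.1.31', 'detector-control', 'ecal'),
    }


    def host_definition(name, address, groups):
        # 'generic-host' is an assumed Icinga host template.
        return (
            'define host {\n'
            '    use         generic-host\n'
            '    host_name   %s\n'
            '    address     %s\n'
            '    hostgroups  %s\n'
            '}\n' % (name, address, ','.join(groups))
        )


    def hostgroup_definition(name):
        return 'define hostgroup {\n    hostgroup_name %s\n}\n' % name


    def generate(inventory):
        groups = set()
        chunks = []
        for name, (address, function, subdetector) in sorted(inventory.items()):
            host_groups = [function, subdetector]
            groups.update(host_groups)
            chunks.append(host_definition(name, address, host_groups))
        chunks.extend(hostgroup_definition(g) for g in sorted(groups))
        return '\n'.join(chunks)


    if __name__ == '__main__':
        with open('hosts-generated.cfg', 'w') as cfg:
            cfg.write(generate(INVENTORY))

Because notifications and check assignments can then be attached to the generated hostgroups (function, sub-detector, criticality), regenerating the configuration keeps the grouping of checks consistent as machines are added or repurposed.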