Integrated monitoring of the ATLAS online computing farm

Ballestrero, Sergio; Gament, Costin-Eugen; Twomey, Matthew Shaun; Lee, Christopher; Fazio, Daniel; Scannicchio, Diana; Brasolin, Franco

ATLAS Slides
Report number	ATL-DAQ-SLIDE-2016-765
Title	Integrated monitoring of the ATLAS online computing farm
Author(s)	Ballestrero, Sergio (University of Johannesburg, Department of Physics) ; Brasolin, Franco (INFN Bologna and Universita' di Bologna, Dipartimento di Fisica e Astronomia) ; Fazio, Daniel (CERN) ; Gament, Costin-Eugen (University Politehnica Bucharest) ; Lee, Christopher (University of Cape Town) ; Scannicchio, Diana (University of California, Irvine) ; Twomey, Matthew Shaun (Department of Physics, University of Washington, Seattle)
Corporate author(s)	The ATLAS collaboration
Collaboration	ATLAS Collaboration
Submitted to	22nd International Conference on Computing in High Energy and Nuclear Physics, CHEP 2016, San Francisco, USA, 10 - 14 Oct 2016
Submitted by	[email protected] on 04 Oct 2016
Subject category	Particle Physics - Experiment
Accelerator/Facility, Experiment	CERN LHC ; ATLAS
Free keywords	Monitoring
Abstract	The online farm of the ATLAS experiment at the LHC, consisting of nearly 4000 PCs with various characteristics, provides configuration and control of the detector and performs the collection, processing, selection and conveyance of event data from the front-end electronics to mass storage. The status and health of every host must be constantly monitored to ensure the correct and reliable operation of the whole online system. This is the first line of defense, which should not only promptly provide alerts in case of failure but, whenever possible, warn of impending issues. The monitoring system should be able to check up to 100,000 health parameters and provide alerts on a selected subset. In this paper we present the implementation and validation of our new monitoring and alerting system based on Icinga 2 and Ganglia. We describe how the load distribution and high availability features of Icinga 2 allowed us to have a centralised but scalable system, with a configuration model that allows full flexibility while still guaranteeing complete farm coverage. Finally, we cover the integration of Icinga 2 with Ganglia and other data sources, such as SNMP for system information and IPMI for hardware health.

Record created 2016-10-04, last modified 2016-12-20

Full text:

PDF

External link:

Original Communication (restricted to ATLAS)

