
ATLAS Note
Report number ATL-SOFT-PROC-2017-016
Title PanDA for ATLAS distributed computing in the next decade
Author(s) Barreiro Megino, Fernando Harald (The University of Texas at Arlington) ; De, Kaushik (The University of Texas at Arlington) ; Klimentov, Alexei (Brookhaven National Laboratory (BNL)) ; Maeno, Tadashi (Brookhaven National Laboratory (BNL)) ; Nilsson, Paul (Brookhaven National Laboratory (BNL)) ; Oleynik, Danila (Joint Institute for Nuclear Research) ; Padolski, Siarhei (Brookhaven National Laboratory (BNL)) ; Panitkin, Sergey (Brookhaven National Laboratory (BNL)) ; Wenaus, Torre (Brookhaven National Laboratory (BNL))
Corporate Author(s) The ATLAS collaboration
Collaboration ATLAS Collaboration
Publication 2017
Imprint 15 Jan 2017
Number of pages 7
In: J. Phys.: Conf. Ser. 898 (2017) 052002
In: 22nd International Conference on Computing in High Energy and Nuclear Physics, CHEP 2016, San Francisco, USA, 10 - 14 Oct 2016, pp. 052002
DOI 10.1088/1742-6596/898/5/052002
Subject category Particle Physics - Experiment
Accelerator/Facility, Experiment CERN LHC ; ATLAS
Abstract The Production and Distributed Analysis (PanDA) system has been developed to meet ATLAS production and analysis requirements for a data-driven workload management system capable of operating at the Large Hadron Collider (LHC) data processing scale. Heterogeneous resources used by the ATLAS experiment are distributed worldwide at hundreds of sites, thousands of physicists analyse the data remotely, the volume of processed data is beyond the exabyte scale, dozens of scientific applications are supported, and data processing requires more than a few billion hours of computing usage per year. PanDA performed very well over the last decade, including the LHC Run 1 data-taking period. However, it was decided to upgrade the whole system concurrently with the LHC's first long shutdown in order to cope with rapidly changing computing infrastructure. After two years of reengineering efforts, PanDA has embedded capabilities for fully dynamic and flexible workload management. The static batch job paradigm was discarded in favor of a more automated and scalable model. Workloads are dynamically tailored for optimal usage of resources, with the brokerage taking network traffic and forecasts into account. Computing resources are partitioned based on dynamic knowledge of their status and characteristics. The pilot has been refactored around a plugin structure for easier development and deployment. Bookkeeping is handled at both coarse and fine granularities for efficient utilization of pledged or opportunistic resources. An in-house security mechanism authenticates the pilot and data management services in off-grid environments such as volunteer computing and private local clusters. The PanDA monitor has been extensively optimized for performance and extended with analytics to provide aggregated summaries of the system as well as drill-down to operational details. Many other improvements are planned or have recently been implemented, and the system has been adopted by non-LHC experiments, such as bioinformatics groups successfully running the Paleomix (microbial genome and metagenome) payload on supercomputers. In this paper we will focus on the new and planned features that are most important to the next decade of distributed computing workload management.
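The plugin structure of the pilot mentioned in the abstract can be illustrated with a minimal sketch. The class and function names below are hypothetical and do not come from the actual PanDA Pilot code base; they only show, under that assumption, how a plugin registry lets site-specific behaviour (here, data staging) be swapped without touching the pilot core.

```python
# Hypothetical sketch of a plugin-style pilot component registry.
# Names are illustrative only, not the real PanDA Pilot API.

from typing import Callable, Dict


class CopyTool:
    """Base interface every data-staging plugin implements."""
    def stage_in(self, src: str, dst: str) -> None:
        raise NotImplementedError


# Registry mapping a plugin name to the class implementing it.
_COPYTOOL_REGISTRY: Dict[str, Callable[[], CopyTool]] = {}


def register_copytool(name: str):
    """Decorator that registers a copytool plugin under a given name."""
    def decorator(cls):
        _COPYTOOL_REGISTRY[name] = cls
        return cls
    return decorator


@register_copytool("local")
class LocalCopyTool(CopyTool):
    """Trivial plugin: copy files on the local filesystem."""
    def stage_in(self, src: str, dst: str) -> None:
        import shutil
        shutil.copy(src, dst)


def get_copytool(name: str) -> CopyTool:
    """The pilot core asks the registry for the plugin configured at a site."""
    return _COPYTOOL_REGISTRY[name]()


if __name__ == "__main__":
    tool = get_copytool("local")
    print(type(tool).__name__)  # -> LocalCopyTool
```

With such a structure, adding support for a new storage protocol at a site amounts to registering another plugin class, which is the kind of development and deployment decoupling the abstract attributes to the refactored pilot.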
Copyright/License publication: CC-BY-3.0

Corresponding record in: Inspire


Record created 2017-01-15, last modified 2019-10-15