
ATLAS Slides
Report number ATL-SOFT-SLIDE-2015-112
Title Integration of PanDA workload management system with Titan supercomputer at OLCF
Author(s) Panitkin, Sergey (Brookhaven National Laboratory (BNL)) ; De, Kaushik (The University of Texas at Arlington) ; Klimentov, Alexei (Brookhaven National Laboratory (BNL)) ; Oleynik, Danila (Joint Institute for Nuclear Research) ; Petrosyan, Artem (Joint Institute for Nuclear Research) ; Schovancova, Jaroslava (The University of Texas at Arlington) ; Vaniachine, Alexandre (Argonne National Laboratory) ; Wenaus, Torre (Brookhaven National Laboratory (BNL))
Corporate author(s) The ATLAS collaboration
Submitted to 21st International Conference on Computing in High Energy and Nuclear Physics, Okinawa, Japan, 13 - 17 Apr 2015
Submitted by panitkin@bnl.gov on 27 Mar 2015
Subject category Particle Physics - Experiment
Accelerator/Facility, Experiment CERN LHC ; ATLAS
Free keywords ATLAS ; PanDA ; Titan ; Supercomputer ; HPC
Abstract The PanDA (Production and Distributed Analysis) workload management system (WMS) was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. While PanDA currently uses more than 100,000 cores at well over 100 Grid sites with a peak performance of 0.3 petaFLOPS, the next LHC data-taking run will require more resources than Grid computing can provide. To address this challenge, ATLAS is engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers. We will describe a project aimed at integrating the PanDA WMS with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach uses a modified PanDA pilot framework for job submission to Titan's batch queues and for local data management, with lightweight MPI wrappers to run single-threaded workloads in parallel on Titan's multi-core worker nodes. It also gives PanDA a new capability to collect, in real time, information about unused worker nodes on Titan, which allows the size and duration of jobs submitted to Titan to be defined precisely according to the available free resources. This capability significantly reduces PanDA job wait time while improving Titan's utilization efficiency. This implementation was tested with a variety of Monte Carlo workloads on Titan and is being tested on several other supercomputing platforms.
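
The idea of sizing a job to fit currently unused worker nodes can be illustrated with a minimal Python sketch. It is not the PanDA pilot code: the names BackfillSlot, JobShape, choose_job_shape and single_threaded_tasks, the selection rule (largest node-seconds product), and all numeric values are assumptions made for illustration. How the free-node information is obtained from the batch system is outside the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BackfillSlot:
    """Hypothetical record: a block of idle nodes and how long it stays free."""
    free_nodes: int
    seconds_available: int


@dataclass(frozen=True)
class JobShape:
    """Size and duration to request from the batch queue."""
    nodes: int
    walltime_seconds: int


def choose_job_shape(
    slots: Sequence[BackfillSlot],
    min_walltime_seconds: int,
    max_nodes: int,
    safety_margin_seconds: int = 0,
) -> Optional[JobShape]:
    """Pick the usable slot that offers the most node-seconds.

    A slot is usable if, after subtracting the safety margin, it is still
    long enough to run the workload. Returns None if no slot qualifies.
    """
    best: Optional[JobShape] = None
    for slot in slots:
        walltime = slot.seconds_available - safety_margin_seconds
        if slot.free_nodes <= 0 or walltime < min_walltime_seconds:
            continue  # too short or empty: skip this slot
        candidate = JobShape(min(slot.free_nodes, max_nodes), walltime)
        if best is None or (
            candidate.nodes * candidate.walltime_seconds
            > best.nodes * best.walltime_seconds
        ):
            best = candidate
    return best


def single_threaded_tasks(shape: JobShape, cores_per_node: int) -> int:
    """Number of single-threaded payloads one wrapper job could run at once,
    assuming one payload per core (an assumption of this sketch)."""
    return shape.nodes * cores_per_node


# Example with made-up numbers: only the second slot is long enough.
example_slots = [
    BackfillSlot(free_nodes=100, seconds_available=600),
    BackfillSlot(free_nodes=40, seconds_available=7200),
    BackfillSlot(free_nodes=500, seconds_available=300),
]
shape = choose_job_shape(
    example_slots,
    min_walltime_seconds=900,
    max_nodes=300,
    safety_margin_seconds=120,
)
```

In this sketch the request is shaped to fit a slot that is already free, which is the property the abstract credits with reducing job wait time. A real pilot would also have to respect queue policies that are not modelled here.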



 Record created 2015-03-27, last modified 2016-07-18

