Distributed Machine Learning Workflow with PanDA and iDDS in LHC ATLAS

Guan, Wen; Lin, Fa-Hui; Korchuganova, Tatiana; Zhang, Rui; Barreiro Megino, Fernando Harald; Maeno, Tadashi; Alekseev, Aleksandr; Zhao, Xin; Yang, Zhaoyu; Weber, Christian; De, Kaushik; Karavakis, Edward; Wenaus, Torre; Klimentov, Alexei; Nilsson, Paul

doi:10.1051/epjconf/202429504019

ATLAS Note
Report number	ATL-SOFT-PROC-2023-010
Title	Distributed Machine Learning Workflow with PanDA and iDDS in LHC ATLAS
Author(s)	Guan, Wen (Brookhaven National Laboratory (US)) ; Maeno, Tadashi (Brookhaven National Laboratory (US)) ; Zhang, Rui (University of Wisconsin Madison (US)) ; Weber, Christian (Brookhaven National Laboratory (US)) ; Wenaus, Torre (Brookhaven National Laboratory (US)) ; Alekseev, Aleksandr (University of Texas at Arlington (UTA)) ; Barreiro Megino, Fernando Harald (University of Texas at Arlington (US)) ; De, Kaushik (University of Texas at Arlington (US)) ; Karavakis, Edward (Brookhaven National Laboratory (US)) ; Klimentov, Alexei (Brookhaven National Laboratory (US)) ; Korchuganova, Tatiana (University of Pittsburgh (US)) ; Lin, Fa-Hui (University of Texas at Arlington (US)) ; Nilsson, Paul (Brookhaven National Laboratory (US)) ; Yang, Zhaoyu (Brookhaven National Laboratory (US)) ; Zhao, Xin (Brookhaven National Laboratory (US))
Corporate Author(s)	The ATLAS collaboration
Publication	2024
Imprint	21 Aug 2023
Number of pages	6
In:	EPJ Web Conf. 295 (2024) 04019
In:	26th International Conference on Computing in High Energy & Nuclear Physics, Norfolk, Virginia, US, 8-12 May 2023
DOI	10.1051/epjconf/202429504019
Subject category	Particle Physics - Experiment
Accelerator/Facility, Experiment	CERN LHC ; ATLAS
Abstract	Machine Learning (ML) has become an important tool for High Energy Physics analysis. As dataset sizes increase at the Large Hadron Collider (LHC) and search spaces grow in order to exploit the physics potential, more and more computing resources are required to process these ML tasks. In addition, complex advanced ML workflows are being developed in which one task may depend on the results of previous tasks. How to make use of the vast distributed CPUs/GPUs in the WLCG for these large, complex ML tasks has become an active area of interest. In this paper, we present our efforts to enable the execution of distributed ML workflows on the Production and Distributed Analysis (PanDA) system and the intelligent Data Delivery Service (iDDS). First, we describe how PanDA and iDDS handle large-scale ML workflows, including the implementation for processing workloads on diverse and geographically distributed computing resources. Next, we report real-world use cases, such as HyperParameter Optimization, Monte Carlo Toy confidence limits calculation, and Active Learning. Finally, we conclude with future plans.

Record created 2023-08-21, last modified 2024-12-03
