
Thesis
Report number CERN-THESIS-2024-174
Title Data access optimisation at CERN and in the Worldwide LHC Computing Grid (WLCG)
Translation of title Optimisation de l’accès aux données au CERN et dans la Grille de calcul mondiale pour le LHC (WLCG)
Author(s) Chuchuk, Olga (U. Cote d'Azur)
Publication 2024 - 132 p.
Thesis note PhD : U. Cote d'Azur : 2024
Thesis supervisor(s) Schulz, Markus ; Neglia, Giovanni
Note Presented 19 Feb 2024
Subject category Computing and Computers
Abstract The Worldwide LHC Computing Grid (WLCG) offers an extensive distributed computing infrastructure dedicated to the scientific community involved with CERN’s Large Hadron Collider (LHC). With storage totalling roughly an exabyte, the WLCG addresses the data processing and storage requirements of thousands of international scientists. As the High-Luminosity LHC (HL-LHC) phase approaches, the volume of data to be analysed will grow steeply, outpacing the expected gains from advances in storage technology. New approaches to effective data access and management, such as caching, therefore become essential. This thesis presents a comprehensive study of storage access within the WLCG, aiming to increase the aggregate science throughput while limiting cost. Central to the research is the analysis of real file access logs sourced from the WLCG monitoring system, reflecting genuine usage patterns.

Caching in a scientific setting has distinctive implications. Unlike commercial applications such as video streaming, scientific data caches deal with file sizes ranging from a few bytes to multiple terabytes, and the logical associations between files strongly shape user access patterns. Traditional caching research has predominantly assumed uniform object sizes and independent reference models; scientific workloads violate both assumptions. My investigations show how the LHC’s hierarchical data organisation, in particular its grouping of files into datasets, affects request patterns. Building on this, I introduce caching policies that exploit dataset-level knowledge and compare their effectiveness with traditional file-centric strategies. My findings also highlight the “delayed hits” phenomenon triggered by limited connectivity between computing and storage sites, shedding light on its repercussions for caching efficiency.

Addressing the long-standing challenge of predicting data popularity in the High Energy Physics (HEP) community, made more pressing by the storage constraints of the HL-LHC era, the research integrates Machine Learning (ML) tools. Specifically, I employ the Random Forest algorithm, well suited to large-scale data, to predict future file reuse, and present a two-stage method that feeds these predictions into cache eviction policies. Combining predictive analytics with established eviction algorithms yields a more resilient caching system for the WLCG.

In conclusion, this research underscores the importance of robust storage services and suggests a move towards stateless caches at smaller sites, alleviating complex storage management requirements and opening the path to an additional level in the storage hierarchy. Through this thesis, I aim to navigate the challenges of data storage and retrieval, developing more efficient methods that meet the evolving needs of the WLCG and its global community.
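To make the dataset-level idea concrete, below is a minimal sketch of a cache that tracks recency per dataset rather than per file. It is an illustration written for this record, not code from the thesis: the `dataset_of` mapping, the admit-on-miss rule, and eviction at whole-dataset granularity are all assumptions.

```python
from collections import OrderedDict

class DatasetAwareLRU:
    """Toy file cache that keeps LRU order per *dataset*, not per file."""

    def __init__(self, capacity_bytes, dataset_of):
        self.capacity = capacity_bytes
        self.used = 0
        self.dataset_of = dataset_of  # hypothetical file -> dataset mapping
        self.size = {}                # cached file -> size in bytes
        self.lru = OrderedDict()      # dataset -> set of its cached files

    def access(self, name, size):
        """Return True on a hit; admit the file on a miss."""
        hit = name in self.size
        if not hit:
            self.size[name] = size
            self.used += size
        # Touching one file refreshes the recency of its whole dataset.
        ds = self.dataset_of(name)
        members = self.lru.pop(ds, set())
        members.add(name)
        self.lru[ds] = members
        while self.used > self.capacity and self.lru:
            _, victims = self.lru.popitem(last=False)  # coldest dataset
            for f in victims:
                self.used -= self.size.pop(f)
        return hit
```

A file-centric baseline (plain LRU over individual files) differs only in keeping the OrderedDict keyed by file; replaying the same access log through both is the kind of comparison the abstract describes.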
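The “delayed hits” effect can likewise be illustrated with a toy model, again my own sketch rather than anything from the thesis. The single `fetch_time` parameter stands in for the limited connectivity between computing and storage sites, and an infinite cache is assumed so each object is fetched at most once.

```python
def request_latencies(trace, fetch_time):
    """Per-request latency when requests arriving during an in-flight
    fetch must wait for it (a 'delayed hit') instead of counting as a
    free hit or a fresh miss.

    trace: list of (timestamp, object) pairs in time order.
    """
    ready_at = {}   # object -> time its (single) fetch completes
    latencies = []
    for t, obj in trace:
        if obj not in ready_at:            # true miss: start the fetch
            ready_at[obj] = t + fetch_time
            latencies.append(fetch_time)
        else:                              # delayed hit, or plain hit
            latencies.append(max(0.0, ready_at[obj] - t))
    return latencies

# With a 10 s fetch, the request at t=1 is neither hit nor miss: it
# waits 9 s for the transfer started at t=0.
assert request_latencies([(0, "a"), (1, "a"), (20, "a")], 10) == [10, 9, 0.0]
```

Counting the t=1 request as an ordinary hit would overstate the cache’s benefit, which is why the phenomenon matters when evaluating caching efficiency.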
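Finally, a sketch of how a two-stage, ML-informed eviction decision might be wired together, assuming scikit-learn’s RandomForestClassifier. The four features and the training labels are invented placeholders, not the thesis’s feature set; the point is only the division of labour, where the learned reuse predictor ranks candidates and a classic recency rule breaks ties.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stage 1: train a forest to predict near-term file reuse.
# Placeholder features: e.g. file size, hours since last access,
# accesses so far, accesses to the file's dataset.
rng = np.random.default_rng(42)
X = rng.random((1000, 4))
y = (X[:, 1] < 0.3).astype(int)   # toy label: recently touched => reused
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def choose_victim(cached, features, last_access):
    """Stage 2: evict the cached file with the lowest predicted reuse
    probability, falling back to LRU order on ties."""
    p_reuse = model.predict_proba(
        np.array([features[f] for f in cached]))[:, 1]
    ranked = sorted(zip(p_reuse, (last_access[f] for f in cached), cached))
    return ranked[0][2]   # filename of the coldest candidate
```

Keeping an established eviction algorithm in the loop, rather than letting the model decide alone, is one plausible reading of the “more resilient” combination the abstract mentions: a poorly calibrated predictor degrades towards plain LRU rather than towards arbitrary evictions.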

Corresponding record in: Inspire
Email contact: [email protected]

Record created 2024-10-06, last modified 2024-11-13


Full text: PDF