002912513 001__ 2912513
002912513 005__ 20241113153323.0
002912513 0248_ $$aoai:cds.cern.ch:2912513$$pcerncds:FULLTEXT$$pcerncds:THESES$$pcerncds:CERN:FULLTEXT$$pINIS$$pcerncds:CERN
002912513 037__ $$aCERN-THESIS-2024-174
002912513 035__ $$9Inspire$$a2806654
002912513 041__ $$aeng
002912513 100__ $$0AUTHOR|(CDS)2676256$$0AUTHOR|(SzGeCERN)839917$$aChuchuk, [email protected]$$uU. Cote d'Azur
002912513 245__ $$aData access optimisation at CERN and in the Worldwide LHC Computing Grid (WLCG)
002912513 242__ $$aOptimisation de l’accès aux données au CERN et dans la Grille de calcul mondiale pour le LHC (WLCG)
002912513 260__ $$c2024
002912513 269__ $$c19/02/2024
002912513 300__ $$a132 p
002912513 500__ $$aPresented 19 Feb 2024
002912513 502__ $$aPhD$$bU. Cote d'Azur$$c2024
002912513 520__ $$aThe Worldwide LHC Computing Grid (WLCG) provides an extensive distributed computing infrastructure for the scientific community working with CERN’s Large Hadron Collider (LHC). With roughly an exabyte of storage, the WLCG serves the data processing and storage requirements of thousands of international scientists. As the High-Luminosity LHC (HL-LHC) phase approaches, the volume of data to be analysed will increase steeply, outpacing the gains expected from advances in storage technology. New approaches to effective data access and management, such as caches, therefore become essential. This thesis explores storage access within the WLCG, aiming to increase the aggregate science throughput while limiting cost. Central to this research is the analysis of real file access logs from the WLCG monitoring system, which reveal actual usage patterns. Caching in a scientific setting differs from caching in commercial applications such as video streaming: scientific data caches deal with file sizes ranging from a few bytes to multiple terabytes, and the logical associations between files considerably influence user access patterns. Traditional caching research, in contrast, has predominantly assumed uniform file sizes and independent reference models. My investigations show how the LHC’s hierarchical data organisation, particularly its division into datasets, shapes request patterns. Building on this, I introduce caching policies that exploit dataset-specific knowledge and compare their effectiveness with traditional file-centric strategies. 
Furthermore, my findings highlight the “delayed hits” phenomenon caused by limited connectivity between computing and storage sites, and its potential impact on caching efficiency. Predicting data popularity is a long-standing challenge in the High Energy Physics (HEP) community, made more pressing by the storage constraints of the upcoming HL-LHC era, and my research applies Machine Learning (ML) tools to it. Specifically, I employ the Random Forest algorithm, known for its suitability for Big Data. Using ML to predict future file reuse, I present a two-stage method to inform cache eviction policies. This strategy combines predictive analytics with established cache eviction algorithms to produce a more resilient caching system for the WLCG. In conclusion, this research underscores the significance of robust storage services and suggests a direction towards stateless caches for smaller sites, which would alleviate complex storage management requirements and open the path to an additional level in the storage hierarchy. Through this thesis, I aim to address the challenges of data storage and retrieval with more efficient methods that meet the evolving needs of the WLCG and its global community.
002912513 536__ $$aCERN Doctoral Student Program
002912513 595__ $$aCERN EDS
002912513 65017 $$2SzGeCERN$$aComputing and Computers
002912513 690C_ $$aCERN
002912513 690C_ $$aTHESIS
002912513 701__ $$aSchulz, Markus$$edir.$$uCERN
002912513 701__ $$aNeglia, Giovanni$$edir.$$uU. Cote d'Azur
002912513 710__ $$5IT
002912513 859__ [email protected]
002912513 8564_ $$82562305$$s13915934$$uhttp://cds.cern.ch/record/2912513/files/CERN-THESIS-2024-174.pdf
002912513 916__ $$sn$$w202440$$ya2024
002912513 963__ $$aPUBLIC
002912513 960__ $$a14
002912513 980__ $$aTHESIS
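
The abstract contrasts file-centric caching with policies that use dataset membership. The sketch below is only an illustration of that contrast and is not taken from the thesis: both functions, the toy trace, and the choice of LRU as the base policy are assumptions made here. The first cache tracks recency per file; the second tracks recency per dataset, so an access to any file refreshes its whole dataset and eviction removes files from the least recently used dataset first.

```python
from collections import OrderedDict


def file_lru_hits(trace, sizes, capacity):
    """Count hits for a byte-capacity LRU cache that treats files independently."""
    cache = OrderedDict()  # file -> size, least recently used first
    used = hits = 0
    for name in trace:
        if name in cache:
            cache.move_to_end(name)
            hits += 1
            continue
        size = sizes[name]
        if size > capacity:
            continue  # file cannot fit in the cache at all
        while used + size > capacity:
            _, evicted = cache.popitem(last=False)
            used -= evicted
        cache[name] = size
        used += size
    return hits


def dataset_lru_hits(trace, sizes, dataset_of, capacity):
    """Count hits for an LRU cache whose recency is tracked per dataset (hypothetical policy)."""
    cache = OrderedDict()  # dataset -> {file: size}, least recently used dataset first
    used = hits = 0
    for name in trace:
        dataset = dataset_of[name]
        if dataset in cache:
            cache.move_to_end(dataset)  # any access refreshes the whole dataset
            if name in cache[dataset]:
                hits += 1
                continue
        size = sizes[name]
        if size > capacity:
            continue
        while used + size > capacity:
            victim = next(iter(cache))  # least recently used dataset
            _, evicted = cache[victim].popitem()
            used -= evicted
            if not cache[victim]:
                del cache[victim]
        cache.setdefault(dataset, {})[name] = size
        used += size
    return hits


# Toy trace (invented): dataset A has two files, B and C have one each.
trace = ["a1", "b1", "a2", "c1", "a1", "a2"]
sizes = {name: 1 for name in trace}
dataset_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}

print(file_lru_hits(trace, sizes, capacity=3))                 # 1
print(dataset_lru_hits(trace, sizes, dataset_of, capacity=3))  # 2
```

On this toy trace the file-level cache evicts a1 to make room for c1, while the dataset-level cache evicts b1 instead because dataset A was touched more recently. Real workloads, variable file sizes, and the policies studied in the thesis may behave quite differently.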