1.
Monitoring techniques and alarm procedures for CMS services and sites in WLCG
/ Molina-Perez, Jorge Amando (UC, San Diego)
/CMS Collaboration
The CMS offline computing system is composed of roughly 80 sites (including most experienced T3s) and a number of central services to distribute, process and analyze data worldwide. A high level of stability and reliability is required from the underlying infrastructure and services, partially covered by local or automated monitoring and alarming systems such as Lemon and SLS: the former collects metrics from sensors installed on computing nodes and triggers alarms when values are out of range, while the latter measures the quality of service and warns managers when the service is affected. [...] (An illustrative sketch of this threshold-alarm pattern follows this record.)
CMS-CR-2012-100.-
Geneva : CERN, 2012 - 9 p.
- Published in : J. Phys.: Conf. Ser. 396 (2012) 042041
Fulltext: PDF;
In : Computing in High Energy and Nuclear Physics 2012, New York, NY, USA, 21 - 25 May 2012, pp.042041
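A minimal sketch of the threshold-based metric alarming described in the record above, assuming hypothetical metric names, ranges, and a print-based notification; this is not Lemon's actual sensor interface:

```python
# Minimal sketch of threshold-based alarming (illustrative only; not Lemon's API).
# Metric names and acceptable ranges are assumptions for demonstration.

THRESHOLDS = {
    "cpu_load": (0.0, 20.0),             # acceptable (min, max)
    "disk_used_fraction": (0.0, 0.9),
    "service_availability": (0.95, 1.0),
}

def check_metrics(samples: dict) -> list:
    """Return alarm messages for metrics whose values fall outside their allowed range."""
    alarms = []
    for name, value in samples.items():
        low, high = THRESHOLDS.get(name, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            alarms.append(f"ALARM {name}={value} outside [{low}, {high}]")
    return alarms

if __name__ == "__main__":
    # Example readings from a hypothetical computing node
    readings = {"cpu_load": 35.2, "disk_used_fraction": 0.62, "service_availability": 0.91}
    for alarm in check_metrics(readings):
        print(alarm)
```

In the systems described above, an out-of-range value would feed an operator alarm chain rather than standard output.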
2.
Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
Reference: Poster-2013-391
Keywords: WLCG GGUS ALARM ticket Storage CASTOR Batch LSF CERN KIT fail-safe Tier0 Tier1 workflow incident
Created: 2013. - 1 p.
Creator(s): Dimou, M; Dres, H; Dulov, O; Grein, G
In the Worldwide LHC Computing Grid (WLCG) project the Tier centres are of paramount importance for storing and accessing experiment data and for running the batch jobs necessary for experiment production activities. Although Tier2 sites provide a significant fraction of the resources, unavailability of resources at the Tier0 or the Tier1s can seriously harm not only WLCG Operations but also the experiments' workflow and the storage of LHC data, which are very expensive to reproduce. This is why availability requirements for these sites are high and committed to in the WLCG Memorandum of Understanding (MoU). In this talk we describe the workflow of GGUS ALARMs, the only 24/7 mechanism available to LHC experiment experts for reporting problems with their Critical Services to the Tier0 or the Tier1s. Conclusions and experience gained from the detailed drills performed for each such ALARM over the last 4 years are explained, as well as how the types of problems encountered have shifted over time. The physical infrastructure put in place to achieve GGUS 24/7 availability is summarised. (An illustrative sketch of an alarm-ticket escalation check follows this record.)
Presented at the 20th International Conference on Computing in High Energy and Nuclear Physics 2013, Amsterdam, Netherlands, 14 - 18 Oct 2013.
© CERN Geneva
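A minimal sketch of the kind of ALARM-ticket record and round-the-clock escalation check described in the record above; the field names, the 30-minute window, and the example site and service are illustrative assumptions, not the GGUS schema or MoU targets:

```python
# Illustrative ALARM-ticket record with a simple escalation rule (not the GGUS schema).
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class AlarmTicket:
    site: str            # e.g. a Tier0 or Tier1 centre
    service: str         # affected critical service
    opened: datetime
    acknowledged: bool = False

def needs_escalation(ticket: AlarmTicket, now: datetime, max_wait_minutes: int = 30) -> bool:
    """Escalate if the site has not acknowledged the alarm within the assumed window."""
    return not ticket.acknowledged and now - ticket.opened > timedelta(minutes=max_wait_minutes)

if __name__ == "__main__":
    ticket = AlarmTicket(site="CERN-PROD", service="CASTOR",
                         opened=datetime.now(timezone.utc) - timedelta(minutes=45))
    print("escalate" if needs_escalation(ticket, datetime.now(timezone.utc)) else "wait")
```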
3.
ATLAS Distributed Computing Automation
/ Schovancova, J (Academy of Sciences of the Czech Republic) ; Barreiro Megino, F H (CERN) ; Borrego, C (Physics Department, Universidad Autonoma de Madrid) ; Campana, S (CERN) ; Di Girolamo, A (CERN) ; Elmsheuser, J (Fakultaet fuer Physik, Ludwig-Maximilians-Universitaet Muenchen) ; Hejbal, J (Academy of Sciences of the Czech Republic) ; Kouba, T (Academy of Sciences of the Czech Republic) ; Legger, F (Fakultaet fuer Physik, Ludwig-Maximilians-Universitaet Muenchen) ; Magradze, E (Georg-August-Universitat Goettingen, II. Physikalisches Institut) et al.
The ATLAS Experiment benefits from computing resources distributed worldwide at more than 100 WLCG sites. The ATLAS Grid sites provide over 100k CPU job slots and over 100 PB of storage space on disk or tape. [...]
ATL-SOFT-SLIDE-2012-429.-
Geneva : CERN, 2012 - 12 p.
Fulltext: PDF; External link: Original Communication (restricted to ATLAS)
In : 5th International Conference "Distributed Computing and Grid-technologies in Science and Education", Dubna, Russian Federation, 16 - 20 Jul 2012
4.
ATLAS Distributed Computing Automation
/ Schovancova, J (Prague, Inst. Phys.) ; Barreiro Megino, F H (CERN) ; Borrego, C (Madrid Autonoma U.) ; Campana, S (CERN) ; Di Girolamo, A (CERN) ; Elmsheuser, J (LMU Munich) ; Hejbal, J (Prague, Inst. Phys.) ; Kouba, T (Prague, Inst. Phys.) ; Legger, F (LMU Munich) ; Magradze, E (Gottingen U.) et al.
The ATLAS Experiment benefits from computing resources distributed worldwide at more than 100 WLCG sites. [...]
ATL-SOFT-PROC-2012-067.-
2012. - 6 p.
Original Communication (restricted to ATLAS) - Full text
5.
Xrootd Monitoring for the CMS experiment
/ Tadel, Matevz (UC, San Diego)
/CMS Collaboration
During spring and summer 2011 CMS deployed Xrootd front-end servers on all US T1 and T2 sites. This allows for remote access to all experiment data and is used for user analysis, visualization, running of jobs at T2s and T3s when data is not available at local sites, and as a fail-over mechanism for data access in CMSSW jobs. Monitoring of the Xrootd infrastructure is implemented on three levels. [...] (An illustrative sketch of this local-first, Xrootd fall-back access pattern follows this record.)
CMS-CR-2012-086.-
Geneva : CERN, 2012 - 10 p.
Fulltext: PDF;
In : Computing in High Energy and Nuclear Physics 2012, New York, NY, USA, 21 - 25 May 2012
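A minimal sketch of the local-first, Xrootd fall-back file access pattern described in the record above; the storage prefix, redirector hostname, and file name are placeholders, not CMS configuration, and the real fail-over logic lives inside CMSSW itself:

```python
# Illustrative local-first lookup with Xrootd fall-back (placeholder paths and hosts).
import os

LOCAL_PREFIX = "/store"                           # hypothetical local storage mount
REDIRECTOR = "root://xrootd.example.org//store"   # placeholder federation redirector

def resolve(lfn: str) -> str:
    """Return a local path if the file is present, otherwise an Xrootd URL."""
    local_path = os.path.join(LOCAL_PREFIX, lfn.lstrip("/"))
    if os.path.exists(local_path):
        return local_path
    return REDIRECTOR + "/" + lfn.lstrip("/")

if __name__ == "__main__":
    print(resolve("data/Run2011A/example.root"))
```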
6.
The commissioning of CMS sites : improving the site reliability
/ Belforte, S (INFN, Trieste) ; Fisk, I (Fermilab) ; Hernández, J M (Madrid, CIEMAT) ; Klem, J (Helsinki U.) ; Letts, J (UC, San Diego) ; Magini, N (CERN ; INFN, CNAF) ; Saiz, P (CERN) ; Sciabà, A (CERN) ; Flix, J (PIC, Bellaterra ; Madrid, CIEMAT)
The computing system of the CMS experiment works using distributed resources from more than 60 computing centres worldwide. These centres, located in Europe, America and Asia, are interconnected by the Worldwide LHC Computing Grid. [...]
CMS-CR-2009-089.-
Geneva : CERN, 2010 - 11 p.
- Published in : J. Phys.: Conf. Ser. 219 (2010) 062047
Fulltext: PDF;
In : 17th International Conference on Computing in High Energy and Nuclear Physics, Prague, Czech Republic, 21 - 27 Mar 2009, pp.062047
7.
COMPASS Production System Overview
/ Petrosyan, Artem (Dubna, JINR)
Migration of COMPASS data processing to the Grid environment started in 2015 with a small prototype deployed on a single virtual machine. Since the summer of 2017, the system has been working in production mode, distributing jobs to two traditional Grid sites: CERN and JINR. [...]
EDP Sciences, 2019 - 8 p.
- Published in : EPJ Web Conf. 214 (2019) 03039
Fulltext: PDF;
In : 23rd International Conference on Computing in High Energy and Nuclear Physics, CHEP 2018, Sofia, Bulgaria, 9 - 13 Jul 2018, pp.03039