Abstract
The Trigger and Data Acquisition (TDAQ) system of the ATLAS detector is composed of a large number of distributed hardware and software components (about 3000 machines and more than 15000 concurrent processes at the end of LHC Run 1) which, in a coordinated manner, provide the data-taking functionality of the overall system. The Run Control (RC) system steers the data acquisition by starting and stopping processes and by carrying all data-taking elements through well-defined states in a coherent way (finite state machine pattern). The RC is organized as a hierarchical tree (the run control tree) of run controllers, following the functional decomposition of the ATLAS detector into systems and sub-systems. During the LHC Long Shutdown 1 (LS1) the RC was completely redesigned and re-implemented, both to fulfill new requirements that emerged during Run 1 and were not foreseen in the initial design phase, and to improve the error-management and recovery mechanisms. Indeed, given the size and complexity of the TDAQ system, errors and failures are bound to happen and must be dealt with: the data acquisition system has to recover from them promptly and effectively, ideally without stopping data-taking operations. The RC is assisted by the Central Hint and Information Processor (CHIP), which can be considered its “brain”. CHIP supervises ATLAS data taking, takes operational decisions and handles abnormal conditions. It automates procedures and performs advanced recoveries. Furthermore, it can interact with the Test Management service, which allows it to make informed decisions based on the outcome of tests. CHIP is built on a third-party open-source complex event processing engine, ESPER. In this paper the design, implementation and performance of both the RC and CHIP will be described. Additionally, some error-recovery scenarios will be analyzed, with particular emphasis on the interaction between the RC and CHIP.
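
As a rough illustration of the kind of rule a complex-event-processing engine such as ESPER evaluates, the sketch below correlates hypothetical error reports and fires a callback once a threshold is exceeded. The event type, property names, EPL statement and thresholds are illustrative assumptions, not taken from CHIP, and the code assumes the Esper 5.x client API.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

public class CepSketch {
    // Hypothetical event standing in for an error report from a data-taking process.
    public static class ProcessError {
        private final String application;
        private final String severity;
        public ProcessError(String application, String severity) {
            this.application = application;
            this.severity = severity;
        }
        public String getApplication() { return application; }
        public String getSeverity() { return severity; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("ProcessError", ProcessError.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // Example EPL rule: react when more than 3 FATAL errors from the same
        // application arrive within a 10-second sliding window.
        EPStatement stmt = engine.getEPAdministrator().createEPL(
            "select application, count(*) as cnt "
            + "from ProcessError(severity='FATAL').win:time(10 sec) "
            + "group by application having count(*) > 3");

        stmt.addListener((newEvents, oldEvents) -> {
            String app = (String) newEvents[0].get("application");
            long cnt = (Long) newEvents[0].get("cnt");
            // A supervision component would start a recovery action here.
            System.out.println("Recovery would be triggered for " + app + " (" + cnt + " errors)");
        });

        // Inject a few synthetic events to exercise the rule.
        for (int i = 0; i < 5; i++) {
            engine.getEPRuntime().sendEvent(new ProcessError("HLT-PU-042", "FATAL"));
        }
    }
}
```

The pattern shown (declarative rules evaluated over a stream of operational events, with callbacks that launch actions) is the general CEP approach the abstract refers to; the actual rules, event schemas and recovery procedures used by CHIP are described in the body of the paper.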