Abstract
| Trigger and data acquisition (TDAQ) systems for modern HEP experiments are composed of thousands of hardware and software components that depend on each other in complex ways. Typically, such systems are operated by non-expert shift operators who are not familiar with the details of how the system works. It is therefore necessary to help the operator control the system and to minimize system down-time by providing knowledge-based facilities for automatic testing and verification of system components, as well as for error diagnostics and recovery. For this purpose, a verification and diagnostic framework was developed in the scope of ATLAS TDAQ. The verification functionality of the framework allows developers to configure simple low-level tests for any component in a TDAQ configuration. A test can be configured as one or more processes running on different hosts. The framework organizes tests in sequences, using knowledge about the component hierarchy and dependencies, and allows the operator to verify the functionality of any subset of the system. The diagnostics functionality includes the ability to analyze test results and diagnose detected errors, e.g. by starting additional tests and determining the causes of failures. Conclusions about system functionality, the error diagnosis, and recovery advice are presented to the operator in a GUI. The current implementation uses the CLIPS expert system shell for knowledge representation and reasoning. |
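
The idea of sequencing tests according to component dependencies can be illustrated with a minimal sketch. It is written in Python for readability and is not the framework's CLIPS implementation; the component names, the `depends_on` mapping, the result labels, and the function `run_tests_in_dependency_order` are all hypothetical and chosen only for illustration. The sketch assumes that a component is tested only after everything it depends on has passed, and is otherwise reported as untested.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List


def run_tests_in_dependency_order(
    depends_on: Dict[str, List[str]],
    tests: Dict[str, Callable[[], bool]],
) -> Dict[str, str]:
    """Run each component's test after the tests of its dependencies.

    A component is marked UNTESTED when any dependency did not pass,
    so that the first failure in a chain is easy to spot.
    """
    results: Dict[str, str] = {}
    # static_order() yields every component after all of its dependencies.
    for component in TopologicalSorter(depends_on).static_order():
        deps = depends_on.get(component, [])
        if any(results[dep] != "PASSED" for dep in deps):
            results[component] = "UNTESTED"
            continue
        results[component] = "PASSED" if tests[component]() else "FAILED"
    return results


# Hypothetical three-component configuration (names are assumptions).
depends_on = {
    "host": [],
    "readout_app": ["host"],
    "event_builder": ["readout_app"],
}
tests = {
    "host": lambda: True,            # stand-in for a low-level host check
    "readout_app": lambda: False,    # simulated failure
    "event_builder": lambda: True,   # not reached: its dependency failed
}

results = run_tests_in_dependency_order(depends_on, tests)
for name, status in results.items():
    print(f"{name}: {status}")
```

In this sketch the failure of `readout_app` prevents `event_builder` from being tested, which points the operator at the failing dependency rather than at its downstream symptoms.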