Author(s)
| Sjoen, R (CERN ; Bergen U.) ; Stancu, S (Bucharest, Polytechnic Inst. ; UC, Irvine) ; Ciobotaru, M (Bucharest, Polytechnic Inst. ; UC, Irvine) ; Batraneanu, S M (CERN ; Bucharest, Polytechnic Inst.) ; Leahu, L (CERN ; Bucharest, Polytechnic Inst.) ; Martin, B (CERN) ; Al-Shabibi, A (CERN ; Heidelberg U.) |
Abstract
| The ATLAS data acquisition system consists of four different networks interconnecting up to 2000 processors using up to 200 edge switches and five multi-blade chassis devices. The architecture of the system is described in [1] and its operational model in [2]. Classical SNMP-based network monitoring provides statistics on aggregate traffic, but for performance monitoring and troubleshooting there was a pressing need to identify and quantify single traffic flows. sFlow [3], an industry standard based on statistical sampling, is designed to address this need. Due to the size of the ATLAS network, the collection and analysis of the sFlow data from all devices creates a data handling problem of its own. This paper describes how this problem is addressed by making it possible to collect and store data either centrally or in a distributed fashion, according to need. The methods used to present the results in a form relevant to system analysts are discussed, and we explore the possibilities and limitations of this diagnostic tool, giving an example of its use in solving system problems that arise during ATLAS data taking. |
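
The sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes each sampled packet arrives with its source, destination, frame length and the 1-in-N sampling rate of the reporting device, and it scales each sample by N to estimate per-flow byte counts. The `Sample` type, its field names and `estimate_flow_bytes` are hypothetical, introduced here for illustration only; this is not the collector described in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Sample(NamedTuple):
    """One sampled packet header (hypothetical, simplified record)."""
    src: str            # source address
    dst: str            # destination address
    frame_bytes: int    # length of the sampled frame in bytes
    sampling_rate: int  # N, where the device samples 1 packet in N


def estimate_flow_bytes(samples: Iterable[Sample]) -> Dict[Tuple[str, str], int]:
    """Estimate the total bytes per (src, dst) flow from packet samples.

    Each sample stands in for roughly `sampling_rate` packets, so its
    frame length is scaled by that factor before being accumulated.
    """
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for s in samples:
        totals[(s.src, s.dst)] += s.frame_bytes * s.sampling_rate
    return dict(totals)
```

Because the estimate is statistical, its accuracy improves with the number of samples seen for a flow; short-lived or low-volume flows may be missed entirely.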