huawei.com
White Paper on Data Center Infrastructure Smart O&M
Introduction
Nowadays, the O&M security of most data centers depends on experienced, qualified, and well-trained O&M teams. Some mature data centers have developed a comprehensive O&M process and a training system to minimize the impact of occasional incidents and personnel turnover on O&M security. In a few cutting-edge data centers, digital and intelligent means are driving secure and sustainable O&M. This white paper describes five phases from conventional O&M to smart O&M, and the typical features of each phase. Data center management personnel can use this information to determine the current phase they are in, how data centers evolve, and how to improve their existing data centers. For teams in the conventional O&M phase, this white paper also gives an overview of availability management and the corresponding digital and intelligent measures for data center infrastructure. With this information, the O&M team can better standardize O&M management, formulate a smart O&M upgrade plan, and transform from conventional O&M to smart O&M. Smart O&M tools are helping O&M become more efficient, more secure, and more sustainable.
Figure 1 shows the evolution of O&M from conventional O&M to smart O&M. The x axis indicates the evolution to smart O&M, and the y axis indicates the comprehensiveness and complexity of the O&M process. The conventional O&M phase lacks intelligent means, and O&M security depends on the experience and expertise of the O&M team. Therefore, sustainable management depends on the process and on continuous improvement of the training system. O&M efficiency decreases to some extent as the process is continuously improved. However, as the O&M team becomes more familiar with the process, O&M efficiency is restored.

In the conventional O&M phase, O&M may be misunderstood in the following aspects:

1. Overdependence on O&M teams or individuals often impedes process establishment and experience accumulation.

2. The rigid use of processes will eventually cause the O&M team to lose patience with the process. As a result, the actual O&M operations do not comply with the process. The O&M team needs to combine the process with the actual situation and have rich O&M experience in handling the actual situation while meeting the requirements specified for each process activity.

3. Some O&M teams with abundant experience and established processes are prone to complacency. They do not welcome any intelligent means and reject suggestions on improving O&M efficiency. They insist that efficiency improvement will inevitably affect O&M security.

In the intelligent O&M phase, digital and intelligent means are used to solidify and simplify processes. Cloud-based O&M experts and automated methods minimize the need for human labor and greatly improve O&M efficiency. O&M security is not affected, or even improves. Smart O&M can not only resolve the O&M manpower shortage in existing data centers, but also help eliminate dependence on individuals and teams through continuous consolidation and optimization of processes, and accumulation of experience and expertise.
Figure 1 (chart labels: traditional O&M, O&M evolution; axes: intelligentization degree, O&M process)
To clearly describe the evolution from conventional O&M to smart O&M, this white paper defines phases L0 to L5 and describes the typical characteristics of each phase in detail.
Figure 2 (phases: L0 Manual O&M, L1 Standard O&M, L2 Mature O&M, L3 Digital O&M, L4 Automatic O&M, L5 Fully automatic O&M)
L0 manual O&M: There is no standard O&M process. O&M depends on the experience of individuals or teams. The O&M quality cannot be evaluated.

L1 standard O&M: A standard process is formed, and the O&M team can be developed through training and other methods. However, some processes are too rigid or some practices deviate from the processes. The O&M efficiency is low. O&M highly depends on teams and core members. The O&M quality is difficult to evaluate. The O&M automation level is low, and monitoring, automatic control, and other auxiliary systems are used to facilitate O&M.

L2 mature O&M: The O&M process is becoming mature and the O&M quality is ensured. However, the O&M efficiency is low. The O&M team development is emphasized. Team capabilities can be well transferred and inherited, but cannot be optimized independently. The auxiliary systems are more comprehensive, and some core subsystems support automation.

L3 digital O&M: Based on capabilities in the L2 phase, IT-enabled digital O&M activities are used to manage and drive the execution of O&M processes. The O&M quality can be accurately evaluated based on big data analysis, and the O&M efficiency can be greatly improved. Key subsystems, such as power distribution and cooling subsystems, can be automatically operated and maintained. Infrastructure resources can adapt to changes in IT and cloud service requirements and can be managed in a closed-loop manner. Machine intelligence, for example, artificial intelligence (AI), can replace human intelligence in specific fields such as energy saving and fault prediction.

L4 automatic O&M: The infrastructure can be automatically operated and maintained without dedicated infrastructure engineers. Instead, this job can be taken care of by IT engineers. The infrastructure O&M efficiency is maximized, the process complexity is greatly reduced, and infrastructure resources can be dynamically adjusted based on IT and cloud service requirements. Machine intelligence is fully applicable to O&M.

L5 fully automatic O&M: Automatic infrastructure sensing and prediction enable resources to be optimally adjusted based on IT and cloud service requirements. Thanks to automatic and closed-loop management of possible service faults, unattended data center O&M is now possible.
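
As a rough illustration of how a team might place itself on the L0 to L5 scale, the following Python sketch treats each phase as a checklist item that builds on the previous one. The flag names and the one-flag-per-phase simplification are assumptions made for this example only; the white paper describes the phases qualitatively and does not define a scoring method.

```python
from dataclasses import dataclass
from enum import IntEnum


class Phase(IntEnum):
    L0_MANUAL = 0
    L1_STANDARD = 1
    L2_MATURE = 2
    L3_DIGITAL = 3
    L4_AUTOMATIC = 4
    L5_FULLY_AUTOMATIC = 5


@dataclass
class Capabilities:
    # Hypothetical yes/no checklist; each flag paraphrases one phase description.
    standard_process: bool = False              # L1: a standard process is formed
    quality_ensured: bool = False               # L2: mature process, quality ensured
    digital_process_execution: bool = False     # L3: IT-enabled digital O&M
    no_dedicated_infra_engineers: bool = False  # L4: IT engineers run the infrastructure
    unattended_closed_loop: bool = False        # L5: unattended, closed-loop O&M


def assess(c: Capabilities) -> Phase:
    """Return the highest phase whose flag, and all lower flags, are set."""
    flags = [
        c.standard_process,
        c.quality_ensured,
        c.digital_process_execution,
        c.no_dedicated_infra_engineers,
        c.unattended_closed_loop,
    ]
    phase = Phase.L0_MANUAL
    for level, satisfied in enumerate(flags, start=1):
        if not satisfied:
            break  # a gap at this level caps the assessment below it
        phase = Phase(level)
    return phase
```

Because each phase builds on the previous one, a team that has digital tools but no quality-assured process is still assessed at L1 in this sketch.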
Infrastructure O&M includes physical security management, infrastructure availability management, equipment room
capacity management, vendor management, and comprehensive management. Availability management is the main part
of O&M, including most routine activities, such as preventive maintenance inspection (PMI), periodic device maintenance,
risk management, repair, and emergency drills.
Figure 3
Digital O&M refers to the digitization of O&M processes, human activities, and execution results. Digitization can standardize human activities and reduce risks caused by misoperations. The O&M process can be continuously optimized by customizing and optimizing templates and tasks. The entire process is recorded, so the execution results are not only visible but can also be analyzed and summarized to improve O&M. The following describes an example.
Electronic PMI
The O&M personnel of a data center periodically inspect an equipment room to check whether its security, fire control, smart cooling products, and power distribution are normal, and whether the room is emitting any unusual smell. In the conventional O&M approach, the O&M personnel fill in PMI items and write down remarks on paper files. Paper files are difficult to search and view, and they make optimization analysis difficult. Electronic PMI makes all processes and human activities digitally driven, monitors execution by the IT O&M personnel, and provides the analysis that users are most concerned about, such as PMI execution status, execution efficiency, and completion progress. The data center infrastructure management (DCIM) + app approach is used to perform standard, electronic routine PMI.
Electronic PMI automatically executes the plan-do-check-action (PDCA) cycle to achieve a mobile, standard, visual, and optimizable O&M process.
• Periodically initiate PMI tasks, including daily and weekly scheduled PMI tasks.
• Send notifications by short message service (SMS) message or email.
• Use a personal account to log in to the mobile app.
• Obtain information about device types in a current PMI task.
• Use the DCIM system to automatically obtain the real-time key indicators of devices based on the device types, and compare them with the information displayed on the device panel. Take photos of important device statuses or running parameters and upload the PMI results to the back-end system in real time. The back-end system uses AI to automatically compare and analyze the PMI results and check whether they meet requirements.
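
The comparison step in the last item can be sketched in Python as follows. This is a minimal illustration, not the DCIM product's interface: the `PmiItem` fields, the per-item tolerance, and the sample device names are all assumptions, and a simple numeric comparison stands in for the AI-based analysis.

```python
from dataclasses import dataclass


@dataclass
class PmiItem:
    # Hypothetical record for one inspected indicator.
    device_id: str
    indicator: str
    dcim_value: float   # value collected automatically by the DCIM system
    panel_value: float  # value the inspector reads from the device panel
    tolerance: float    # largest acceptable absolute difference (assumed)


def find_mismatches(items: list[PmiItem]) -> list[PmiItem]:
    """Return items whose panel reading disagrees with the DCIM reading."""
    return [
        item
        for item in items
        if abs(item.dcim_value - item.panel_value) > item.tolerance
    ]


inspection = [
    PmiItem("ups-01", "output_voltage_v", 230.0, 229.5, 1.0),
    PmiItem("crac-03", "supply_air_temp_c", 24.0, 27.0, 1.0),
]
mismatches = find_mismatches(inspection)
```

Here only the `crac-03` item is returned, because its panel reading differs from the collected value by more than the tolerance; such an item would be a candidate for follow-up in the uploaded PMI results.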
Risks are mainly caused by manual input in the conventional O&M approach. In addition to monitoring system discovery
and expert identification, digital O&M automatically identifies risks during O&M activities and triggers risk management. For
example, PMI items are directly generated for non-compliance items found during electronic PMI (rules can be defined in the
PMI template). This approach, in which O&M security depends on the DCIM system, is superior to the conventional approach, in which O&M security depends on the O&M team's expertise and sense of accountability.
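
The idea of defining rules in the PMI template and generating follow-up items automatically can be sketched as below. The rule format (an allowed range per indicator), the class names, and the sample limits are assumptions for illustration; the white paper does not specify how template rules are expressed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemplateRule:
    # Hypothetical rule: the indicator must stay within [low, high].
    indicator: str
    low: float
    high: float


@dataclass(frozen=True)
class FollowUpItem:
    # Hypothetical item raised for a non-compliance found during PMI.
    device_id: str
    indicator: str
    value: Optional[float]
    reason: str


def check_against_template(
    device_id: str,
    readings: dict[str, float],
    rules: list[TemplateRule],
) -> list[FollowUpItem]:
    """Generate one follow-up item per rule that the readings violate."""
    items = []
    for rule in rules:
        value = readings.get(rule.indicator)
        if value is None:
            items.append(
                FollowUpItem(device_id, rule.indicator, None, "no reading recorded")
            )
        elif not rule.low <= value <= rule.high:
            items.append(
                FollowUpItem(
                    device_id,
                    rule.indicator,
                    value,
                    f"outside allowed range {rule.low}..{rule.high}",
                )
            )
    return items


# Sample limits are placeholders, not recommended operating values.
rules = [
    TemplateRule("room_temp_c", 18.0, 27.0),
    TemplateRule("humidity_pct", 20.0, 80.0),
]
items = check_against_template("room-a", {"room_temp_c": 31.5}, rules)
```

Because the rules live in the template rather than in an inspector's head, a missed or out-of-range reading produces a follow-up item regardless of who performs the inspection.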
Figure 5 Risks
AI is gaining popularity in infrastructure O&M activities, especially in device fault prediction. Effective sample data and human experience can be used to quickly train a fault prediction model with high accuracy. Once device faults can be predicted, routine PMI and maintenance become more targeted O&M activities. Ever-improving prediction accuracy may eventually eliminate the need for routine manual O&M.
Figure 7 (diagram labels: sound, output waveform, temperature rise curve; model training; model import)
Figure 7 is a typical schematic diagram of AI fault prediction on a power supply link. An AI training platform is used to train a fault prediction model. The sample data required for training comes from a DCIM collection system and includes fault characteristic data, such as temperatures, voltages, currents, sounds, and images. Human experience or a tested rule can greatly reduce training difficulty and improve prediction. For example, the relatively deterministic relationship between faults of electronic components (such as a capacitor) and temperature rises, shown in Figure 8, helps AI predict faults better.
Figure 8 (chart labels: alarm threshold; x axis: time)
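
To make the relationship between a temperature rise curve and an alarm threshold concrete, the following Python sketch fits a straight line to recent temperature samples and estimates how long it will take to reach the threshold. This is a deliberately simple stand-in, not the trained AI model described above: the linear trend, the function name, and the sample values are assumptions for illustration.

```python
from typing import Optional, Sequence


def hours_until_threshold(
    times_h: Sequence[float],
    temps_c: Sequence[float],
    threshold_c: float,
) -> Optional[float]:
    """Estimate hours from the latest sample until the fitted trend hits the threshold.

    Returns 0.0 if the fitted value is already at or above the threshold,
    and None if the trend is flat or falling.
    """
    n = len(times_h)
    if n < 2 or n != len(temps_c):
        raise ValueError("need at least two samples of equal length")

    # Ordinary least-squares fit of temps_c = intercept + slope * times_h.
    mean_t = sum(times_h) / n
    mean_y = sum(temps_c) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    if sxx == 0:
        raise ValueError("sample times must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, temps_c))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t

    fitted_now = intercept + slope * times_h[-1]
    if fitted_now >= threshold_c:
        return 0.0
    if slope <= 0:
        return None
    return (threshold_c - fitted_now) / slope


# Placeholder samples: temperature rising 2 degrees per hour toward a 50 degree threshold.
remaining = hours_until_threshold([0, 1, 2, 3], [40.0, 42.0, 44.0, 46.0], 50.0)
```

With these sample values the fitted trend reaches the threshold about two hours after the last sample. A production model would combine more signals (voltages, currents, sounds, images) and learned fault characteristics, as described for Figure 7.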
About the author
Yuxue Xia
Yuxue Xia is currently responsible for architecture design and global marketing
of Huawei data center energy DCIM+ products. His career began in the
infrastructure and IT fields. He has devoted himself to software development,
architecture design, planning, and marketing of facility monitoring
management software for many years.
Xiaochun Huang
Xiaochun Huang is currently responsible for Huawei Enterprise Data Center
(EDC) operations and has more than 10 years of experience in data
center operation and maintenance management. He has been involved in the
operation and maintenance of Huawei's first-generation enterprise data
center, its decommissioning and business relocation, and the construction,
verification, and operation management of the second-generation enterprise
data center.
General Disclaimer
The information in this document may contain predictive statements including, without limitation,
statements regarding the future financial and operating results, future product portfolio, new
technology, etc. There are a number of factors that could cause actual results and developments
to differ materially from those expressed or implied in the predictive statements. Therefore, such
information is provided for reference purpose only and constitutes neither an offer nor an acceptance.
Huawei may change the information at any time without notice.
Copyright © Huawei Technologies Co., Ltd. 2019. All rights reserved.