
White Paper on Data Center Infrastructure Smart O&M

huawei.com

Introduction
Nowadays, the O&M security of most data centers depends on experienced, qualified, and well-trained O&M teams. Some mature data centers have developed a comprehensive O&M process and a training system to minimize the impact of occasional incidents and personnel turnover on O&M security. In a few cutting-edge data centers, digital and intelligent means are driving secure and sustainable O&M. This white paper describes five phases from conventional O&M to smart O&M, and the typical features of each phase. Data center management personnel can use this information to determine which phase they are currently in, how data centers evolve, and how to improve their existing data centers.

For teams in the conventional O&M phase, this white paper also gives an overview of availability management and the corresponding digital and intelligent measures for data center infrastructure. With this information, the O&M team can better standardize O&M management, formulate a smart O&M upgrade plan, and transform from conventional O&M to smart O&M. Smart O&M tools help O&M become more efficient, more secure, and more sustainable.


Evolution of Smart Data Center O&M

Figure 1 shows the evolution of O&M from conventional O&M to smart O&M. The x axis indicates the evolution to smart O&M, and the y axis indicates the comprehensiveness and complexity of the O&M process. Intelligent means are lacking in the conventional O&M phase, and O&M security depends on the experience and expertise of the O&M team. Therefore, sustainable management depends on the process and on continuous improvement of the training system. O&M efficiency decreases to some extent as the process is continuously improved. However, as the O&M team becomes more familiar with the process, O&M efficiency is restored.

In the conventional O&M phase, O&M may be misunderstood in the following aspects:

1. Overdependence on O&M teams or individuals often impedes process establishment and experience accumulation.

2. The rigid use of processes will eventually cause the O&M team to lose patience with the process. As a result, the actual O&M operations do not comply with the process. The O&M team needs to combine the process with the actual situation and have rich O&M experience in handling the actual situation while meeting the requirements specified for each process activity.

3. Some O&M teams with abundant experience and established processes are prone to complacency. They do not welcome any intelligent means and reject suggestions on improving O&M efficiency. They insist that efficiency improvement will inevitably affect O&M security.

In the intelligent O&M phase, digital and intelligent means are used to solidify and simplify processes. Cloud-based O&M experts and automated methods minimize the need for human labor and greatly improve O&M efficiency. O&M security is not affected, or even improves. Smart O&M can not only resolve the O&M manpower shortage in existing data centers, but also help eliminate dependence on individuals and teams through continuous consolidation and optimization of processes and accumulation of experience and expertise.
Figure 1 O&M evolution from traditional O&M to smart O&M (x axis: intelligentization degree; y axis: O&M process)


Five O&M Phases

To describe the evolution from conventional O&M to smart O&M clearly, this white paper defines phases L0 to L5 and details the typical characteristics of each phase.

L0 Manual O&M
• No standard process
• No training system
• Individual responsibilities
• O&M quality cannot be evaluated.

L1 Standard O&M
• Standardized but rigid process
• There is a training system.
• Rely on core backbone
• Difficult O&M quality evaluation and poor sustainability

L2 Mature O&M
• Mature processes
• Attach importance to the training system.
• Some automation tools are used.
• The O&M quality is guaranteed, and team building and sustainable development are emphasized.

L3 Digital O&M
• Electronic process and continuous optimization
• Digital technologies are fully applied, and AI takes the lead in some key tasks.
• The O&M quality can be evaluated and no longer depends on people and teams.

L4 Automatic O&M
• Automatic O&M of infrastructure
• The O&M efficiency reaches the maximum.
• Infrastructure resources automatically collaborate with IT and cloud services.

L5 Fully automatic O&M
• Automatic sensing, automatic adjustment, and automatic fault closure
• Intelligent prediction of service requirements and intelligent collaboration
• Unattended data center

Figure 2

L0 manual O&M: There is no standard O&M process. O&M depends on the experience of individuals or teams. The O&M quality cannot be evaluated.

L1 standard O&M: A standard process is formed, and the O&M team can be developed through training and other methods. However, some processes are too rigid or some practices deviate from the processes. The O&M efficiency is low. O&M highly depends on teams and core members. The O&M quality is difficult to evaluate. The O&M automation level is low, and monitoring, automatic control, and other auxiliary systems are used to facilitate O&M.

L2 mature O&M: The O&M process is becoming mature and the O&M quality is ensured. However, the O&M efficiency is low. The O&M team development is emphasized. Team capabilities can be well transferred and inherited, but cannot be optimized independently. The auxiliary systems are more comprehensive, and some core subsystems support automation.

L3 digital O&M: Based on capabilities in the L2 phase, IT-enabled digital O&M activities are used to manage and drive the execution of O&M processes. The O&M quality can be accurately evaluated based on big data analysis, and the O&M efficiency can be greatly improved. Key subsystems, such as power distribution and cooling subsystems, can be automatically operated and maintained. Infrastructure resources can adapt to changes in IT and cloud service requirements and can be managed in a closed-loop manner. Machine intelligence, for example, artificial intelligence (AI), can replace human intelligence in specific fields such as energy saving and fault prediction.

L4 automatic O&M: The infrastructure can be automatically operated and maintained without dedicated infrastructure engineers. Instead, this job can be taken care of by IT engineers. The infrastructure O&M efficiency is maximized, the process complexity is greatly reduced, and infrastructure resources can be dynamically adjusted based on IT and cloud service requirements. Machine intelligence is fully applicable to O&M.

L5 fully automatic O&M: Automatic infrastructure sensing and prediction enable resources to be optimally adjusted based on IT and cloud service requirements. Thanks to automatic and closed-loop management of possible service faults, unattended data center O&M is now possible.
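
The phase definitions above lend themselves to a rough self-assessment. The following Python sketch is only an illustration and not a method defined in this white paper: the capability names (`standard_process`, `mature_process`, and so on) and the rule that a phase counts only when every lower phase is also met are assumptions.

```python
from typing import Dict

# Hypothetical capability flags, one per phase, loosely derived from the
# phase descriptions above. The names are illustrative assumptions.
PHASE_REQUIREMENTS = [
    ("L1 standard O&M", "standard_process"),
    ("L2 mature O&M", "mature_process"),
    ("L3 digital O&M", "digital_process"),
    ("L4 automatic O&M", "automatic_infrastructure_om"),
    ("L5 fully automatic O&M", "automatic_fault_closure"),
]


def assess_phase(capabilities: Dict[str, bool]) -> str:
    """Return the highest phase whose capability, and all lower ones, are met."""
    phase = "L0 manual O&M"
    for name, capability in PHASE_REQUIREMENTS:
        if not capabilities.get(capability, False):
            break  # a missing capability caps the result at the previous phase
        phase = name
    return phase


site = {"standard_process": True, "mature_process": True, "digital_process": False}
print(assess_phase(site))
```

A real assessment would likely weigh several characteristics per phase (training system, quality evaluation, automation level) rather than a single flag.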


Infrastructure O&M Overview

Infrastructure O&M includes physical security management, infrastructure availability management, equipment room
capacity management, vendor management, and comprehensive management. Availability management is the main part
of O&M, including most routine activities, such as preventive maintenance inspection (PMI), periodic device maintenance,
risk management, repair, and emergency drills.

Data center digital O&M platform

Supplier management
• Evaluation management
• Contract management
• Communication management
• Risk management
• Service report

Availability management
• Equipment room health check
• UPS maintenance
• Precision air conditioner maintenance
• Facility risk identification
• Facility risk management
• Major overhaul
• Emergency drill
• Equipment life cycle
• Equipment room requirement management

Capacity management
• Visualization
• Data interconnection

Physical security
• Entry and exit registration
• Equipment room inspection
• Permission review
• Hosting authorization
• Card swiping record
• Storage media record
• Key review
• Physical security

Comprehensive architecture
• Comprehensive report statistics
• Interconnection with the ITSM system
• Microservice-based framework
• Domain Rights Management
• Mobile app architecture
• Risk management
• Equipment room O&M CP report

Figure 3


Digital and Intelligent O&M Activities

Digital O&M refers to the digitization of O&M processes, human activities, and execution results. Digitization standardizes human activities and reduces risks caused by misoperations. The O&M process can be continuously optimized by customizing and optimizing templates and tasks. The entire process is recorded, so the execution results are not only visible but can also be analyzed and summarized to improve O&M. The following describes an example.

Electronic PMI

The O&M personnel of a data center periodically inspect an equipment room to check whether the security, fire control, smart cooling product, and power distribution of the equipment room are normal, and whether the equipment room is emitting any unusual smell. In the conventional O&M approach, the O&M personnel fill in PMI items and write down remarks on paper files. The paper files are hard to search and view, and optimization analysis is difficult. Electronic PMI makes all processes and human activities digitally driven, monitors task execution by IT O&M personnel, and provides the analysis that users are most concerned about, such as PMI execution status, execution efficiency, and completion progress. The data center infrastructure management (DCIM) + app approach is used to perform standard and electronic routine PMI.

Electronic PMI automatically executes the plan-do-check-action (PDCA) cycle to achieve a mobile, standard, visual, and optimizable O&M process.

Task management (Plan):


The system provides a routine PMI task template. An administrator can design a task name, PMI items, PMI route, and
frequency based on the template, and send a task order to a PMI engineer.
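
As an illustration of the Plan step, the sketch below models a routine PMI template and a task order derived from it. It is a minimal sketch under stated assumptions: the class and field names (`PmiTemplate`, `PmiTaskOrder`, `frequency_days`) are hypothetical and do not describe the data model of any DCIM product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PmiTemplate:
    """Hypothetical routine PMI template; all field names are assumptions."""
    name: str
    items: List[str]      # check items, for example "UPS status normal"
    route: List[str]      # ordered locations the engineer visits
    frequency_days: int   # 1 = daily, 7 = weekly


@dataclass
class PmiTaskOrder:
    """A task order created from a template and sent to one PMI engineer."""
    task_name: str
    engineer: str
    items: List[str] = field(default_factory=list)
    route: List[str] = field(default_factory=list)


def create_task_order(template: PmiTemplate, engineer: str, date: str) -> PmiTaskOrder:
    # Copy the lists so later template edits do not change issued orders.
    return PmiTaskOrder(
        task_name=f"{template.name} {date}",
        engineer=engineer,
        items=list(template.items),
        route=list(template.route),
    )


template = PmiTemplate(
    name="Daily equipment room PMI",
    items=["Fire control panel normal", "UPS status normal", "No unusual smell"],
    route=["Power room", "Equipment room A"],
    frequency_days=1,
)
order = create_task_order(template, engineer="engineer01", date="2019-06-01")
print(order.task_name, len(order.items))
```

Copying the item and route lists into each order keeps issued orders stable when the template is later optimized in the Action step.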


App PMI (Do):


The PMI engineer can use the PMI app to quickly record problems, take photos onsite, and upload a PMI report in one-
click mode. The PMI engineer can also use the app to perform the following tasks:

• Periodically initiate PMI tasks, including daily and weekly scheduled PMI tasks.
• Send notifications by short message service (SMS) message or email.
• Use a personal account to log in to the mobile app.
• Obtain information about device types in a current PMI task.
• Use the DCIM system to automatically obtain the real-time key indicators of devices based on the device types, and compare the key indicators with the information displayed on the device panel. Take photos of important device statuses or running parameters and upload the PMI results to the background system in real time. The background system uses AI technology to automatically compare and analyze the PMI results and check whether they meet requirements.
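
The comparison between DCIM indicators and device panel readings can be pictured with a simple rule. The sketch below deliberately substitutes a fixed relative tolerance for the AI-based analysis mentioned above; the function name, the 2% tolerance, and the sample readings are all assumptions made for illustration.

```python
def reading_matches(dcim_value: float, panel_value: float, tolerance: float = 0.02) -> bool:
    """Return True if the panel reading is within a relative tolerance of the DCIM value.

    The 2% default tolerance is an illustrative assumption.
    """
    if dcim_value == 0:
        return panel_value == 0
    return abs(panel_value - dcim_value) / abs(dcim_value) <= tolerance


# Hypothetical readings: (indicator, DCIM value, value read from the device panel)
readings = [
    ("UPS output voltage (V)", 230.0, 231.5),
    ("Supply air temperature (degC)", 22.0, 24.5),
]
mismatches = [name for name, dcim, panel in readings if not reading_matches(dcim, panel)]
print(mismatches)
```

Indicators that end up in `mismatches` would be candidates for a remark or photo in the PMI report.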

Task execution check (Check):


Check PMI execution and quality.

Figure 4 PMI tasks
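
The Check step can be summarized with a few ratios covering execution status, efficiency, and quality. The following sketch assumes a hypothetical task record (`PmiTaskRecord`) and metric definitions chosen for illustration; they are not the metrics of any specific DCIM system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PmiTaskRecord:
    """Hypothetical execution record of one PMI task; fields are assumptions."""
    planned_minutes: int
    actual_minutes: Optional[int]  # None while the task is not finished
    items_total: int
    items_failed: int = 0


def summarize(records: List[PmiTaskRecord]) -> dict:
    """Compute completion, on-time, and item pass rates for a set of tasks."""
    done = [r for r in records if r.actual_minutes is not None]
    on_time = [r for r in done if r.actual_minutes <= r.planned_minutes]
    checked = sum(r.items_total for r in done)
    failed = sum(r.items_failed for r in done)
    return {
        "completion_rate": len(done) / len(records) if records else 0.0,
        "on_time_rate": len(on_time) / len(done) if done else 0.0,
        "item_pass_rate": (checked - failed) / checked if checked else 0.0,
    }


records = [
    PmiTaskRecord(planned_minutes=30, actual_minutes=25, items_total=10),
    PmiTaskRecord(planned_minutes=30, actual_minutes=40, items_total=10, items_failed=2),
    PmiTaskRecord(planned_minutes=30, actual_minutes=None, items_total=10),
    PmiTaskRecord(planned_minutes=30, actual_minutes=30, items_total=20),
]
print(summarize(records))
```

Tracking these ratios over time gives the Action step something concrete to optimize, such as the PMI frequency or the check items.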

Template and task optimization (Action):


Optimize the PMI template or adjust the PMI task. For example, adjust the PMI frequency based on the actual situation,
or add check items for the UPS.


Automatic Closed-Loop Risk Management

In the conventional O&M approach, risks enter the process mainly through manual input. In addition to monitoring system discovery and expert identification, digital O&M automatically identifies risks during O&M activities and triggers risk management. For example, risk items can be generated directly for non-compliance items found during electronic PMI (the rules can be defined in the PMI template). This approach, in which O&M security depends on the DCIM system, is more reliable than the conventional approach, in which O&M security depends on the expertise and sense of accountability of the O&M team.

Figure 5 Risks

Figure 6 Closed-loop risk management process
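
One way to picture closed-loop risk management is as a small state machine in which a risk raised from a non-compliant PMI item can leave the loop only by being closed. The sketch below is a minimal illustration: the state names and allowed transitions are assumptions and may not match the stages of the actual process.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RiskState(Enum):
    # Hypothetical states; the stages in the actual process may differ.
    IDENTIFIED = "identified"
    ASSIGNED = "assigned"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Allowed transitions; a risk leaves the loop only by reaching CLOSED.
TRANSITIONS = {
    RiskState.IDENTIFIED: {RiskState.ASSIGNED},
    RiskState.ASSIGNED: {RiskState.RESOLVED},
    RiskState.RESOLVED: {RiskState.CLOSED, RiskState.ASSIGNED},  # reopen if verification fails
    RiskState.CLOSED: set(),
}


@dataclass
class Risk:
    description: str
    state: RiskState = RiskState.IDENTIFIED
    history: List[RiskState] = field(default_factory=list)

    def move_to(self, new_state: RiskState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.history.append(self.state)
        self.state = new_state


def risks_from_pmi(results: dict) -> List[Risk]:
    """Open one risk per non-compliant PMI item (False means non-compliant)."""
    return [Risk(f"PMI non-compliance: {item}") for item, ok in results.items() if not ok]


open_risks = risks_from_pmi({"UPS battery voltage": True, "Air conditioner filter clean": False})
risk = open_risks[0]
for step in (RiskState.ASSIGNED, RiskState.RESOLVED, RiskState.CLOSED):
    risk.move_to(step)
print(risk.description, risk.state.value)
```

Rejecting transitions that skip a stage is what keeps the loop closed: a risk cannot be marked as closed without first being assigned and resolved.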


Predictive Fault Maintenance

AI is gaining popularity in infrastructure O&M activities, especially for device fault prediction. Effective sample data combined with human experience can be used to quickly train a fault prediction model with high accuracy. Once device faults can be predicted, routine PMI and maintenance become more targeted O&M activities. Ever-improving prediction accuracy may eventually eliminate the need for routine manual O&M.

Figure 7 AI fault prediction on a power supply link (DCIM AI training platform: model import and training on sound, output waveform, and temperature rise curve data; audio and video collection through a collector and sound sensor; monitored objects: mains (harmonic), transformer (three-phase and iron core temperature, ambient temperature), and power distribution (three-phase current, voltage, power, load rate, load output waveform))

Figure 7 is a typical schematic diagram of AI fault prediction on a power supply link. An AI training platform is used to train a fault prediction model. The sample data required for training comes from a DCIM collection system and includes fault characteristic data, such as temperatures, voltages, currents, sounds, and images. Human experience or a tested rule can greatly reduce training difficulty and achieve a better prediction effect. For example, the relatively deterministic relationship between faults of electronic components (such as a capacitor) and temperature rises, shown in Figure 8, helps AI predict faults better.
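
The relationship between a temperature rise and an alarm threshold can be illustrated with a simple trend extrapolation. The sketch below uses an ordinary least-squares line in place of a trained AI model, so it shows only the idea of predicting a threshold crossing; the function names, sample temperatures, and the 60 degree threshold are assumptions.

```python
from typing import List, Optional, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("need samples at two or more distinct times")
    cov = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = cov / var_t
    return slope, mean_y - slope * mean_t


def hours_until_threshold(points: List[Tuple[float, float]], threshold: float) -> Optional[float]:
    """Hours after the last sample until the fitted trend reaches the threshold.

    Returns 0.0 if the trend is already at or above the threshold and
    None if the temperature is not rising.
    """
    slope, intercept = fit_line(points)
    last_t = points[-1][0]
    if slope * last_t + intercept >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - last_t


# Hypothetical capacitor temperature samples: (hours, degrees Celsius)
samples = [(0.0, 40.0), (1.0, 42.0), (2.0, 44.0), (3.0, 46.0)]
print(hours_until_threshold(samples, threshold=60.0))
```

A linear trend is the simplest possible stand-in. A trained model could combine temperature with the other fault characteristic data listed above, such as currents and sounds.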

Figure 8 Temperature rise over time, showing the prediction, the overload condition, and the alarm threshold (x axis: time; y axis: temperature)

About the author

Yuxue Xia
Yuxue Xia is currently responsible for architecture design and global marketing of Huawei data center energy DCIM+ products. His career began in the infrastructure and IT fields. He has devoted himself to software development, architecture design, planning, and marketing of facility monitoring management software for many years.

Xiaochun Huang
Xiaochun Huang is currently responsible for Huawei Enterprise Data Center (EDC) operations and has more than 10 years of experience in data center operation and maintenance management. He has worked on the operation and maintenance of Huawei's first-generation enterprise data center, its decommissioning and business relocation, and the construction, verification, and operation management of the second-generation enterprise data center.

HUAWEI TECHNOLOGIES CO., LTD.


Huawei Industrial Base
Bantian Longgang
Shenzhen 518129, P.R. China
Tel: +86-755-28780808

General Disclaimer
The information in this document may contain predictive statements including, without limitation,
statements regarding the future financial and operating results, future product portfolio, new
technology, etc. There are a number of factors that could cause actual results and developments
to differ materially from those expressed or implied in the predictive statements. Therefore, such
information is provided for reference purpose only and constitutes neither an offer nor an acceptance.
Huawei may change the information at any time without notice.
Copyright © Huawei Technologies Co., Ltd. 2019. All rights reserved.
