
CONFIDENTIAL

Metaswitch Products Virtual Infrastructure Requirements Guide

VC3-602 - Version 8.5 - Issue 2-1465

January 2023

A Microsoft Company

Notices
Copyright © 2023 Microsoft. All rights reserved.

This manual is issued on a controlled basis to a specific person on the understanding that no part of
the product code or documentation (including this manual) will be copied or distributed without prior
agreement in writing from Metaswitch Networks and Microsoft.

Metaswitch Networks and Microsoft reserve the right to, without notice, modify or revise all or part of
this document and/or change product features or specifications and shall not be responsible for any
loss, cost, or damage, including consequential damage, caused by reliance on these materials.

Metaswitch and the Metaswitch logo are trademarks of Metaswitch Networks. Other brands and
products referenced herein are the trademarks or registered trademarks of their respective holders.

Contents

1 Introduction.............................................................................................................5
1.1 About this document............................................................................................................. 5
1.2 Relevant product versions.....................................................................................................6
1.3 Document structure............................................................................................................... 8
2 Virtual infrastructure requirements....................................................................10
2.1 Contention........................................................................................................................... 10
2.1.1 Contention: Background........................................................................................ 10
2.1.2 Contention: Options............................................................................................... 11
2.1.3 Contention: Requirements..................................................................................... 11
2.2 Compute performance.........................................................................................................12
2.2.1 Processor Architecture.......................................................................................... 13
2.2.2 Maximizing compute performance on servers with multiple CPUs........................ 13
2.2.3 Power management...............................................................................................15
2.3 Networking...........................................................................................................................15
2.3.1 Accelerated data planes........................................................................................ 15
2.3.2 Latency.................................................................................................................. 17
2.3.3 External network access without NAT...................................................................17
2.3.4 IP address assignment/DHCP...............................................................................17
2.3.5 Traffic separation................................................................................................... 18
2.3.6 Bandwidth limiting and shared network resources................................................ 18
2.4 High availability................................................................................................................... 18
2.4.1 Protection against individual failures..................................................................... 19
2.4.2 Protection against virtual infrastructure failure...................................................... 21
2.5 Storage................................................................................................................................ 22
2.6 Other considerations........................................................................................................... 26
2.6.1 VM images.............................................................................................................26
2.6.2 Other software....................................................................................................... 27
2.6.3 Onboard agents..................................................................................................... 27
2.6.4 Time synchronization.............................................................................................27
3 Support for virtual infrastructure management and operational features...... 29
3.1 OpenStack features.............................................................................................................29
3.1.1 Supported OpenStack features............................................................................. 29
3.1.2 Incompatible OpenStack features......................................................................... 31
3.2 VMware features................................................................................................................. 33
3.2.1 Supported VMware features..................................................................................33
3.2.2 Incompatible VMware features.............................................................................. 36
4 Requirements in SLA form................................................................................. 39
4.1 Mandatory requirements......................................................................................................39
4.2 Additional requirements for optional features..................................... 40
4.3 Recommendations for performance.................................................................................... 41
4.4 Recommendation for service availability.............................................................................41
5 Environment-specific requirements................................................................... 42
5.1 OpenStack requirements.....................................................................................................42
5.1.1 OpenStack releases.............................................................................................. 42
5.1.2 Detailed OpenStack hints and tips........................................................................ 46
5.2 VMware requirements......................................................................................................... 51
5.2.1 VMware versions................................................................................................... 51
5.2.2 vSphere HA........................................................................................................... 56
5.2.3 Detailed VMware hints and tips............................................................................ 56
6 Designing for High Availability...........................................................................58
6.1 Telco vs. IT approaches to availability................................................................................58
6.2 Hidden failure modes.......................................................................................................... 59
6.3 Best practice for specific environments.............................................................................. 60

1 Introduction

1.1 About this document

A description of the purpose and scope of the Virtual Infrastructure Requirements Guide.

This document sets out the behavior and features that Metaswitch products require from the
underlying virtual platform when running as Virtual Network Functions (VNFs).

Metaswitch products can be deployed in both virtualization environments (such as VMware vSphere)
and full clouds (such as OpenStack). In both cases the requirements on the underlying virtual
platform are the same. This document refers throughout to requirements on the underlying "virtual
infrastructure", which should be read as applying both to virtualization environments and clouds.

This document does not provide general guidance on how to build a virtual infrastructure to meet
those requirements, though it does specify any optional features from OpenStack and VMware that
are needed to satisfy Metaswitch products' requirements. If you are deploying virtualized Metaswitch
products, this document will assist you in understanding what your virtual infrastructure must provide,
and, if you are not providing the virtual infrastructure yourself, will help you agree an SLA with your
virtual infrastructure provider. The guidance for virtual infrastructure providers applies equally whether
your cloud provider is a third party or a separate organization or team within your company.

This document does not set out the resources required for a complete deployment (number of virtual
cores, RAM etc.). This information is defined in the Metaswitch Products OpenStack Deployment
Design Guide or the Metaswitch Products VMware Deployment Design Guide.

This document does not provide detailed instructions for installing, orchestrating or commissioning
Metaswitch products. That information is covered in the relevant product documentation.

The Guide assumes that you are familiar with virtualization and cloud concepts and practices, and
with the Metaswitch product set.

Note:

In addition to providing individual products on VMware, Metaswitch offers a turnkey VMware
deployment, the vNOW NFV Starter Kit. If you are a vNOW customer, you do not need to design, or
recruit a third party to design, a virtual infrastructure, as your vNOW deployment has been designed
to meet the requirements specified in this document.

You may, however, wish to extend or upgrade your deployment to run additional products or
increase your capacity, as described in the article Extending and upgrading your vNOW VMware
system at https://communities.metaswitch.com. If you do this, you will need to expand your
system as necessary to provide the CPU, storage, and networking resources that your additional/
resized VMs require. Before extending or upgrading your deployment, you should be familiar with
the general principles surrounding performance and contention set out in Virtual infrastructure
requirements on page 10. You must also consult the Metaswitch Products VMware Deployment
Design Guide for detailed information on VM resource requirements.

VMware vSphere and vCloud


"VMware" is used generically throughout this document to refer to a virtual infrastructure provided by
VMware and abstracted from the underlying host hardware by the ESXi hypervisor. VMware provides
management tools for the infrastructure and VMs via vSphere, which provides an accessible set of
applications including a GUI and console to the user, and through which lifecycle operations such as
installation, recovery and resizing of VMs are performed.

Some customers may benefit from the additional functionality offered by the vCloud suite of products,
a VMware offering that adds an additional layer of cloud management tools on top of vSphere.
A subset of Metaswitch products supports deployment on vCloud, managed by the SIMPL VM
(manual deployment and lifecycle management is not supported on vCloud). See VMware versions in
Metaswitch Products VMware Deployment Design Guide for per-product details about vCloud support.

In general, customers without vCloud will interact with their deployments via the vSphere client;
customers using vCloud will use the vCloud Director client.

This document uses the following conventions when providing feature descriptions and advice for
VMware deployments:

• The generic term "VMware" refers to all VMware deployments, with or without vCloud.
• Since vCloud is deployed on top of vSphere, services labeled with "vSphere" (e.g. vSphere High
Availability) apply to all VMware deployments, with or without vCloud.
• Information identified as specific to vSphere clients does not apply to deployments with vCloud,
since user interaction with these deployments is via the vCloud Director client.
• Information identified as specific to "vCloud" or the vCloud Director client is for vCloud
deployments only.

1.2 Relevant product versions

This section specifies the product versions to which the Virtual Infrastructure Requirements Guide
applies.

This document applies to the product versions as set out in the table below.

Attention:

This manual does not apply to the Radisys MRF, a third-party component used in the Metaswitch
VoLTE Solutions. Information on the resource specifications for these VMs can be found in the
documentation for the Metaswitch VoLTE solution in which you are deploying it. For guidance
on the virtual infrastructure requirements for the Radisys MRF, please consult your Support
representative.


Applicable product versions

Product Version

BGCF VM (OpenStack only) V11.4.07+

Clearwater Core V8+

Distributed Admission Manager (DAM) V2.0+

Deployment Configuration Store (DCS) V1.0+

Distributed Capacity Manager (DCM) V2.4+

MetaSphere CFS (including RPAS, OBS, MRS)

Metaswitch AGC / MGC

MetaSphere EAS (including EAS pool server system from V9.2.10 and virtual EAS DSS
from V9.5.20)

MetaView Server

MetaView Director

ESA Proxy

Advanced Messaging Service (AMS) (OpenStack from V9.4)

MetaSphere N-Series V3.9+

Metaswitch CCF V5.0+

Metaswitch Deployment Manager V1.0+ (first available on VMware at V2.28.0)

MetaView Statistics Engine (MVSE) V3.0+

Mobile Voice Mail (MVM) V2.15.0+

Perimeta (ISC, SSC, MSC) V3.7+

QCall V1.0.0+

Rhino VoLTE TAS V2.6.0+

Rhino nodes (Mobile Control Point) Rhino MCP nodes - V1.0+; Rhino TSN and REM nodes - V4.0+

Rhino nodes (MaX UC) Rhino MAX nodes - V9.6.00+; Rhino MAG nodes - V3.0+

Secure Distribution Engine (SDE) V1.0+ (V1.0 only available on OpenStack)

Service Assurance Server V9.1+

ServiceIQ Monitoring (SIMon) V7.0+

Storage Cluster V1.0.0+

A number of different VMs are built on the Rhino platform and share certain infrastructural properties.
These nodes form the basis of the Rhino VoLTE TAS and the Mobile Control Point (MCP) and
are also used in the MaX UC solution. In this document, the node type is specified only where
requirements differ between types of Rhino node; for requirements common to all VMs built on Rhino
the umbrella term "Rhino nodes" is used.

The following Rhino node types are covered in this document:

• MMT
• SMO
• MAG
• MAX
• TSN
• MCP
• REM.

This table indicates only the versions of a given product for which the guidance in this document
is valid. It does not provide information about version compatibility between different Metaswitch
products. Please see individual product guidance or speak to your Metaswitch Support representative
for details.

1.3 Document structure

A description of the high-level structure of the Virtual Infrastructure Requirements Guide.


Virtual infrastructure requirements on page 10 introduces the general requirements on virtual
infrastructures and the reasons underpinning those requirements.

Support for virtual infrastructure management and operational features on page 29 provides
guidance on the compatibility of specific OpenStack and VMware features with Metaswitch products.

Requirements in SLA form on page 39 condenses the preceding information into an SLA checklist.

Environment-specific requirements on page 42 covers requirements that are specific to OpenStack
and VMware, together with some hints and tips on how to meet the requirements set out in this
document.


2 Virtual infrastructure requirements

This section provides an overview of the requirements of a virtual infrastructure running Metaswitch
products.

All virtual infrastructures provide basic compute, network and storage resources suitable for running
a wide range of applications. A carrier-grade voice service is a particularly demanding application,
having to deliver high-scale real-time media with high quality audio and very high availability. As
such, it places more demands on the underlying infrastructure than, say, a typical web service. At a
high level, it requires the following from the underlying virtual infrastructure in addition to the normal
demands required by all types of application.

• The ability to specify uncontended compute, RAM and disk I/O - otherwise audio quality may suffer
or services such as real-time analytics may be unable to keep up with traffic loads.
• A network layer capable of handling large numbers of small audio packets, together with control
over CPU placement - otherwise media scalability is poor.
• Features to support application-level high availability mechanisms - otherwise media streams
cannot be maintained over failures and single node failures can take down an unacceptably large
proportion of capacity.

Standard virtualization platforms are perfectly capable of delivering all of these requirements; to
successfully run a voice service you need to ensure that your chosen platform does so.

The remainder of this section describes each of the above and gives more detailed requirements.
Not all Metaswitch products require all of these features; see the following sections for further
information, and the Metaswitch Products VMware Deployment Design Guide or the Metaswitch
Products OpenStack Deployment Design Guide for detailed per-product requirements.

2.1 Contention

2.1.1 Contention: Background

This section explains the use of contended resources in virtual infrastructures, and the need for
uncontended resources in a virtual infrastructure deployed for telecommunications services.

Virtual infrastructures can allocate multiple VMs to a host such that their aggregate resource demands
exceed the resources actually available on the host - in other words, the VMs have to contend for the
host’s resources.

It is typical for resources to be contended to some degree in virtual infrastructures; this follows
naturally from multiple tenants sharing the same infrastructure. Indeed, for IT applications with very
bursty workloads, such overcommitment is one of the key benefits of a virtual infrastructure. However,
real-time applications such as telecommunications are unsuited to the use of contended resources for
several reasons.


• The nature of the real-time services delivered by our software is sensitive to resources being
monopolized by other applications, even for brief periods. A few tens of milliseconds of delay
on an RTP packet is enough for it to be dropped from the media stream. If enough packets are
dropped, audio quality noticeably suffers. While call signaling flows are less sensitive, delays of a
few hundred milliseconds typically trigger message retransmissions, resulting in unnecessary and
unexpected additional load on the network.
• Contention works well when different applications are busy at different times, i.e. when their load
is uncorrelated. However, bursts of traffic in telephony networks often affect multiple network
elements at the same time, and if they are all being run on the same virtual infrastructure they will
all be busy at the same time. Therefore headroom to accommodate traffic spikes is best provided
at the scope of individual network elements, rather than being pooled.
• When a product has guaranteed resources, it is able to anticipate congestion, and throttle a
proportion of incoming load in advance of becoming completely congested. This allows for more
graceful handling of congestion within the network.
• Resource contention makes troubleshooting of performance problems substantially more difficult,
as it introduces a much wider range of possible causes.

2.1.2 Contention: Options

This section describes the approaches typically taken to resource contention in virtual infrastructures.

There are three different approaches that can be taken for any given resource.

• The resource can be dedicated - a quantity of a hardware resource is reserved for a particular VM
and guaranteed by the virtual infrastructure to be available for its sole use.
• The resource can be pooled but uncontended - a hardware resource is available for use by
multiple VMs, but there is sufficient for all VMs to meet their requirements simultaneously. This
is different from dedicated, as the user or orchestrator is responsible for provisioning enough
resource for the sum of the maximum each VM might need - the virtual infrastructure is not asked
to guarantee it.
• The resource can be pooled and contended - a hardware resource is available for use
by multiple VMs, but there is insufficient for all VMs to meet their maximum requirements
simultaneously.

Of these approaches, only the first - dedicated resources - requires explicit functional support from
the virtual infrastructure itself. The second does not require platform support, but does require a policy
decision by the virtual infrastructure provider not to run over-committed.

2.1.3 Contention: Requirements

This section describes the contention requirements for Metaswitch products.

• All products require dedicated RAM to operate correctly. This is because demand for RAM does
not reliably drop away in proportion (for example) to call load at quiet times, so the danger of
contention is high. The impact of such contention may be severe (including service outages).


• All products require dedicated disk space to operate correctly. Again, this is because the risk
posed by pooled usage is unreasonably high. It is hard to guarantee avoiding contention, and
if resources become contended service outages may occur and be difficult to recover from
(alongside a lack of e.g. billing and diagnostic information).
• SAS requires dedicated disk I/O because of its requirement for sustained high I/O rates for storing
events. Similarly EAS requires dedicated disk I/O because of the volume of message data it is
reading and writing on demand.
• For all other cases (e.g. CPU), we strongly recommend that resources are at least pooled but
uncontended, and dedicated if possible. Sufficient resources should exist on the host so that those
needed by each VM are not contended with any other VMs on the host.

That is, Metaswitch products require the virtual infrastructure to support dedicated RAM, disk space
and disk I/O (the last of these being required only by SAS and EAS), and we strongly recommend that
the virtual infrastructure should support VMs with uncontended CPU and network bandwidth.
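As an illustrative sketch only (the authoritative per-product settings are given in the Deployment
Design Guides, and exact property support varies by OpenStack release; the flavor name below is a
placeholder), dedicated CPU and RAM are typically requested on OpenStack via flavor extra specs:

    # Pin vCPUs to dedicated host cores, back guest RAM with huge pages
    # (so it cannot be overcommitted) and confine the VM to one NUMA node
    openstack flavor set example-vnf-flavor \
      --property hw:cpu_policy=dedicated \
      --property hw:mem_page_size=large \
      --property hw:numa_nodes=1

On VMware, the broad equivalent is to configure full CPU and memory reservations on the VM, or to
follow the "pooled but uncontended" option by not overcommitting the hosts at all.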

Note also that contention does not arise just between VMs competing for the same resources -
the virtual infrastructure itself also consumes some resources on hosts. Our requirement for some
resources to be dedicated and our strong recommendation for others to be uncontended applies to
both contention from other VMs and contention from the virtual infrastructure itself.

In the event that you do choose, or are obliged by the cloud provider, to use contended resources
between VMs where permitted by the rules above, the following considerations are important.

• Service levels (response times, etc.) delivered by products using contended resources can
become unpredictable, as they may be impacted by activity on other VMs. Products handling real-
time media and signaling are particularly sensitive in this respect.
• For products with 1+1 fault tolerance, the standby instance could become active at any time, and
therefore should not be contended in any way you would not be happy for the active instance to be
contended.
• Products will be unable to anticipate resource exhaustion, resulting in less smooth management of
congestion in the network.
• Some products respond dynamically to resource contention by throttling load, leading to lower
resource utilization. These systems are prone to being completely overrun by applications that
back off less gracefully. Even if you are unwilling or unable to ensure dedicated or uncontended
access for all resources required by a VM, it is still valuable to require that at least some resources
are available to it, to prevent it being completely starved of resources by other applications.
• If you encounter performance issues in a contended resource deployment then Metaswitch
Support will ask you to use hypervisor diagnostics to identify the root cause. The solution or
troubleshooting process may require that you assign uncontended resources.

2.2 Compute performance

This section explains the demands placed on cloud compute infrastructure by time-sensitive network
functions such as telecommunications.


Compared with web applications, time-sensitive network functions such as call control and user plane
packet handling place stricter demands on the compute infrastructure in order to achieve the expected
level of service, in terms of both scale and audio quality. Therefore, it is important that your virtual
infrastructure can provide the features described in the following sections.

2.2.1 Processor Architecture

This section outlines the host processor architecture supported by Metaswitch products.

Metaswitch products are supported on 64-bit Intel processors using the 64-bit version of the x86
instruction set (commonly referred to as x86-64, x86_64 or x64). Metaswitch products are not
supported on 32-bit processors or on any AMD processors.

We recommend that you use recent Intel server processors (Xeon or equivalent) from no earlier than
the Sandy Bridge family (released in 2011) when deploying Metaswitch products.
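If you want to confirm what a candidate host provides, a quick illustrative check from a Linux shell on
the host is:

    # Expect Architecture x86_64, Vendor ID GenuineIntel, and a Xeon model
    # from the Sandy Bridge family (2011) or later
    lscpu | grep -E 'Architecture|Vendor ID|Model name'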

2.2.2 Maximizing compute performance on servers with multiple CPUs

This section explains how to optimize CPU placement for Metaswitch products on servers with
multiple CPUs.

Virtual infrastructure hosts often have two (or more) CPUs, and as a result, employ Non-Uniform
Memory Access (NUMA).

NUMA means that memory, I/O devices and CPUs are grouped into NUMA nodes. Memory and I/O
access is faster for CPU cores in the same NUMA node, but slower for CPU cores in different NUMA
nodes.

The capacity figures quoted in the Metaswitch Products VMware Deployment Design Guide and
Metaswitch Products OpenStack Deployment Design Guide assume that you will observe the
following two principles when placing VMs on multi-CPU servers.

• If possible, the entire VM should be allocated to a single physical CPU rather than spread over
multiple CPUs.
• Where a VM fits on a single CPU, it should be assigned to the same NUMA node as the Ethernet
PCI I/O devices that it is using.

If these optimal choices cannot be made then the Metaswitch products will run but will likely not meet
the quoted capacity figures. This is particularly important for those products handling media at scale,
notably Perimeta.

Attention:

Rhino nodes make intensive use of memory and I/O devices. You must observe the
principles given above to ensure that Rhino nodes perform as expected and meet the quoted
capacity figures.

The two principles of optimal VM placement are discussed further below.


Allocating a VM to a single host CPU

Confining a VM to run on a single CPU normally produces better performance than splitting it across
multiple CPUs because of the NUMA effects discussed earlier.

Virtual infrastructures differ on whether they will by default allocate an entire VM to a single CPU, or
whether that requires some specific feature to be invoked when the VM is created or the hosts to be
configured in a specific way.

In order to tell if a VM fits on a single CPU it is necessary to understand the concepts of logical
processors and vCPUs.

• A physical CPU presents one or more logical processors to the hypervisor. In a multi-core CPU
with Intel hyper-threading enabled, there are two logical processors per core.
• The compute requirements of a Metaswitch VM are expressed as a number of virtual CPUs
(vCPUs).

A VM is said to fit on a single CPU if its number of vCPUs is less than or equal to the number of
logical processors presented by that CPU. For example: an Intel Xeon E5-2650 v2 CPU has 8 cores
and supports hyper-threading. This means that it presents 16 logical processors.
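As an illustrative check (assuming shell access to a Linux-based hypervisor host; the interpretation
comments reflect the E5-2650 v2 example above):

    lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
    #   Socket(s):            2
    #   Core(s) per socket:   8
    #   Thread(s) per core:   2
    # Logical processors per CPU = 8 cores x 2 threads = 16, so a VM of up
    # to 16 vCPUs can fit on a single CPU (before allowing for any logical
    # processors consumed by the hypervisor itself).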

Note:

Virtual infrastructure tasks, including running the hypervisor and virtual networking, consume
processor resources, reducing the capacity available for VMs. For example, if 2 logical processors
of a Xeon E5-2650 v2 CPU are required by the hypervisor, only 14 logical processors remain free,
which means a 16-vCPU VM can no longer fit on a single CPU.

VMware and OpenStack implement the reservation of resources for infrastructure overhead in
different ways:

- OpenStack allows you to explicitly reserve processor cores for infrastructure tasks so that they
are not made available to VMs. Failure to reserve the necessary CPU bandwidth may lead to
impaired VM performance as a result of resource contention between host and guests.

- On VMware, it is not possible to reserve CPUs for infrastructure tasks, so you must not use up
all cores on the physical host when allocating your VMs to hosts. If you do not leave sufficient
processor resource free for the infrastructure to operate, VMs may fail to instantiate or suffer
impaired performance.

For further details, see OpenStack overhead in the Metaswitch Products OpenStack Deployment
Design Guide or VMware overhead in the Metaswitch Products VMware Deployment Design
Guide.
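As a hedged illustration of the OpenStack side of this (option names and sections vary by OpenStack
release, and the values below are placeholders rather than recommendations), the reservation is
typically made in nova.conf on each compute node:

    [DEFAULT]
    # Memory held back for the host and hypervisor
    reserved_host_memory_mb = 4096

    [compute]
    # Host cores offered to pinned (dedicated-CPU) guests; cores omitted
    # from this set remain available for host/infrastructure tasks
    cpu_dedicated_set = 2-15,18-31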

For the purposes of profiling and quoting capacity of Metaswitch VMs, we assume that

• any VM requiring more than 16 vCPUs may be split across multiple CPUs
• any VM requiring 16 vCPUs or fewer is allocated to a single CPU.


NUMA I/O affinity

On a multi-CPU server, each hardware Ethernet controller belongs to a single NUMA node. Where a
VM does a lot of network I/O, its performance is significantly improved if it can be constrained to run
on the same NUMA node as all of the Ethernet devices that it is using. This constraint is known as I/O
affinity.

Running VMs with I/O affinity makes the task of placing a VM more complex: it is no longer just a
matter of finding a spare CPU, but now also involves finding a spare CPU with associated Ethernet
devices that serve the right networks.

Where VMs are manually placed and instantiated, the task of achieving I/O affinity is a problem
for a human operator. In a cloud environment, it is a problem for the cloud VM scheduler (e.g. the
OpenStack Nova component). If it is not possible to achieve I/O affinity, then Metaswitch VMs will
operate correctly, but with suboptimal networking performance.

2.2.3 Power management

This section outlines the need to restrict power management features so that they do not impair the
performance of Metaswitch products.

The host hardware and operating system should be configured to provide high and consistent
performance to the guest. Some virtual infrastructures support power management, where hosts can
automatically reduce the clock rate or even turn off entire cores in order to save power. Any reduction
in the clock rate or number of cores in use will affect the performance of the guest; in some cases, the
impact can be considerable.

Power management must not be so aggressive that timely guest scheduling is impacted, as this would
impair performance and capacity and could lead to degraded calls.

If you are running high load through VMs running on only a small number of cores, even conservative
power management policies can cause the host CPU to decide that it is generally idle, and enable
power-saving mode. This can have a significant impact on the performance provided by your
virtualization infrastructure. This is most often relevant when performance testing small deployments
prior to building out a full deployment. You may find your measured performance is lower than
expected unless you run separate load on the same virtualization infrastructure to keep the CPU from
entering power-saving mode, or unless you disable power management entirely during your testing.

2.3 Networking

2.3.1 Accelerated data planes

This section outlines the mechanisms available for optimizing cloud performance for very high traffic
of concurrent data packets.


Virtual hosts normally run a virtual switch (vSwitch) in order to share physical network interfaces
between multiple VMs. There are multiple different types and versions of vSwitch, and these vary
hugely in their ability to process VoIP workloads.

VoIP network usage is characterized by a high rate of small RTP packets. While the total network
throughput may be similar to a web service, the total number of data packets to be handled by the
network stack may be orders of magnitude higher. This is particularly relevant for virtual functions
implementing media handling functions such as session border control, firewalling and deep packet
inspection. Virtual functions just handling control plane functions such as SIP signaling will typically
not have such intensive demands.

Traditional vSwitches have not been well optimized for this traffic profile. With some vSwitches, it is
not possible to reach 1000 media sessions on a server without packet drops, even though the same
hardware could handle more than 50,000 sessions without packet drops in a bare metal deployment.

In recognition of this issue, many virtual infrastructures have added specific support for high scale
networking, often offering performance nearly equal to bare metal, but requiring explicit support from
the VM. Collectively these are often referred to as "accelerated data planes." There are three common
approaches.

• SR-IOV, where the VM is granted direct access to a virtual "slice" of the physical NIC, completely
bypassing the virtual infrastructure's network layer. This gives very high performance but at the
expense of interfering with some virtual infrastructure management operations - for example, it
may break the ability to move a running VM from one host to another without interruption, or the
ability to enforce security policy within the on-host vSwitch.
• PCI passthrough, where the VM is granted full access to and full control of the physical NIC. This
has similar advantages and disadvantages to SR-IOV. It can be used in some deployments where
SR-IOV is not possible, but it means that the physical NIC cannot be shared between multiple
VMs.
• An accelerated vSwitch, where the virtual infrastructure's switching layer is enhanced with
techniques such as Intel's Data-Plane Development Kit (DPDK) to offer accelerated passing of
packet data from the physical NIC to the VM.

To achieve optimal data plane performance for Perimeta, Rhino nodes, EAS (OpenStack only),
Storage Cluster (OpenStack only), and the Secure Distribution Engine (SDE), one of these three
approaches is required. If none of them is available, performance of Perimeta, Secure Distribution
Engine, EAS (OpenStack only), Storage Cluster (OpenStack only) and Rhino nodes will be limited by
the vSwitch, sometimes very severely.
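As an illustrative OpenStack sketch of the SR-IOV approach (network, port and instance names are
placeholders, the compute hosts must already be configured for SR-IOV, and the Deployment Design
Guides give the authoritative per-product instructions):

    # Create a port backed by an SR-IOV virtual function rather than the vSwitch
    openstack port create --network media-net --vnic-type direct perimeta-media-port
    # Attach that port when creating the VM
    openstack server create --flavor <FLAVOR> --image <IMAGE> \
      --nic port-id=<PORT_UUID> perimeta-instance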

For more information, see:

• Specific high-performance requirements for Perimeta, Rhino nodes and the Secure Distribution
Engine in Metaswitch Products OpenStack Deployment Design Guide
• Specific high-performance requirements for Perimeta, Rhino nodes and the Secure Distribution
Engine in Metaswitch Products VMware Deployment Design Guide.


2.3.2 Latency

This section describes the networking latency requirements of Metaswitch products.

Latency on links between Metaswitch products

Requirements on latency of links between products are the same in a virtual environment as in a bare
metal deployment. See the applicable product documentation for specific requirements.

The important additional factor to remember in virtualized environments is that the latency arises
not only from physical links and network hubs/switches/routers between host servers but also from
elements not present in the physical environment, such as the hypervisor and vSwitch, which
can add a few milliseconds to overall latency as perceived by guests. Additionally, where the virtual
infrastructure is providing storage accessed over a network, latency in accessing it must also be taken
into account.

Latency on links between VM instances in a 1+1 active/standby pair

For certain products, there is a limit on the latency on links between two VM instances if they are
deployed as a 1+1 active/standby pair. These are as follows.

• Perimeta - maximum round trip of 15ms.
• CFS or AGC/MGC - maximum round trip of 50ms.
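A simple illustrative check of the round-trip time between the two members of a pair (the address
below is a placeholder) is to run a sustained ping from one instance to the other and compare the
reported average against the limits above:

    # 100 probes at 200ms intervals; "rtt min/avg/max" must stay within the
    # product's limit (e.g. 15ms round trip for a Perimeta pair)
    ping -c 100 -i 0.2 10.0.1.20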

2.3.3 External network access without NAT

This section outlines Metaswitch products' need for direct access to external networks.

Any Metaswitch solution has component products that require access to external networks to provide
service. Further, these products use their assigned IP addresses in protocol exchanges with clients
and peers in those external networks, for example in SIP SDP negotiations.

It is not sufficient for the virtual infrastructure to provide only private virtual networks where packets
must pass through a network address translation (NAT) function to access external networks:
the infrastructure must allow for direct connectivity to external networks without intervening NAT.
However, it is acceptable for NATs to be in place between remote devices and their access networks,
so long as the IP addresses assigned to the Metaswitch products are directly reachable from those
remote devices.

2.3.4 IP address assignment/DHCP

This section outlines Metaswitch products' support for DHCP.

Some Metaswitch products require DHCP to receive initial management IP address assignments, and
others can use it.

It is important that any product instance receiving any IP address via DHCP and continuing to use it
once in service continues to receive the same IP address in future, even across reboots, etc. This is
because other product VM instances may be configured with knowledge of that IP address in order to
enable inter-product communication.
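One illustrative way to satisfy this on OpenStack (the network, subnet, address and port names below
are placeholders) is to create the port with a fixed IP address, so that the DHCP service always offers
the same address to that instance:

    openstack port create --network mgmt-net \
      --fixed-ip subnet=mgmt-subnet,ip-address=10.0.0.20 cfs-mgmt-port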

2.3.5 Traffic separation

This section outlines Metaswitch products' needs regarding traffic separation.

In a virtualization environment there are sets of networks for both the virtualization environment
itself (e.g. management, storage) and the applications running in that environment. For example,
Metaswitch products use separate management, HA, signaling and media networks, where
the Metaswitch management network is not the same as the virtualization environment’s own
management network.

Each virtualization environment has its own recommended best practice for how its networks
should be physically separated. You should follow these recommendations. Typically, these
recommendations are that management and storage should be separated onto distinct physical
networks, with the applications’ (virtual) networks on further separate physical networks.

Metaswitch products require that the distinct application networks appear as different virtual networks
(i.e. that they are exposed over different vNICs). They do not in general require that they are
physically separated (for exceptions, see External network access without NAT on page 17).
However, some customers may mandate physical separation of application networks for their
deployments, for example to separate untrusted public internet traffic from trusted internal traffic. In
that case, the virtualization environment must support multiple physically separated network domains
on which virtual networks can be configured.
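As an illustrative OpenStack sketch of exposing an application network as its own virtual network (and
hence its own vNIC), optionally on a physically separate network domain; all names and the VLAN ID
below are placeholders:

    openstack network create --provider-network-type vlan \
      --provider-physical-network physnet-sig --provider-segment 201 sig-net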

2.3.6 Bandwidth limiting and shared network resources

This section describes the network protection requirements relating to bandwidth limitation and shared
network resources of Metaswitch products.

Some Metaswitch products are typically connected to untrusted networks such as the public internet.
These networks can be subject to DDoS attacks, traffic overload and similar conditions. This must not
have an impact on shared resources within the virtual infrastructure such as vSwitches/vRouters or on
trusted network traffic.

In addition, some Metaswitch products require traffic on untrusted networks to be firewalled and rate
limited, to avoid adversely affecting VNF operation. Addressing these issues typically requires
firewalling on untrusted networks, either within the virtual infrastructure or upstream of it.
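Where rate limiting is applied within the virtual infrastructure rather than upstream, an illustrative
OpenStack sketch (this requires the Neutron QoS extension; the names and limits below are
placeholders, not recommendations) is:

    # Define a bandwidth-limit policy and apply it to an untrusted-facing port
    openstack network qos policy create untrusted-limit
    openstack network qos rule create --type bandwidth-limit \
      --max-kbps 500000 --max-burst-kbits 50000 --egress untrusted-limit
    openstack port set --qos-policy untrusted-limit <PORT_UUID>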

2.4 High availability

This section describes the high-availability mechanisms used by Metaswitch products, and the
demands these mechanisms place on the underlying virtual infrastructure.


Metaswitch products are designed to offer Telco-grade availability, meaning 99.999% ("five nines")
availability of service (handling new call requests etc.) and maintaining calls over hardware and
software failures. This places requirements on the underlying virtual infrastructure platform.

• To keep calls up in the event of a VM or host failing requires the ability to quickly move an IP
address from one VM to another - that follows from the nature of VoIP protocols.
• To provide resilience against server failure, it must be possible to influence the allocation of VMs
to avoid multiple VMs that are supposed to provide redundancy protection for each other being
instantiated on the same host.
• To recover the system automatically to full redundancy after a failure, it must be possible to
automatically restart failed instances.

Those requirements are discussed in more detail in Protection against individual failures on page
19 and Protection against virtual infrastructure failure on page 21. Together, they are sufficient
to protect against individual hardware or software failures. However, they may not be sufficient to
protect against failure of the entire infrastructure. Whether or not that is required is a more complex
question, discussed separately below.

2.4.1 Protection against individual failures

This section describes the redundancy mechanisms used by Metaswitch products to guard against
individual instance failure.

This category of protection, usually known as "local equipment protection", protects against individual
hardware or software failures, such as a server failing or an application instance crashing. The goal is
to achieve continued availability of the service (i.e. all new requests succeed) and, where applicable,
for any calls in progress to be maintained.

Some Metaswitch products implement this level of redundancy by having two VMs running at all
times, one of which acts as the primary instance and one as the standby. Product-level interactions
between the two instances, not virtual infrastructure capabilities, determine which of them is the active
and which is the standby at any instant in time.

In the event of the failure of either the active VM or the host on which the active VM is running, the
product maintains service by switching to the standby VM.

Other Metaswitch products implement redundancy by deploying as an N+K pool - a pool of identical
VMs, each one capable of handling any request for any subscriber. N is the number of VMs required
to handle peak load; K is the number of additional VMs required to provide redundancy. K will be
greater than 1 for large N as the chance of simultaneous failures increases. During normal operation
all N+K VMs will be running.

Note:

If Rhino nodes or Storage Cluster VMs are deployed as a clustered solution, the value of N+K must
be at least 3.


The virtual infrastructure must provide the following features for these mechanisms to function
properly.

Fast move of IP addresses

Switching service from an active to a standby requires that the standby take over the virtual IP
address formerly used by the active. To do this, Metaswitch products use a conventional L2 failover
scheme, where the standby uses a gratuitous ARP broadcast (or IPv6 equivalent) to advertise
ownership of the address.

This scheme places two requirements on the virtual infrastructure.

• It must support gratuitous ARP, meaning that ARP broadcasts are permitted and reach all hosts
and routers on the same (virtual) L2 subnet.
• It must be possible to reserve an IP address to use as this "virtual" IP address, such that it will not
be allocated to any other resource.
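An illustrative way to verify that the virtual network propagates gratuitous ARP (the interface name and
address below are placeholders; this uses the standard iputils arping and tcpdump tools rather than
any Metaswitch mechanism) is:

    # On a test VM that owns the virtual IP, broadcast a gratuitous ARP for it
    arping -c 3 -U -I eth1 192.0.2.10
    # On another VM or router in the same subnet, confirm the broadcast arrives
    tcpdump -ni eth1 arp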

Anti-affinity

VMs must be distributed across distinct hosts in such a way that service is not lost or capacity
reduced by the failure of a single host.

• Products using active/standby architectures must have the active and standby running on separate
hosts.
• Products using an N+K pool architecture must not have more than K instances running on any
given host.

Virtual infrastructures support this through anti-affinity - a mechanism whereby it is possible to specify
at creation time that a VM should or should not be run on the same host as another VM. Metaswitch
products require that the virtual infrastructure supports anti-affinity. This could either be through
manual control over placement or by tagging the relevant VMs in some manner.

Note:

If the virtual infrastructure supports automatic healing, that healing function must respect anti-
affinity.
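As an illustrative OpenStack sketch of tagging VMs for anti-affinity (the names are placeholders; on
VMware the equivalent is a DRS "separate virtual machines" anti-affinity rule, or manual placement):

    # Create an anti-affinity server group and launch each member of the
    # active/standby pair into it so they land on different hosts
    openstack server group create --policy anti-affinity perimeta-ha-pair
    openstack server create --flavor <FLAVOR> --image <IMAGE> \
      --hint group=<SERVER_GROUP_UUID> perimeta-a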

Auto-restart

The application redundancy mechanisms described above ensure service availability over failures.
However, the system will be running non-redundantly until a failed VM is restored. In a fully
orchestrated system this role may be performed by a combination of system monitoring and the
orchestrator. In the absence of full orchestration this must normally be done manually by Ops
staff responding to alarms, although some cases (a software failure in the VM, rather than host or
infrastructure failure) can be automated via a "watchdog" function.

There are two levels of watchdog:

• those provided by the virtual infrastructure
• those running as software within the VM.


VM software watchdogs can recover application level failures and some but not all OS failures.
Watchdogs provided by the virtual infrastructure should be able to recover all VM software failures,
independent of whether they are in the OS or application.

Metaswitch products do not require watchdogs to be provided by the virtual infrastructure, but will
use such watchdogs if they are available. If watchdogs are not provided by the virtual infrastructure,
Metaswitch products use their own software watchdog.

2.4.2 Protection against virtual infrastructure failure

This section describes the redundancy mechanisms through which Metaswitch products guard against
virtual infrastructure failure.

A service deployed in a single virtual infrastructure cannot be more available than the virtual
infrastructure itself. The availability a single virtual infrastructure instance can achieve and whether
that is sufficient will vary depending on how the virtual infrastructure has been built and what your
target for service availability is. Your virtual infrastructure provider should be able to specify availability
figures for both individual VMs and the virtual infrastructure as a whole.

Typically the virtual infrastructure as a whole (meaning each of the compute, network and storage
services it provides) must have 99.9999% ("six nines") availability and individual VMs must have
99.9% ("three nines") availability to host a 99.999% ("five nines") reliable service, which is the norm in
telecommunications applications.
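As a simplified, illustrative calculation only: 99.9999% infrastructure availability corresponds to roughly
32 seconds of outage per year, and 99.9% per-VM availability to roughly 8.8 hours. For a 1+1 pair of
independent 99.9% VMs, the probability of both being down at once is approximately 0.001 x 0.001 =
10^-6, so the overall service unavailability is of the order of 10^-6 (infrastructure) + 10^-6 (both VMs) =
2x10^-6, comfortably inside the 10^-5 budget of a five nines service. This ignores failover times and
correlated failures, so it is a plausibility check rather than a substitute for a proper availability model.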

In those cases where a single virtual infrastructure instance cannot provide the required level of
availability, then clearly the service must be deployed across multiple virtual infrastructure instances.
These infrastructures must be independent and not coupled, i.e. a failure in one must not result
in the failure of another. Two availability zones or regions or similar such concepts within a single
infrastructure instance do not count as independent. Critically, it must be possible to upgrade one
virtual infrastructure instance with absolutely no impact on any other instance.

There are two possible deployment models here, and Metaswitch products support both. They differ in
their ability to keep calls up over the failure of an entire virtual infrastructure and their requirement for
the fast move of IP addresses between virtual infrastructures.

• The first model delivers both availability of service and maintenance of existing calls even after
a complete virtual infrastructure failure, but requires the fast move of IP addresses (as defined
above, including support for the L2 gratuitous ARP mechanism) to be supported between
virtual infrastructures. In this model the Metaswitch products are spread across multiple virtual
infrastructures. Active and standby instances for a given product are deployed in separate virtual
infrastructures; similarly, the VM instances making up a pool for a given product are spread across
virtual infrastructures. Anti-affinity is therefore achieved both through individual virtual infrastructure
anti-affinity mechanisms and as a natural consequence of the spreading.
• The second model delivers availability of service but does not maintain calls over a complete
virtual infrastructure failure. As with the first model the products are spread across virtual
infrastructures, but fast move of IP addresses between virtual infrastructures is not required.


The trade-off here is the ability to maintain calls over virtual infrastructure failure vs. the practicality
and cost of supporting fast move of IP addresses between virtual infrastructures.

You must therefore ensure that your infrastructure meets the following requirements:

• Unless a single instance of your virtual infrastructure can deliver a level of availability high enough
to host a service reaching your target availability, you must provide multiple virtual infrastructure
instances.
• If you are deploying across multiple virtual infrastructures and you require calls to be maintained
over virtual infrastructure failure, your virtual infrastructure must support the fast move of IP
addresses between virtual infrastructures by gratuitous ARP, as described in Protection against
individual failures on page 19.

Products with quorate N+K pools

The following products have N+K VM pools where service is provided via quorum operations. If you
are deploying a site which contains multiple failure zones, you must deploy the N+K VM pools across
three or more failure zones (see Hidden failure modes on page 59 for a detailed explanation) to
provide high availability - do not just deploy the N+K VM pools across two failure zones, otherwise
failure of a single zone can take down the entire site.

• MVM, for its OAM VM pool. Alternatively, each MVM Evolved Service Center (ESC) can be treated
as a single failable entity and its VMs deployed wholly within a single failure zone.
• Deployment Configuration Store (DCS).
• Metaswitch Deployment Manager.

2.5 Storage

This section describes Metaswitch products' requirements for virtual storage.

General requirements for Metaswitch products


Metaswitch products other than the Object Backup Store (OBS), Service Assurance Server (SAS),
Storage Cluster, and Rhino TSN do not place any particular requirements on virtual storage.

For products other than OBS, SAS, EAS pool server systems and Rhino TSN, the storage
mechanisms provided natively by virtual infrastructures are generally sufficient:

• Virtual infrastructures all provide block devices to VMs - that is, virtual devices which appear to
the VM's OS like an ordinary disk, and which can be mounted, partitioned and formatted.
• Virtual infrastructures may distinguish between two types of block device:

• Ephemeral block devices, whose lifecycle is tied to the VM, which cannot be detached and
reattached to a different VM, and which lose all data when the VM is destroyed
• Persistent block devices, also often known as volume storage, which exist independently of
any single VM. These devices can be attached, detached and reattached between VMs at will,
and retain all their data even if a VM to which they are attached is destroyed.


• All Metaswitch products require block storage. Many require persistent storage as part of their high
availability implementation; some do not and can use ephemeral storage only.
• Metaswitch products generally only require modest sized block devices (<1TB) with expected
performance no greater than you would get from a single physical HDD.
• In addition to block storage devices, virtual infrastructures may provide object (also called key-
value) storage services; Metaswitch products use only block devices.

The demands of SAS and OBS (described in OBS and SAS storage requirements on page 23)
are more exacting, and meeting them entails a working understanding of how virtual infrastructures
provide block devices, as described in Ephemeral and Persistent storage types on page 23.

EAS pool server systems have the following requirements.

• EAS pool server systems deployed with NAS devices as a data store do not place any additional
storage demands directly on the virtual infrastructure, but do require the provision of separate
storage accessed by the guest VMs via NFS.
• EAS pool server systems deployed with the Storage Cluster as a data store place additional
demands directly on the virtual infrastructure, as the Storage Cluster is deployed as a set of VMs
on that same infrastructure.

These additional requirements are described in Storage for EAS pool server systems on page 24.

Rhino TAS Storage Nodes (TSNs) provide databases for use by the other node types in Rhino
VoLTE TAS and Mobile Control Point deployments. Rhino TSNs have additional requirements, which
are described in Storage Cluster and Rhino TSNs on page 25.

Some storage systems offer real-time data reduction, such as deduplication. Our stated resource
requirements assume no deduplication, and we would expect the gains from deduplication to be
marginal. If the storage systems meet the stated requirements for capacity, IOPS and latency, it is
nevertheless acceptable for these systems to perform deduplication.

Ephemeral and Persistent storage types


If your virtual infrastructure distinguishes between ephemeral and persistent block storage, you need
to be aware of how each storage type is provided:

• Ephemeral storage is often (but not exclusively) provided from local host disks, or storage arrays
directly attached to the host. Such storage is increasingly being implemented using SSDs as
this enables a single host to support many VMs, each of which needs a small root disk with
performance similar to an HDD, but this is not universal.
• Persistent storage is always accessed over the network, but could be provided by a wide range
of sources such as dedicated SAN or NAS devices or software-defined-storage (such as ceph
or VMware vSAN) using underlying "just a bunch of disks" (JBOD) hardware, using a variety of
redundancy techniques.

OBS and SAS storage requirements


By default, both OBS and SAS use generic persistent storage. However, given the variety of possible
implementations of both ephemeral and persistent storage, and the wide range of performance and
capacity observed in real-world virtual infrastructures, Metaswitch expects that there will need to be
a discussion with each virtual infrastructure provider about the optimal mapping of OBS and SAS to
their environment. We do, however, offer the following general guidance.

OBS

OBS is responsible for the geo-redundant storage of subscriber data. Internally, it uses the Cassandra
NoSQL DB. Cassandra internally stores data in high-churn journal files, which are small but need
very high I/O rates for low-latency performance, and longer-lasting data tables (SSTables), which
may be large and still need reasonably good I/O rates. Further, Cassandra manages replication at the
application level, spreading the data across multiple nodes. It does not require or expect individual
nodes to have redundant storage, and in fact any redundant storage is wasteful.

• Accepted best practice for deploying Cassandra in a virtual infrastructure that distinguishes
between ephemeral and persistent storage is to use fast (ideally local SSD) ephemeral storage for
both journals and SSTables.
• If using volume storage, you should use storage with the lowest redundancy level possible
consistent with no two Cassandra nodes sharing a single point of failure. For example, it would not
be acceptable to store the data on non-redundant storage if that storage might be serving more
than one Cassandra node. In such a case a single failure could result in the failure of multiple
supposedly independent Cassandra nodes, and hence in data loss.

SAS

SAS is a diagnostics and analytics engine storing real-time logs for up to a week. It places very high
demands on storage (up to 20TB for versions prior to V10.2 and up to 10TB for V10.2 and later) and
I/O rate (up to 1200 IOPS for versions prior to V10.2 and up to 600 IOPS for V10.2 and later). SAS is
usually regarded as mission-critical. Where this is true, its data must be stored in a persistent, resilient
store, so that no single disk failure loses an entire week's data set.

The techniques used by virtual storage solutions to provide resilience to disk failure vary. Some simply
hold multiple copies of the data (replication); others use RAID arrays or more sophisticated forms
of erasure encoding. As a consequence, the storage overhead (i.e. the difference between the size
of raw storage required and the size of data being stored) can vary greatly, from tens of per cent for
erasure coding up to a factor of 3 for simple replication.

The virtual infrastructure may offer different tiers of resilient storage with different degrees of
redundancy. If so, given the nature of SAS data (transient logs rather than long-term subscriber data)
and its large size, Metaswitch recommends using the tier with the lowest storage overhead.

Additionally, SAS is also supported on ephemeral storage, provided it can deliver sufficient storage
capacity and I/O rate for the load on the system. When deployed in this manner, of course, any
hardware failure results in the loss of SAS data. This may be acceptable in some circumstances.

Storage for EAS pool server systems


MetaSphere EAS can be deployed either as a single VM backed by shared redundant storage,
as a pair of VMs in active/standby mode, or as a pool of load balanced VMs. When deployed as a
pooled server system, MetaSphere EAS requires a data store. This role can be fulfilled by one of the
following components.

• A NAS device, as described in NFS storage on page 25.


• Metaswitch's Storage Cluster, as described in Storage Cluster and Rhino TSNs on page 25.

NFS storage

MetaSphere EAS can be deployed alongside a NAS device (sometimes also referred to as a filer),
which acts as a data store. This is accessed via NFS from within the guest.

You can either use the UC9035 NAS device supplied by Metaswitch, or provide your own filer with
equivalent capabilities. See https://communities.metaswitch.com/docs/DOC-138232 for full details of
the requirements that a third-party filer must fulfill.

Note:

Network access to NFS storage from within the guest is only required and supported for EAS pool
server VMs. Other products do not make direct use of NFS storage.

Storage Cluster and Rhino TSNs


The Storage Cluster can be deployed alongside MetaSphere EAS pooled server systems running
V9.6 or later instead of a NAS device. A single Storage Cluster can be used to provide a data store
for a single pooled server system, or two separate pooled server systems (each with a local storage
cluster) can be linked together to form a MetaSphere EAS Geographically Redundant (GR) system.

A Rhino TAS Storage Node (TSN) is a VM that runs two Cassandra databases and provides
these databases' services to the other node types in Rhino VoLTE TAS and Mobile Control Point
deployments. Rhino TSNs run in a cluster with between 3 and 30 nodes per cluster depending on
deployment size, with load-balancing performed automatically.

The Storage Cluster and a Rhino TSN cluster both provide a redundant data store, designed to
be deployable on non-redundant underlying hardware. These components must be backed by storage
with performance similar to an SSD, including high rates of IOPS (as indicated in the Metaswitch Products
VMware Deployment Design Guide and Metaswitch Products OpenStack Deployment Design
Guide) and low read latency of less than a millisecond. The following is our standard recommended
specification, although you may choose to use a different specification as long as it meets the
requirements given above; a rough way to probe latency is sketched after the list below.

• Fast underlying block devices. We strongly recommend that you use SSDs. The overall latency
and throughput of IOPS is normally a more important consideration than capacity for Storage
Cluster VMs or Rhino TSNs.
• Non-redundant storage.

• Storage Cluster VMs and Rhino TSNs replicate the data around the local cluster and operate
using a quorum concept. This means that data is written out 3 times.


• Additional storage redundancy is often either wasteful or can in some cases cause degraded
performance (for example, if the storage layer makes the guest VM wait for replication to
complete after each storage request).
• Running Storage Cluster VMs or Rhino TSNs on underlying block devices using distributed
storage technology such as Ceph RBD or VMware vSAN may cause performance issues. To
avoid this, please discuss the storage layer design and configuration with your NFVI provider.
For example, consider configuring the storage layer to only use storage that is local to the host
and/or to avoid replicating the data between Storage Cluster VMs or Rhino TSNs.
• 10Gb Ethernet NICs available to Storage Cluster VMs or Rhino TSNs, both for application layer
and storage layer network access.
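
As a rough way to sanity-check whether a candidate block device is in the SSD performance class
described above, the following Python sketch times small synchronous writes and reports the median
latency. It is illustrative only (the file path, sample count and block size are arbitrary choices, not
Metaswitch requirements); for formal benchmarking of IOPS and latency, use a dedicated tool such
as fio.

    import os
    import statistics
    import time

    def median_sync_write_latency_ms(path, samples=200, block_size=4096):
        # Time small O_SYNC writes to estimate synchronous write latency on the
        # filesystem backing 'path'. Results are indicative only.
        buf = os.urandom(block_size)
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_SYNC, 0o600)
        latencies = []
        try:
            for _ in range(samples):
                start = time.perf_counter()
                os.write(fd, buf)
                latencies.append(time.perf_counter() - start)
        finally:
            os.close(fd)
            os.unlink(path)
        return statistics.median(latencies) * 1000.0

    if __name__ == "__main__":
        # Point this at a file on the block device under test (path is illustrative).
        print("median sync write latency: "
              f"{median_sync_write_latency_ms('/var/tmp/latency-probe.bin'):.3f} ms")

A sub-millisecond median is broadly in line with the SSD-class performance recommended above.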

It is often acceptable for lab systems to use lower specifications. Note that while this will often provide
sufficient performance for lab testing, the impact on performance is generally non-linear, and intensive
operations such as upgrades may take longer than they would on a Storage Cluster or a Rhino TSN
cluster using underlying hardware that meets the recommendations above.

Note:

Internally, Storage Cluster VMs and Rhino TSNs use Ceph functionality. Metaswitch are not
endorsed by or associated with Ceph or the Ceph trademark in any way. However, this information
may be useful when planning the provision of block devices for Storage Cluster and Rhino TSN
deployments.

MVM storage
MVM has additional requirements for distributed storage services that go beyond the basic virtual
storage offerings (specifically, a Cassandra database and a media store with an S3 API). Contact your
support representative for more details.

2.6 Other considerations

2.6.1 VM images

This section explains the mode of delivery of virtualized Metaswitch products and the support
requirements for this mode of delivery.

Metaswitch products are delivered as complete VM images containing a Linux OS and the Metaswitch
application.

This is in contrast to the alternative model of delivering an application-only package to be installed


onto a VM which has been pre-instantiated with a base OS provided by the virtual infrastructure.

The virtual infrastructure, and any usage policies, must support the model of complete VM images
including an OS.


2.6.2 Other software

This section outlines Metaswitch products' lack of dependency on other software in the virtual
infrastructure.

Metaswitch products have no requirements for additional software or services, either on the compute
hosts or elsewhere in the virtual infrastructure, beyond what is explicitly specified in this document.
So, for example, there is no dependency on an external SQL database.

2.6.3 Onboard agents

This section offers guidance on the use of onboard agents in a virtual infrastructure on which
Metaswitch products are deployed.

Some virtual environments require custom onboard agents to be installed on VMs, for example to
perform orchestration actions or feed information into a common fault and performance monitoring
infrastructure.

Onboard agents can create conflicts with Metaswitch products. Each Metaswitch VM is designed as
a complete, self-contained bundle including a customized OS. The bundled OS has been specifically
configured to run Metaswitch products optimally, and to disable unneeded services. The current
generation of Metaswitch VMs closely manages the underlying operating system files and services,
and on detecting changes, may repair or revert them.

Therefore, rather than using onboard agents, it is preferable for you to make use of the native fault
and performance management and orchestration APIs already provided by our products.

However, if that option is not feasible for your environment and you have a definite requirement to
install extra software, please discuss this with your Metaswitch Support representative before
attempting any integration.

2.6.4 Time synchronization

This section explains Metaswitch products' requirements and mechanisms for accurate timekeeping.

Metaswitch products must keep accurate time to operate correctly, including maintaining the integrity
of billing records and other logs. They therefore use NTP to ensure accurate timekeeping, and to
avoid a single point of failure must be configured with the IP addresses of two or more NTP servers.
NTP traffic to and from Metaswitch VMs will be carried over the Metaswitch management network;
there must be NTP servers reachable from this network.
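
For example, a quick way to confirm that the configured NTP servers are reachable from the
management network is a minimal SNTP query such as the Python sketch below. The server names
are placeholders for your own NTP servers; this is a connectivity check only, not a substitute for the
products' own NTP configuration.

    import socket
    import struct
    import time

    NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01

    def query_ntp(server, timeout=2.0):
        # Send a minimal SNTP client request (LI=0, VN=3, Mode=3) and return the
        # server's transmit timestamp as Unix time, or raise on failure.
        packet = b'\x1b' + 47 * b'\0'
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(packet, (server, 123))
            data, _ = sock.recvfrom(512)
        transmit = struct.unpack('!I', data[40:44])[0]
        return transmit - NTP_EPOCH_OFFSET

    if __name__ == '__main__':
        # Replace with the NTP servers reachable from your management network.
        for server in ('ntp1.example.com', 'ntp2.example.com'):
            try:
                offset = query_ntp(server) - time.time()
                print(f'{server}: reachable, approximate offset {offset:+.2f}s')
            except OSError as exc:
                print(f'{server}: NOT reachable ({exc})')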

Furthermore, in most virtual environments VM instances inherit their initial time settings from the host.
To avoid unnecessary time jumps when starting instances, the hosts must also keep accurate time.

Some Metaswitch products running in Azure support synchronizing their time with Azure's PTP
service on an ongoing basis. If your product does not use Azure's PTP service, or is not deployed
in Azure, then your product must use NTP servers if you need to synchronize time. Other virtual
infrastructures can provide mechanisms to allow VMs to sync their time on an ongoing basis, but
these mechanisms are incompatible with Metaswitch products and must not be used with Metaswitch
VMs. Refer to your product's documentation for details on how it performs time synchronization.

Attention:

Rhino nodes use the system clock to monitor the Java Virtual Machine and assume that the clock
always moves forward. Clock jumps must not exceed 8 seconds, or Rhino nodes may exhibit
unpredictable behavior.
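
To illustrate the kind of step that matters here, the sketch below (purely illustrative, not part of any
Metaswitch product) compares wall-clock progress against the monotonic clock and flags steps larger
than the 8-second tolerance; the polling interval is an arbitrary choice.

    import time

    def watch_for_clock_steps(threshold_s=8.0, poll_s=1.0):
        # Flag system clock steps by comparing wall-clock and monotonic progress.
        last_wall = time.time()
        last_mono = time.monotonic()
        while True:
            time.sleep(poll_s)
            now_wall = time.time()
            now_mono = time.monotonic()
            # If the system clock was stepped, wall-clock progress differs from
            # monotonic progress by roughly the size of the step.
            step = (now_wall - last_wall) - (now_mono - last_mono)
            if abs(step) > threshold_s:
                print(f"system clock stepped by about {step:+.1f}s")
            last_wall, last_mono = now_wall, now_mono

    if __name__ == "__main__":
        watch_for_clock_steps()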


3 Support for virtual infrastructure management and operational features

This section outlines Metaswitch products' support for specific OpenStack and VMware features.

Virtual infrastructures may support a wide range of management and operational features beyond
the basic virtualization functions of creating and operating VMs, virtual storage devices and virtual
networks.

• Many of these features are fully supported and can be used without restriction with Metaswitch
products.
• A few of these features are known to be incompatible with Metaswitch products, and must not be
used.
• Finally, there are features that are unsupported as they have not been tested with Metaswitch
products. These features are not listed individually; any feature not explicitly listed as supported
or incompatible has not been tested with Metaswitch products and you should assume that it is
unsupported.

Attention:

If you wish to make use of any virtual infrastructure features not explicitly listed as supported, you
will need to test the impact of these features in your environment and you will be responsible for
any issues they cause. If you are thinking of using such features it is strongly recommended that
you discuss this in advance with your Metaswitch Support representative.

3.1 OpenStack features

3.1.1 Supported OpenStack features

This section provides information about OpenStack features that are fully supported with some or all
Metaswitch products.

Feature Description

Migrate Moves a VM from one host to another. This specifically refers to


"non-live migration" as opposed to "live migration," as defined in
http://docs.openstack.org/admin-guide-cloud/compute-configuring-
migrations.html. The latter is not supported, as discussed below. The
former is supported for all Metaswitch products.

Resize instance (not supported on Clearwater Core, Perimeta, MVM, the Secure Distribution
Engine (SDE), the BGCF VM or DCM)

Changes the resources allocated to a VM. Supported as a mechanism for moving between
supported VM sizes on all products excluding Clearwater Core, Perimeta, Secure Distribution
Engine, MVM and DCM, provided the appropriate Metaswitch MOPs are followed.

For Rhino nodes use of this feature requires a JVM restart and in some cases changes in the
JVM configuration (Heap size for instance).

Evacuate (not supported on Clearwater Core, Perimeta, MVM, DCM, MetaView Statistics Engine,
the Secure Distribution Engine or the BGCF VM)

Reinstantiates VMs on a different host following host failure. Supported on all products excluding
those that use ephemeral storage for their boot device (Clearwater Core, Perimeta, Rhino nodes,
MVM, DCM, MetaView Statistics Engine and the Secure Distribution Engine).

Shelve, unshelve Shelve both stops a VM and releases all resources associated with it
other than disk space. Unshelve reinstates and restarts the VM.

Lock, unlock Protects VMs against inadvertent deletion or other changes.

VLAN trunking Perimeta can set VLAN tags on packets when deployed in an
OpenStack environment subject to the following conditions.

• Perimeta must be used with virtio network interfaces.


• Guest VLANs are supported by the underlying OpenStack network
implementation.

This feature does not apply to other Metaswitch products.

Soft affinity Soft affinity rules enable you to specify a desired behavior for
allocating VMs to hosts, falling back to the undesired behavior if host
resources are not available to fulfill the desired behavior.

Soft anti-affinity rules (implemented using server groups) are


supported and recommended for all Metaswitch VMs deployed in
pools.

GRE, VXLAN and other overlay network types

These network types can have an MTU size lower than the standard 1500. They are supported for

• Perimeta on management, HA and service interfaces


• Secure Distribution Engine on all interfaces (default: 1400)
• CFS and MVD on HA interfaces only.

They are incompatible in all other cases unless they are engineered
to have an MTU of at least 1500 bytes as viewed by the VM guest.


CPU model PCID flag (from OpenStack Pike onward)

This feature extends the CPU flags that show up in the guest. Including the PCID flag helps
mitigate slowdown that may result from patching a host to guard against Meltdown and Spectre
exploits.

Attention:

Please note the restrictions on the Resize instance and Evacuate functions in the table above.
These functions are not supported on Clearwater Core, Perimeta, MVM, DCM, BGCF VM or
Secure Distribution Engine. Additionally, the Evacuate function is not supported on MetaView
Statistics Engine.

3.1.2 Incompatible OpenStack features

This section lists OpenStack features that are incompatible with some or all Metaswitch products.

Feature Description

Pause, Suspend and Per-VM operation. Breaks NTP sync. Use stop and start instead.
Resume

Rebuild Recreates a VM from scratch. Must not be used with products that
boot from volume storage rather than an image, namely CFS, AGC/
MGC, RPAS, OBS, ESA Proxy, EAS, AMS, MVD, MVS, SAS, Rhino
nodes.

For Perimeta, the use of this feature is permissible but strongly discouraged, as it may cause
network connectivity problems due to incorrectly assigned vNICs.

This feature is compatible with the Secure Distribution Engine (SDE) and is used for upgrading
its VM instances.

Resize instance Changes the resources allocated to a VM. Cannot be used with
Clearwater, Perimeta, MVM, DCM, BGCF VM or Secure Distribution
Engine.

Evacuate Reinstantiates VMs on a different host following host failure. Cannot


be used with Clearwater Core, Perimeta, MVM, DCM and MetaView
Statistics Engine.

Snapshots Takes a copy of a running VM. This feature is unreliable and


disruptive, and should not be used for taking regular system backups;
application backup mechanisms should be used for this purpose.


You should use snapshots only when specifically recommended by


Metaswitch for a particular process on a product in your deployment.

Virtio multiqueue Not supported by Metaswitch products.

MTU advertisement Advertises MTU values to instances via the Neutron DHCP agent.
The advertised MTU values are ignored by Perimeta, Secure
Distribution Engine, DCM and SAS.

Real-time instances Allows vCPUs to be given a fixed real-time scheduler priority.


This feature is incompatible with Perimeta using SR-IOV, and
unsupported for all other products.

Virtual network device tagging
Provides a mechanism for users to tag a device they have assigned to their guest with a specific
role. Metaswitch products ignore tags assigned to devices.

Provisioning VLAN tag information to guest via metadata (from OpenStack Ocata onward)
This feature requires the use of trunk ports. Metaswitch products will generally ignore the
additional metadata provided.

GRE, VXLAN and other overlay network types

These network types can have an MTU size lower than the standard 1500. They are supported for

• Perimeta on management, HA and service interfaces


• Secure Distribution Engine (default: 1400)
• CFS and MVD on HA interfaces only.

They are incompatible in all other cases unless they are engineered
to present an MTU of at least 1500 bytes on all networks as viewed
by VM guests.

Volume extension (from OpenStack Pike onward)
This feature allows in-use Cinder volumes to be extended. Metaswitch products expect volumes
of a set size and do not expect them to change size while in use.

Writeable MTU (from OpenStack Pike onward)
This feature allows the setting of specific MTU values on networks. Metaswitch products expect
standard MTU values for both overlay and non-overlay networks.

Support for virtual GPUs (from OpenStack Queens onward)
Metaswitch products do not require or understand virtual GPU devices.


Volume multi-attach (from OpenStack Queens onward)
This feature allows a single volume to be attached to multiple instances simultaneously.
Metaswitch products expect exclusive access to the storage volumes they use.

3.2 VMware features

3.2.1 Supported VMware features

This section provides information about VMware features that are fully supported with some or all
Metaswitch products.

Feature Description

vSphere HA Not to be confused with VMware App HA, which is unsupported -


see below. vSphere HA restarts failed VMs.

Cold migration Cold migration moves a VM from one host to another (change
compute resource) and/or one datastore to another (change
storage resource) while the VM is powered down (as distinct
from vMotion below, which moves the VM while it is running).

Cold migration is supported on all products, although extra steps


may be needed on products subject to physical constraints to
reconnect the necessary physical resources on the new host
before powering the VM back on:

• For Perimeta, Secure Distribution Engine (SDE) and Rhino


nodes using SR-IOV, the VM must be reconfigured to use an
appropriate set of SR-IOV VFs on the new host.
• For DCM using USB tokens, you must move the USB token to
the new host and reconfigure vSphere to pass it through to the
migrated DCM VM.

Note that Metaswitch VMs do not support being migrated


between vCenter Servers.

vMotion (not supported on all products: see Description)

Also known as hot or live migration, vMotion moves a VM from
one host to another while it is running. Its operation involves
briefly freezing the VM, so it will cause a transient impact to its
ability to service load, for example resulting in media glitches.
vMotion is supported on all products apart from those subject to
physical constraints, namely:


• Perimeta or Secure Distribution Engine using SR-IOV


• Perimeta using PCI passthrough
• DCM using USB tokens
• Rhino nodes
• Clearwater Core
• Metaswitch CCF

However, it should be viewed as a maintenance activity only to


be used during maintenance windows or periods of low load. See
note below for further details.

Note that Metaswitch products do not support the enhancement,


introduced in vSphere 6.0, that allows vMotion between vCenters.

Encrypted vMotion Encrypted vMotion (introduced in vSphere 6.5) encrypts the
traffic used to migrate VMs between hosts. This is supported
subject to the same restrictions as for vMotion, above.

Distributed Resource Scheduler (DRS), Distributed Power Management (DPM)
(not supported on all products)

DRS automatically balances VM workloads between hosts so
that resources are fairly shared out. It can run on a schedule or
manually. DPM moves all VM workloads to fewer hosts so that
one or more hosts can be shut down to save power during a quiet
period. Both use vMotion to move VMs.

DRS and DPM are supported, subject to the same product


restrictions as vMotion, noted above. DRS and DPM should
be configured as Manual, and appropriate anti-affinity must be
added to the DRS configuration to keep the VM instances of the
same products distributed across multiple physical hosts.

Metaswitch does not recommend the automated use of DRS or


DPM as it is likely to be service impacting. In particular, in an
environment where Metaswitch and non-Metaswitch VMs share
a host there is the possibility of a non-Metaswitch VM ramping up
quickly and forcing a Metaswitch product VM to be relocated at
a busy time. We recommend that you do testing in your labs to
ensure that production systems will not be impacted when these
options are in use.

Content Library Content Library (introduced in vSphere 6.0) provides a


centralized storage for VM templates and OVF images making it
easier to store, manage and share content on the vCenter.

This is not currently supported for Clearwater. It is supported for


• Perimeta from V4.1 onward


• All Secure Distribution Engine versions
• All Rhino versions
• DCM from V3.0 onward
• MVM from V2.15.0 onward
• all other products from V9.2.10 onward.

Component Protection Component Protection (introduced in vSphere 6.0) is an


extension to vSphere HA that protects against datastore
inaccessibility by providing automated recovery for VMs that lose
access to their storage. Two levels of component protection are
available:

• APD (All Paths Down) protection provides configurable


restart options for recovering VMs that have temporarily lost
contact with their storage.
• PDL (Permanent Device Loss) provides failover of VMs
to a new host when an unrecoverable loss of datastore
accessibility occurs.

Metaswitch supports both types of configuration, but recommends


using APD with the following settings:

• Configure Response for datastore with All Paths


Down (APD) to power off and restart affected VMs (either
conservatively or aggressively).
• Leave Delay for VM failover for APD set to its default of
three minutes unless you have a specific requirement to
change this setting.

We do not recommend any particular setting for the Response


for APD recovery after APD timeout setting.

NSX Network Virtualization and Supported by Metaswitch products.


Security Platform

Attention:

Please note the restrictions detailed in the table above:

• vMotion, DRS, and DPM are incompatible with Perimeta with PCI passthrough; Perimeta,
Secure Distribution Engine or Rhino nodes with SR-IOV; DCM with USB tokens; Clearwater
Core; and Metaswitch CCF.


• For products deployed as 1+1 pairs, vMotion should be used only on the standby instance, as
using it on the active instance can cause detectable media glitches. (In the case of CFS, for
example, the impact is broadly comparable to that of a Software Protection Switch.) You can
use an SPS to ensure that the instance is operating as the standby before using vMotion. You
should do this during a maintenance window if possible.
• The vSphere 6.0 extension to vMotion that allows the movement of VMs between vCenters is
incompatible with Metaswitch products.
• DRS and DPM should be configured for Manual operation only.

3.2.2 Incompatible VMware features

This section lists VMware features that are incompatible with some or all Metaswitch products.

Feature Description

Suspend/Resume This feature freezes VMs to allow maintenance operations on a host.
It causes an outage if used on an active VM, or a loss of redundancy if
used on a standby. Use vMotion instead, or require that VMs are shut
down and then restarted.

Snapshots These preserve the state and data of a virtual machine at a specific
point in time, including the contents of the virtual disk, and optionally
the virtual machine's memory as well. Restoring a snapshot can
interfere with product operations, including HA mechanisms. Taking
snapshots, particularly including memory dumps, can be detrimental
to performance / availability, and the way VMware stores them can
require significant storage over time.

This feature should therefore not be used for taking regular system
backups; application backup mechanisms should be used for
this purpose. You should use snapshots only when specifically
recommended by Metaswitch for a particular process on a product in
your deployment.

vSphere Data Protection This feature offers an efficient backup solution at the scope of entire
virtual machines. However, it uses snapshots and must therefore not
be used.

vSphere Data Replication This feature builds on the backup facilities provided by vSphere Data
Protection to allow a recovery site to be maintained. It therefore also
uses snapshots and so must not be used.

Cloning The clone operation takes a copy of an existing VM with the


expectation that both existing and new VMs will continue to run.


However, cloning can interfere with existing systems, for instance by


causing duplicate IP addresses, and must not be used.

VMware Fault Tolerance This provides full stateful sync between an active and standby VM
and almost hitless failover should the active fail. VMware ensures
all I/O, CPU, memory and storage actions on the active host are
synchronized across the network to the standby. Metaswitch products
implement their own more efficient application-level redundancy
mechanisms, and VMware Fault Tolerance must not be used
alongside or instead of those.

VMware App HA This is application-level monitoring with policy-based failover and
restart. It goes beyond the basic watchdog function, adding detailed
policy configuration and tiered restart (e.g. attempting to restart a
specific failed service N times in a given period before moving to
application- or VM-level intervention). As with VMware Fault Tolerance,
Metaswitch products implement their own redundancy mechanisms and
VMware App HA must not be used alongside or instead of those.

VMware Clock This synchronizes the clocks on guests with that on the host,
Synchronization manipulating the guest OS's system clock to correct for the variable
rate of virtual timing devices. Metaswitch products must not use this
feature, and require that you configure your VM guests to synchronize
to one or more NTP servers in your local network.

In no circumstances should VMware Tools Clock Synchronization (or


any other host-based time synchronization) be used in conjunction
with NTP on the guest. Doing this would result in multiple services
attempting to make conflicting updates to the clock.

vMotion, Distributed Resource Scheduler (DRS), and Distributed Power Management (DPM)
on Perimeta or Secure Distribution Engine (SDE) with SR-IOV, or DCM using USB tokens

vMotion moves a VM from one host to another while it is running. Its
operation involves briefly freezing the VM, so it will cause a transient
impact to its ability to service load, for example resulting in media
glitches.

DRS automatically balances VM workloads between hosts so that
resources are fairly shared out. It can run on a schedule or manually.
DPM moves all VM workloads to fewer hosts so that one or more
hosts can be shut down to save power during a quiet period. Both use
vMotion to move VMs.

vMotion, DRS and DPM must not be used with Perimeta and Secure
Distribution Engine configured to use SR-IOV or DCM configured to
use USB tokens.


Migration between vCenter Server instances (including cross vCenter Server vMotion and long
distance vMotion)

As noted in Supported VMware features on page 33, Cold migration and vMotion are supported
for most Metaswitch products. However, the ability to migrate VMs between vCenters (introduced
in vSphere 6.0) is incompatible with Metaswitch products, as the resulting VM has no OVF
environment data. Metaswitch products rely on this data for network configuration.

Note that this restriction applies regardless of whether the VM is


running or powered down at the time of the migration.

VM secure boot Secure boot of VMs (introduced in vSphere 6.5) is incompatible with
Metaswitch products as it requires UEFI boot and Metaswitch VMs
use BIOS boot.

Instant Clone Instant clone creates a clone of a VM instantaneously using shared


disk and memory resources. As with the Cloning feature above, it can
interfere with existing systems by introducing duplicate IP addresses
into the deployment, so it must not be used with Metaswitch products.


4 Requirements in SLA form

This section summarizes the cloud requirements described in this manual in a form suitable for
inclusion in an SLA, indicating any requirements which apply only to some products.

4.1 Mandatory requirements

The virtual environment must provide the requirements listed in this section.

Requirement Products

M.1 It must be possible to guarantee that the RAM allocated to All


a VM is available at all times.

M.2 It must be possible to guarantee that the disk space All


required by the service is available at all times.

M.3 It must be possible to guarantee that volume storage SAS, EAS, Storage Cluster
provides a guaranteed level of disk I/O.

M.4 Hosts must use Intel x86 processors, not AMD. All

M.5 Network latency between VMs must be within individual All


product targets.

M.6 VMs must have direct access to external networks without All
intervening NAT.

M.7 It must be possible to use L2 gratuitous ARP to move an IP      CFS, AGC/MGC, MVD,
address between VMs, and to reserve an IP address to use            MVS, Perimeta, Secure
for this purpose.                                                    Distribution Engine (SDE)

This does not apply to standalone (non-HA) Perimeta or MVS.

M.8 It must be possible to instantiate active/standby VMs and       All except for N-Series and
N+K pools of VMs with appropriate anti-affinity rules that are      EAS.
preserved over any automatic processes such as healing or
migration.

M.9 The virtual infrastructure must be able to specify to its users All
its overall availability and the availability of individual VMs.


M.10 [If an individual virtual infrastructure's availability is not All


enough to create the required service level availability]
Multiple virtual infrastructures must be available.

M.11 [If using multiple infrastructures and calls must be           CFS, AGC/MGC, MVD,
maintained over infrastructure failure] The L2 gratuitous           MVS, Perimeta, Secure
ARP and IP reservation functions mandated in M.8 must               Distribution Engine
work across virtual infrastructures.

M.12 The virtual infrastructure and associated use policies All


must support VMs delivered as complete VM images
containing a Linux OS together with the application, without
restrictions on OS version.

M.13 The virtual infrastructure must provide ephemeral and All


persistent volume storage.

M.14 The virtual infrastructure must not use, on Metaswitch         All
VMs, any of the platform-specific incompatible features
detailed in Support for virtual infrastructure management and
operational features on page 29.

M.15 The virtual infrastructure must keep accurate time. There All
must be two or more NTP servers reachable from the
Metaswitch VM's management networks.

This does not apply to Perimeta deployed in Azure.

M.16 The virtual infrastructure (combined with external All


infrastructure such as physical firewalls) must be resilient
to excess traffic on untrusted networks and must be able to
firewall traffic to specific VNFs. In particular, excess traffic
on untrusted networks must not impact other networks
through vSwitch congestion and similar conditions.

4.2 Additional requirements for optional features

The virtual environment may have to provide the requirements listed in this section, depending on
specific customer requirements.


O.1 If the customer requires physical separation of virtual All


networks, the virtual environment must support a way of
setting this up.

4.3 Recommendations for performance

This section lists cloud recommendations that, if followed, guarantee that stated capacity will be met.
If any of these are not met, the Metaswitch products will still run, but the operator must do their own
testing to determine the performance and capacity of the products in their environment.

Requirement Products

P.1 CPU and network bandwidth should not be over-committed. All

P.2 For all Metaswitch VM flavors with 16 or fewer vCPUs, it All


should be possible to pin the VM to a single CPU (NUMA
node)

P.3 NUMA placement should be supported. All

P.4 Any power management should not impact VM scheduling. All

P.5 The virtual infrastructure should provide an accelerated        Perimeta, Rhino nodes,
data plane - SR-IOV, PCI passthrough or a DPDK-                     Secure Distribution Engine,
accelerated vSwitch (only SR-IOV for the Secure                     EAS (OpenStack only),
Distribution Engine (SDE)).                                          Storage Cluster (OpenStack only)

P.6 The virtual infrastructure should provide suitable storage      OBS, SAS, MVM, Storage
mappings.                                                            Cluster

4.4 Recommendation for service availability

The recommendation listed in this section is desirable to maximize service availability.

Requirement Products

A.1 The virtual infrastructure should provide VM watchdog All


functions.


5 Environment-specific requirements

This section outlines requirements for Metaswitch products that are specific to particular virtualization
environments or clouds.

5.1 OpenStack requirements

5.1.1 OpenStack releases

Metaswitch virtual products support OpenStack with the KVM hypervisor. This section lists the
OpenStack releases supported by each Metaswitch product.

OpenStack release support by product

Support and testing policy

Metaswitch products are currently supported from OpenStack's Newton release onward. The support
matrix for Metaswitch product versions and OpenStack releases defines three levels of support:

• Tested (Te): The specified product version has been thoroughly tested running on the specified
OpenStack release by our product teams and is guaranteed to work and to meet our stated
capacity and performance benchmarks.
• Supported (Su): You may run the specified product version on the specified OpenStack release;
however, this product version/OpenStack release combination has not undergone extensive
testing in our labs. You must therefore test all aspects of your deployment in your own lab before
deploying this product version/OpenStack release combination.
• Not supported (No): We do not support the specified product version/OpenStack release
combination, and make no guarantees whatsoever that the product will work as intended or that
we will be able to assist with any problems you may encounter when running this combination.
We withdraw support for an OpenStack release when it is out of support with Red Hat (however,
we may declare support for a release before it is adopted by Red Hat). For details of Red
Hat's support schedule see Red Hat OpenStack Platform Lifecycle. Mappings between named
OpenStack releases and Red Hat OpenStack Platform release numbers, where applicable, are
listed below.

Attention:

As part of our support agreement, we require you to adhere to the following conditions when
deploying Metaswitch products on OpenStack:

• You agree to source your OpenStack infrastructure from a reputable vendor and ensure that
the release you have deployed is still in support by that vendor (vendors typically provide an
extended support period of up to five years for selected OpenStack releases).


• When deploying a product version/OpenStack release combination that is Supported but not
Tested, you agree to perform extensive lab tests on your deployment before making it live.
• Metaswitch will make every effort to help you solve any unexpected problems that may arise
when deploying on an OpenStack release that has not yet been specifically tested with our
products. However, you understand that such fixes and workarounds will take time to develop,
and on rare occasions - for example, if an OpenStack release contains a bug that breaks
compatibility with our products in a fundamental way - it may not be possible to work around the
problem and you may need to consider deploying on a different OpenStack release.

If you intend to deploy existing versions of the Metaswitch products on OpenStack releases that are
not explicitly identified as supported, please discuss this with your Support representative.

Supported releases

Metaswitch products support the following OpenStack releases, as detailed in the table below:

• N: Newton (equivalent to Red Hat OpenStack Platform release 10)


• O: Ocata (Red Hat 11)
• P: Pike (Red Hat 12)
• Q: Queens (Red Hat 13)
• R: Rocky (Red Hat 14)
• S: Stein (Red Hat 15)
• T: Train (Red Hat 16)
• U: Ussuri
• V: Victoria
• W: Wallaby

Table 1: Supported OpenStack releases by Metaswitch product

Product Version(s) N O P Q R S T U V W

Perimeta V4.2- Te No No No No No No No No No
V4.2.20

V4.2.40- Te Te No No No No No No No No
V4.6.20

V4.6.40- Te Te Te Te No No No No No No
V4.8.20

V4.8.25+ Te Te Te Te Te Su Su Su Su Su


Clearwater V11.1, Te Te No No No No No No No No
Core V11.2,
V11.2.01

V11.3, Te Te Te Te Su Su Su Su Su Su
V11.3.01,
V11.4,
V11.4.02

V11.5+ No No Te Te Su Su Te Su Su Su

DCM V3.1-V3.2 Te No No No No No No No No No

V3.3 Te Te No No No No No No No No

V3.4 Te Te Te Te Su Su Su Su Su Su

V4.0 Su Su Su Su Te Te Te Su Su Su

Service V9.3.20 Te No No No No No No No No No
Assurance
Server V9.4 - V10 Te Te No No No No No No No No

V11 - V12 Te Te Te Te No No No No No No

V12.10+ Te Te Te Te Te Te Te Su Su Su

CFS, AGC, MGC, EAS, MVS, MVD, RPAS, OBS, ESAP

            V9.3.20          Te No No No No No No No No No

            V9.4-V9.4.30     Te Te No No No No No No No No

            V9.5-V9.5.30     Te Te Te Te No No No No No No

            V9.5.40-V9.6.10  Te Te Te Te Te No No No No No

            V9.6.20+         Te Te Te Te Te Te Te Su Su Su

Advanced V9.4- Te Te Te Te No No No No No No
Messaging V9.5.30
Service
(AMS)


V9.5.40- Te Te Te Te Te No No No No No
V9.6.10

V9.6.20+ Te Te Te Te Te Te Te Su Su Su

Metaswitch V1.0 Te Te Te Su Su Su Su Su Su Su
Deployment
Manager

MetaView V3.0-V3.2 Te No No No No No No No No No
Statistics
Engine

Metaswitch V5.0 Te Te No No No No No No No No
CCF
V6.0-V8.0 Te Te Su Su Su Su Su Su Su Su

V9.0 No No Su Su Te Su Su Su Su Su

MVM V2.15.0 Te Te Su Su Su Su Su Su Su Su

QCall V1.0+ Te Te Su Su Su Su Su Su Su Su

ServiceIQ V6.3.8+ Te Te Te Te Te Te Te Su Su Su
Management
Platform
(SIMPL)

ServiceIQ V1.0+ Te Te Te Te Te Su Su Su Su Su
Monitoring
(SIMon)

Rhino V2.6.0 - Te Su Su Su Su Su Su Su Su Su
VoLTE V4.0
TAS nodes
(MMT, SMO,
MAG, TSN),
standalone
TSN and
REM nodes

Rhino MCP V1.0 Su Su Su Su Su Su Su Su Su Su


node


Rhino nodes Rhino MAX Te Su Su Su Su Su Su Su Su Su


(MaX UC) nodes -
V4.0.0

Rhino MAG
nodes -
V3.0.0

Group V3.0 - Te Te Su Su Su Su Su Su Su Su
Application V3.1.1
Server

Secure V1.0+ No Te Su Su Su Su Su Su Su Su
Distribution
Engine
(SDE)

Deployment V1.0+ No Te Su Su Su Su Su Su Su Su
Configuration
Store (DCS)

Storage V1.0 Te Te No No No No No No No No
Cluster
V2.0+ Te Te Te Te Te Te Te Te Te Te

BGCF VM V11.4.07+ No No Te Te Te Te Su Su Su Su

Distributed V2.0+ No No No Te Te Su Su Su Su Su
Admission
Manager
(DAM)

5.1.2 Detailed OpenStack hints and tips

This section contains detailed hints and tips on assessing whether a certain cloud will meet the
requirements set out in this manual. It is not intended to be an exhaustive reference.

Guaranteed RAM

Requirement Products

M.1 It must be possible to guarantee that the RAM allocated to All


a VM is available at all times.


This can currently only be achieved in OpenStack by the cloud provider setting the
ram_allocation_ratio Nova config setting to 1. This can either be set on all hosts or on one or
more host aggregates.

Also, to prevent pre-emptive swapping out of VM memory it is recommended to set the


vm.swappiness sysctl setting to 0 on the host.
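
As an illustration, the following Python sketch reads a compute host's nova.conf and the
vm.swappiness sysctl to confirm these two settings. The file paths are the usual defaults, and the
option is assumed to live in the [DEFAULT] section (both may differ in your distribution); treat this as
a convenience check rather than a Metaswitch tool.

    import configparser
    from pathlib import Path

    def check_host_settings(nova_conf="/etc/nova/nova.conf",
                            swappiness_path="/proc/sys/vm/swappiness"):
        # Report ram_allocation_ratio and vm.swappiness on a compute host.
        nova = configparser.ConfigParser(interpolation=None, strict=False)
        nova.read(nova_conf)
        ratio = nova.get("DEFAULT", "ram_allocation_ratio", fallback="<unset>")
        swappiness = Path(swappiness_path).read_text().strip()
        print(f"ram_allocation_ratio = {ratio} (expected 1 for guaranteed RAM)")
        print(f"vm.swappiness        = {swappiness} (0 recommended)")

    if __name__ == "__main__":
        check_host_settings()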

Uncontended resources

Requirement Products

M.2 It must be possible to guarantee that the disk space All


required by the service is available at all times.

To guarantee dedicated disk space, the disk volumes used by Metaswitch product VMs must be thick
provisioned on the storage back end.

Direct access to external networks

Requirement Products

M.6 VMs must have direct access to external networks without All
intervening NAT.

Networks in OpenStack are categorized as either tenant networks or provider networks.

• Tenant networks, which can be created by users, are purely virtual networks and details about how
they are physically realized are hidden from those users. Tenant networks can usually only access
external networks via an OpenStack L3 gateway using NAT.
• Provider networks can only be created by administrators, and map directly to physical networks,
including external networks.

To meet this requirement on OpenStack you will usually need to have access to provider networks.

Moving and reserving IP addresses

Requirement Products

M.8 It must be possible to use L2 gratuitous ARP or neighbor        CFS, AGC/MGC, MVD,
advertisement packets to move IP address(es) between                MVS, Perimeta, Secure
VMs, and to reserve an IP address to use for this purpose.          Distribution Engine (SDE)

Unless using SR-IOV, this feature requires that the allowed_address_pairs Neutron API
extension is supported.
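
As an illustration of what this looks like in practice, the sketch below uses the openstack CLI (driven
from Python) to add an allowed address pair to an existing Neutron port so that a shared service IP
can move between VMs. The port name and IP address are placeholders; adapt them to your
deployment and confirm the exact CLI syntax against your OpenStack release.

    import subprocess

    def allow_service_ip(port, service_ip):
        # Permit 'service_ip' to be used on the given Neutron port
        # (relies on the allowed_address_pairs extension being enabled).
        subprocess.run(
            ["openstack", "port", "set",
             "--allowed-address", f"ip-address={service_ip}",
             port],
            check=True,
        )

    if __name__ == "__main__":
        # Placeholder values - replace with the real port and the reserved service IP.
        allow_service_ip("cfs-a-service-port", "10.0.1.100")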


Anti-affinity rules

Requirement Products

M.9 It must be possible to specify appropriate anti-affinity rules All


for both active/standby and N+K pools.

Anti-affinity for active/standby is a standard OpenStack function (scheduler hints). For N+K pools the
OpenStack cloud must support at least one of the following:

• server groups
• host aggregates
• availability zones
• regions

For most deployments, server groups with anti-affinity rules are the easiest (and therefore
recommended) method of meeting the anti-affinity requirements of Metaswitch products.

• For highly available VM pairs, server groups with hard anti-affinity rules should be used.
• For N+K pools of VMs, server groups with soft anti-affinity rules should be used. See Anti-
affinity and scheduling in Metaswitch Products OpenStack Deployment Design Guide for details on
how to configure soft anti-affinity in the Nova scheduler.

Attention:

Server groups are enabled by default in OpenStack but may be disabled in some distros (see
e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1150728). To enable server groups, add
ServerGroupAntiAffinity to scheduler_default_filters in /etc/nova/nova.conf
and restart nova services on the controller node and all compute host nodes.
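
For example, once server groups are available, a pool member can be scheduled with soft
anti-affinity as in the minimal sketch below (using the openstack CLI from Python). The group,
image, flavor, network and server names are placeholders; check the exact CLI options against your
OpenStack release.

    import subprocess

    def run(args):
        # Run an openstack CLI command and return its stdout.
        return subprocess.run(args, check=True, capture_output=True, text=True).stdout

    def create_pool_with_soft_anti_affinity(group_name="eas-pool"):
        # Create a server group with a soft anti-affinity policy.
        group_id = run(["openstack", "server", "group", "create",
                        "--policy", "soft-anti-affinity",
                        "-f", "value", "-c", "id", group_name]).strip()
        # Boot a pool member into the group via a scheduler hint.
        run(["openstack", "server", "create",
             "--image", "eas-image",      # placeholder image name
             "--flavor", "eas-flavor",    # placeholder flavor name
             "--network", "mgmt-net",     # placeholder network name
             "--hint", f"group={group_id}",
             "eas-pool-member-1"])

    if __name__ == "__main__":
        create_pool_with_soft_anti_affinity()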

Storage

Requirement Products

M.14 The virtual infrastructure must provide ephemeral and All


persistent volume storage.

OpenStack's Cinder LVM implementation works by creating a logical volume on a single storage host
and exposing it as an iSCSI target. It is therefore a single point of failure and not suitable for providing
persistent storage for Metaswitch products in production environments.

Attention:

Metaswitch products require thick provisioning. From OpenStack Pike onward, the Cinder LVM
backend lvm_type setting will use the auto value, which specifies thin provisioning, unless you
specify a specific value. From OpenStack Pike onward you must set lvm_type to default to
ensure that your deployment uses thick provisioning.

Volume storage must be provided by shared storage using either a distributed storage cluster, e.g.
Ceph, or by dedicated storage array hardware. Shared storage is always expected to use RAID or an
equivalent mechanism for redundancy.

Some products use file-backed ephemeral storage for their boot device as described in Storage
options in Metaswitch Products OpenStack Deployment Design Guide. To ensure that file-backed
ephemeral storage does not become fragmented over time, where this storage is used it should be
preallocated at instantiation. This can be achieved by setting the preallocate_images Nova config
option to space on Compute hosts.
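
A quick way to confirm both of these settings on a given host is sketched below. The configuration
file paths are the usual defaults and the Cinder backend section name ('lvm' here) is
deployment-specific, so treat this as illustrative only.

    import configparser

    def report_option(path, section, option, expected):
        # Print the current value of a config option alongside the expected value.
        cfg = configparser.ConfigParser(interpolation=None, strict=False)
        cfg.read(path)
        value = cfg.get(section, option, fallback="<unset>")
        print(f"{path} [{section}] {option} = {value} (expected: {expected})")

    if __name__ == "__main__":
        # Thick provisioning for the Cinder LVM backend (section name is
        # deployment-specific; 'lvm' is a common choice).
        report_option("/etc/cinder/cinder.conf", "lvm", "lvm_type", "default")
        # Preallocation of file-backed ephemeral storage on compute hosts.
        report_option("/etc/nova/nova.conf", "DEFAULT", "preallocate_images", "space")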

CPU not over-committed

Requirement Products

P.1 CPU and network bandwidth should not be over-committed. All

This can currently only be achieved in OpenStack by the cloud provider setting the
cpu_allocation_ratio Nova config setting to 1. This can either be set on all hosts or on one or
more host aggregates.

Additionally, from OpenStack Kilo onwards this can be achieved by pinning the vCPUs of VMs to
physical CPUs of the host. We recommend that you use host aggregates to identify those hosts that
can provide dedicated physical CPUs. This requires the operating system used by the host to include
QEMU 2.1 (or above) compiled with NUMA support.
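
For example, dedicated (pinned) vCPUs can be requested through a flavor extra spec, as in the
sketch below. The flavor name is a placeholder, and the property shown (hw:cpu_policy=dedicated)
is the standard Nova mechanism from Kilo onward; confirm against your release's documentation
before relying on it.

    import subprocess

    def pin_flavor_cpus(flavor="perimeta-medium"):
        # Mark a flavor as requiring dedicated (pinned) physical CPUs.
        subprocess.run(
            ["openstack", "flavor", "set",
             "--property", "hw:cpu_policy=dedicated",
             flavor],
            check=True,
        )

    if __name__ == "__main__":
        pin_flavor_cpus()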

NUMA

Requirement Products

P.3 NUMA placement should be supported. All

I/O-based NUMA scheduling is enabled by default on all supported OpenStack versions and requires
no further configuration. This is of particular importance for high-performance Perimeta deployments.

SR-IOV and PCI device affinity rules

Requirement Products

P.5 The virtual infrastructure should provide an accelerated        Perimeta, Rhino nodes,
data plane - SR-IOV, PCI passthrough or a DPDK-                     Secure Distribution Engine,
accelerated vSwitch (only SR-IOV for the Secure                     EAS, Storage Cluster
Distribution Engine (SDE)).


From OpenStack Pike onward, the Nova scheduler can be configured to favor hosts which best match
the requested PCI devices for an instance, including any Virtual Function (VF) devices if using SR-
IOV. If you are using SR-IOV in your deployment but it is not supported by all hosts, you may wish to
use the Nova scheduler to ensure that any VMs that do not use SR-IOV are preferentially assigned
to hosts that do not support it, leaving those hosts that do support SR-IOV for VMs that require it. For
more information, see Anti-affinity and scheduling in Metaswitch Products OpenStack Deployment
Design Guide.

Compatible networking layer

The following is an overview of networking layer compatibility with Metaswitch products.

Note:

A "Compatible" networking layer is known to work, subject to being configured in such a way that it
confirms to the other requirements in this document. It may be possible to configure a Compatible
networking layer so that the Metaswitch product(s) are inoperable with it.

Table 2: Networking layer compatibility

Perimeta CFS, AGC/MGC, MVD, Other products


MVS

Open vSwitch Compatible Compatible Compatible

SR-IOV Compatible Compatible Compatible

Full PCI passthrough Compatible Compatible Compatible

Juniper Contrail Compatible Compatible Compatible


vRouter

Calico Incompatible Incompatible Compatible

Other networking layers Contact your sales Contact your sales Compatible
engineer or support engineer or support
representative representative

In general, compatibility of a networking layer with a Metaswitch product is dependent on the following
properties.

• The ability to move and reserve IP addresses (see requirement M.8 in Moving and reserving IP
addresses on page 47).
• The driver required for the vNIC exposed by the networking layer.

• Perimeta is only compatible with the following network interface drivers.

• virtio - an option in most vSwitch implementations (e.g. standard Open vSwitch ports, OVS-
DPDK vhost-user backed ports, etc.).
• ixgbevf (SR-IOV) - Intel NICs using the 82599 Ethernet controller.


• enic (PCI passthrough) - Cisco VIC NICs in the Cisco UCS platform.

For more information, see Specific high-performance requirements for Perimeta, Rhino nodes
and the Secure Distribution Engine in Metaswitch Products OpenStack Deployment Design
Guide.
• Other products are compatible with commonly used network interface drivers established in
Linux based operating systems.

Performance of networking layers will vary widely, especially in packet rate throughput. This has the
largest effect on media handling products, especially Perimeta, but all products may hit limits in the
networking layer at lower loads than their rated capacity. The best performance is achieved when
using SR-IOV or PCI passthrough.

5.2 VMware requirements

5.2.1 VMware versions

Metaswitch virtual products support VMware with the ESXi hypervisor. This section lists the VMware
versions supported by each Metaswitch product.

Except where explicitly indicated in VMware version support by product on page 52, there is
no version dependency between vCenter version and Metaswitch software version: all versions of
Metaswitch products supported on VMware are designed to be compatible with all supported vSphere
versions. There is no requirement, for example, to run a particular Metaswitch product version on
vSphere 5.5 or vSphere 6.0. We aim to support new vSphere versions within four months of release.

Note:

For supported deployments across multiple hosts and vCenters, those hosts and vCenters may be
versioned independently, as long as these versions are fully supported by all of the products in your
deployment. For example, should you upgrade a vCenter from v5.5 to v6.0, there is no requirement
that you simultaneously upgrade your other vCenters.

Attention:

If you upgrade VMware, do not change VM hardware versions or upgrade VMware Tools on any
Metaswitch VM. VM hardware versions should remain as specified in the product OVFs or OVAs.
Other VM hardware versions are not tested, and in some cases may cause improper operation of
the VM.


VMware version support by product

Product vSphere version vCloud version

Clearwater Core

  vSphere version:

  Clearwater Core V11.5 and later: V6.5+

  Clearwater Core V11.4:
  • Supported: 5.5+
  • Recommended: 6.0+

  Earlier Clearwater Core versions:
  • Supported: 5.5-7.0
  • Recommended: 6.0-7.0

  vCloud version: Not supported

Deployment Configuration Store (DCS) Supported: 6.7 V1.3 and later:

10.2

(not supported prior to


V1.3)

Distributed Admission Manager (DAM) Supported: 5.5-6.7 Not supported

Recommended: 6.0-6.7

Distributed Capacity Manager

  vSphere version:

  DCM V4.0+:
  • Supported: 6.5-7.0

  DCM V3.4:
  • Supported: 5.5-7.0
  • Recommended: 6.0-7.0

  DCM V3.3:
  • Supported: 5.5-6.7
  • Recommended: 6.0-6.7

  Earlier DCM versions:
  • Supported: 5.1-6.5
  • Recommended: 6.0-6.5

  vCloud version:

  DCM V4.1+:
  • Supported: 10.2, 10.2.1

  Earlier DCM versions: contact your support representative for details.


MetaSphere CFS (including RPAS, OBS, MRS), Metaswitch AGC / MGC, MetaSphere EAS,
MetaView Server, MetaView Director

  vSphere version:

  V9.5.40 and later:
  • Supported: 5.5-7.0
  • Recommended: 6.0-7.0

  Earlier versions:
  • Supported: 5.5-6.7
  • Recommended: 6.0-6.7

  vCloud version:

  V9.6.20 and later: 10.2 (not supported prior to V9.6.20)

MetaSphere N-Series Supported: 5.5-7.0 Not supported

Recommended: 6.0-7.0

ESA Proxy, Advanced Messaging Service (AMS) V9.5.40 and later: Not supported

• Supported: 5.5-7.0
• Recommended:
6.0-7.0

Earlier versions:

• Supported: 5.5-6.7
• Recommended:
6.0-6.7

Metaswitch CCF Metaswitch CCF V9.0 Not supported


and later:

• Supported: 6.5+
• Recommended:
7.0+

Metaswitch CCF V8.0


and later:

• Supported: 5.5+
• Recommended:
6.0+

Earlier Metaswitch CCF


versions:

• Supported: 5.5-7.0
• Recommended:
6.0-7.0

5 Environment-specific requirements 53
Virtual Infrastructure Requirements Guide (V8.5) CONFIDENTIAL

Product vSphere version vCloud version

Metaswitch Deployment Manager (MDM) Supported: 6.7-7.0 V2.32.0 and later:

10.2

(not supported prior to


V2.32.0)

MVM Not supported Not supported

Perimeta Perimeta V4.9.25 and From V4.9.10:


later:
10.2
• Supported: 6.5-7.0
(not supported prior to
• Recommended: V4.9.10)
6.5-7.0

Perimeta V4.8.20 to
V4.9.20:

• Supported: 6.5-7.0
• Recommended:
6.5-6.7

Perimeta V4.5 to
V4.8.10:

• Supported: 5.5-6.7
• Recommended:
6.0-6.7

Earlier Perimeta
versions:

• Supported: 5.1-6.5
• Recommended:
6.0-6.5

See additional note


below.

QCall QCall V2.3.2 and later: Not supported

• Supported: 6.5-7.0

Earlier QCall versions:

• Supported: 5.5-6.7
• Recommended:
6.0-6.7

54 5 Environment-specific requirements
CONFIDENTIAL Virtual Infrastructure Requirements Guide (V8.5)

Product vSphere version vCloud version

Rhino VoLTE TAS nodes Supported: 5.1-7.0 Not supported

Recommended: 6.0-7.0

See additional note


below.

Rhino Mobile Control Point nodes Supported: 5.1-7.0 Not supported

Recommended: 6.0-7.0

See additional note


below.

Rhino MaX nodes Supported: 5.1-7.0 Not supported

Recommended: 6.0-7.0

See additional note


below.

Secure Distribution Engine (SDE) Supported: 6.7 Not supported

Service Assurance Server Supported: 5.5-7.0 V12.20 and later:

Recommended: 6.0-7.0 10.2

(not supported prior to


V12.20)

(only supported in
SIMPL VM managed
deployments)

ServiceIQ Management Platform (SIMPL) Supported: 6.5 - 7.0 V6.6.0 and later:

10.4

(not supported prior to


V6.6.0)

ServiceIQ Monitoring (SIMon) Supported: 5.5-7.0 Not supported

Recommended: 6.0-7.0

All other products Supported: 5.5-6.5 Not supported

Recommended: 6.0-6.5

For Rhino nodes and Perimeta, vSphere 5.1 has a specific limitation that you must consider when
deciding which version of vSphere to use. As indicated in Recommendations for performance on
page 41, SR-IOV or PCI passthrough must be used for high performance. In vSphere 5.1, you cannot
configure SR-IOV using the VMware vSphere Client or Web Client; instead, you must use the CLI
exposed by the host.

Attention:

If you plan to use NSX-T networking, you must use vCenter Server 6.7 Update 3b or later. When
NSX-T networking is used with earlier vCenter Server versions, some Metaswitch VMs may not be
able to reliably read their networking configuration from the VMware environment.

5.2.2 vSphere HA

This section describes vSphere HA and details the Metaswitch products that require it as their
redundancy mechanism.

vSphere High Availability (vSphere HA) allows manual or automated recovery of a VM when its
host fails. Products using vSphere HA must also be using shared persistent storage. vSphere HA is
required for the following products, which use it to deliver local resilience to failure:

• Single-instance EAS
• N-Series, except for N-Series (Basic) Music-on-Hold servers deployed in a pool for scaling
• MetaView Server (MVS) unless deployed as an active-standby pair.

vSphere HA is also supported for products not listed above; for these, its effect is to allow automatic
restoration of redundancy after a host failure.
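
As a hedged illustration (reusing a pyVmomi session 'si' obtained as in the earlier version-check
sketch), the following reports whether vSphere HA is enabled on each cluster:

    from pyVmomi import vim

    def report_vsphere_ha(si):
        # List each cluster and whether vSphere HA is enabled on it.
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.ClusterComputeResource], True)
        for cluster in view.view:
            enabled = bool(cluster.configuration.dasConfig.enabled)
            print(f"{cluster.name}: vSphere HA {'enabled' if enabled else 'NOT enabled'}")
        view.Destroy()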

5.2.3 Detailed VMware hints and tips

This section contains detailed hints and tips on assessing whether a certain virtualization environment
will meet the requirements set out in this manual. It is not intended to be an exhaustive reference.

Uncontended resources

Requirement M.2 (Products: All)
It must be possible to guarantee that the disk space required by the service is available at all times.

To guarantee dedicated disk space, the disk volumes used by Metaswitch product VMs must be thick
provisioned on the storage back end.
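
One way to audit this is sketched below (assumptions: a pyVmomi session 'si' as in the earlier
sketches, and a placeholder list of your Metaswitch VM names). It flags any flat VMDK disks on those
VMs that report thin provisioning:

    from pyVmomi import vim

    def flag_thin_disks(si, metaswitch_vm_names):
        # Warn about any virtual disks on the named VMs that are thin provisioned.
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            if vm.name not in metaswitch_vm_names or vm.config is None:
                continue
            for dev in vm.config.hardware.device:
                if isinstance(dev, vim.vm.device.VirtualDisk):
                    # Flat VMDK backings expose a thinProvisioned flag.
                    if getattr(dev.backing, "thinProvisioned", False):
                        print(f"{vm.name}: {dev.deviceInfo.label} is thin provisioned")
        view.Destroy()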

CPU not over-committed

Requirement P.1 (Products: All)
CPU and network bandwidth should not be over-committed.


To ensure that your Metaswitch VMs have guaranteed access to CPU resources, you must ensure
that any hosts on which Metaswitch VMs run are not overcommitted. In particular, you must ensure
that the processor resource required by the VMs and the hypervisor/management processes on a
given host does not exceed the number of logical cores available.

Attention:

Reserving CPU resources for the sole use of Metaswitch VMs on a given host is necessary but not
sufficient to ensure that there is no contention for resources with non-Metaswitch VMs deployed on
the same host. To guarantee that Metaswitch VMs can access the reserved CPU resource at all
times, you must either run Metaswitch VMs on dedicated hosts separate from third-party VMs in the
VMware deployment or reserve CPU for all VMs on hosts shared with Metaswitch VMs. See CPU
resource considerations in Metaswitch Products VMware Deployment Design Guide for details.
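
The following sketch illustrates one such check (assumptions: a pyVmomi session 'si' as in the earlier
sketches, and a placeholder core allowance for hypervisor/management overhead that you should size
for your own environment). It compares the vCPUs configured on powered-on VMs against the logical
cores available on each host:

    from pyVmomi import vim

    HYPERVISOR_CORE_MARGIN = 2  # placeholder allowance for ESXi management processes

    def check_cpu_commitment(si):
        # Compare configured vCPUs of powered-on VMs against host logical cores.
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            logical_cores = host.summary.hardware.numCpuThreads
            vcpus = sum(vm.config.hardware.numCPU for vm in host.vm
                        if vm.config is not None
                        and vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn)
            status = "OK" if vcpus <= logical_cores - HYPERVISOR_CORE_MARGIN else "over-committed"
            print(f"{host.name}: {vcpus} vCPUs vs {logical_cores} logical cores ({status})")
        view.Destroy()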


6 Designing for High Availability

The following sections offer advice and guidance on how to design your virtual infrastructure for high
availability.

Protection against virtual infrastructure failure on page 21 describes the protection Metaswitch
products offer against failure of the virtual infrastructure. The following sections outline some key
decisions you will need to make about infrastructure-level availability, and best-practice guidelines for
achieving your desired availability model.

In particular, it is important to consider which of the following availability models you will use, based on
which offers the best cost/benefit tradeoff for delivering your desired level of service availability:

• A single virtual infrastructure for your entire deployment that provides Telco-grade reliability
• Multiple virtual infrastructures, each providing IT-grade reliability, with sufficient resources between
them that deployment-level availability is maintained in the event of a single infrastructure failure.

6.1 Telco vs. IT approaches to availability


The traditional Telco industry approach to high availability starts with building very reliable Central
Offices (COs) that can deliver 99.9999% uptime with failures expected only every few years, usually
as a result of an external catastrophe such as fire or flood.

In this approach, the entire physical infrastructure is redundant. There are multiple power feeds
from independent sources together with backup generation capacity and/or battery banks. There is
redundant air conditioning. Networking is fully redundant, with each server connected multiple times
to multiple switches. Individual servers are usually NEBS-grade, and can, for example, operate at
55°C for days after a total cooling failure. Larger COs are split into multiple rooms with fire protection
between them. This approach is proven to deliver high reliability, but it is extremely expensive.

By contrast, data centers designed to host enterprise IT workloads such as email, databases and web
servers are usually built with a lower level of redundancy, aiming for decent reliability but at a much
lower cost. For example, they use COTS equipment such as standard Dell servers rather than NEBS-
grade components.

The Telecommunications Industry Association defined a commonly used grading system for data
centers (see ANSI/TIA-942). It has four Tiers: Tier 1 is a basic server room, and Tiers 2 to 4 add
increasing levels of physical redundancy, with expected availability ranging from 99.7% to 99.99% and
with outages expected every few weeks or months.

The trend in the industry, pioneered by the high-scale web service providers such as Google,
Facebook and Amazon, is now to deliver very high levels of service availability through smart
application designs running across multiple independent cheap data centers, usually built to simple
Tier 2 standards. Individual servers may not even have redundant networking or RAIDed disks.


There are pros and cons to both of these models, and only you can work out which approach is best
for you. Metaswitch services are designed to be able to run and deliver high availability in both these
environments.

6.2 Hidden failure modes


When designing a virtual infrastructure, it is easy to overlook one common but subtle cause of
infrastructure failure: the use of distributed state storage and its interaction with data center failure
zones.

Failure zones
Data centers are often internally divided into multiple physical failure zones, each typically with its
own power supplies or cooling. These could be as simple as adjacent racks in the same room, or
physically separate buildings within a campus to prevent fires from spreading. If a zone fails, the data
center as a whole carries on running, albeit with reduced capacity.

This makes the data center much more reliable than its individual failure zones. For example, in a
Tier 4 data center each failure zone individually may be only 99.9% available, but having two or
more makes the data center 99.995% available.

Distributed state storage


It is now quite common for virtual environments (and applications) to make use of distributed state
storage. For example, OpenStack clouds will often use ceph and VMware deployments may use
vSAN as their distributed storage solutions. Applications often use NoSQL databases like Cassandra
or MongoDB.

All of these distributed storage solutions work by replicating each chunk of data as multiple copies
stored on separate nodes. A fundamental property of such systems is that, under failure conditions,
they can provide only two of availability, data consistency and partition tolerance (see
https://en.wikipedia.org/wiki/CAP_theorem). Most storage solutions (including ceph and vSAN)
choose partition tolerance (otherwise single network failures would bring down the entire system) and data
consistency, as that provides the behavior their clients expect (specifically, that data doesn't "change
under their feet"). However, this means they use quorum writes (and possibly reads), so for a write
(or read) to succeed, more than half of the underlying storage nodes must be available.

Implications
If your virtual infrastructure has a critical subsystem that uses a quorum (such as ceph or vSAN)
then it must be deployed across three or more failure zones. If deployed across just two, then there
is a 50% chance that a single zone failing will bring down the entire virtual infrastructure (as it is no
longer quorate). It is easy to be fooled into thinking that you have a 99.995% available data center
constructed from two independent zones which are each 99.9%, whereas in reality you only have
99.9% availability because of this hidden quorum issue.
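
The arithmetic behind this is easy to check. The illustrative sketch below (plain Python, not tied to any
particular storage product) tests whether a given spread of quorum nodes across zones can survive the
loss of any single zone:

    def zones_whose_loss_breaks_quorum(nodes_per_zone):
        # A read/write quorum needs a strict majority of all nodes.
        total = sum(nodes_per_zone)
        quorum = total // 2 + 1
        # A zone is dangerous if losing it leaves fewer than 'quorum' nodes.
        return [zone for zone, lost in enumerate(nodes_per_zone) if total - lost < quorum]

    # Five quorum nodes split across two zones: losing the larger zone (a 50%
    # chance if either zone is equally likely to fail) drops below quorum.
    print(zones_whose_loss_breaks_quorum([3, 2]))     # -> [0]
    # A small third tie-breaker zone removes the problem.
    print(zones_whose_loss_breaks_quorum([2, 2, 1]))  # -> []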


If your data center has just two main failure zones, and your virtual infrastructure relies on quorums,
then you have two options:

• One compromise option favored by some SPs is to build "two and a half" zones: two large zones
hosting compute and storage nodes, plus one smaller zone that hosts only the much smaller number
of servers needed to act as a quorum tie-breaker. If you are deploying in a CO, you may find it
already has small side rooms, typically used for management servers, which are ideal for this
purpose.
• You can deploy two instances of your virtual infrastructure, each wholly contained within a single
failure zone, and then use the application techniques outlined in 2.4.2 to spread the service across
the two.

6.3 Best practice for specific environments

OpenStack
You should follow OpenStack’s and your distro vendor’s best practice for deploying high availability.
See http://docs.openstack.org/ha-guide/.

VMware
You should follow VMware’s best practice for virtual environments. See http://pubs.vmware.com/
vsphere-55/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-55-networking-guide.pdf.

For the highest availability of individual hosts, use multiple physical links to a redundant switch
infrastructure, with the virtual network configuration such that each virtual interface is served by at
least two of these physical links.
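
As a hedged example (a pyVmomi session 'si' as in the sketches in the VMware requirements section
is assumed; distributed switches and the SR-IOV exception noted below are out of scope), the following
flags standard vSwitches backed by fewer than two physical NICs:

    from pyVmomi import vim

    def flag_single_uplink_vswitches(si):
        # Report standard vSwitches that have fewer than two physical uplinks.
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            if host.config is None:
                continue  # skip disconnected hosts
            for vswitch in host.config.network.vswitch:
                uplinks = vswitch.pnic or []
                if len(uplinks) < 2:
                    print(f"{host.name}: vSwitch {vswitch.name} has "
                          f"{len(uplinks)} physical uplink(s)")
        view.Destroy()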

Note:

Perimeta SR-IOV interfaces are an exception to the above. Here, link redundancy is implemented
at the VM level through Perimeta IP configuration, since it cannot be provided by the underlying
virtual switch.

You must ensure that the link used for high availability traffic has a bandwidth of at least 1Gbps.
