LightOS Install Guide v3.2.1
Lightbits Labs
April 2023
Contents
Lightbits™ v3.x Installation and Configuration Guide
Troubleshooting
    Ansible Role Errors
    SSH Strict Key Errors When Using sshpass
    Free Space in Linux OS for etcd Logical Volume Manager Use
    Recovering from Cluster Installation Failure
    Log Artifacts Collection
    Fully Clean Lightbits From Servers or Cluster
Appendixes
    Host Configuration File Variables
    Host Configuration File Examples
        Example 1: Data Network Interface Manually Configured
        Example 2: Data Network Interface Automatically Configured
        Example 3: Override the Lightbits Configurations
        Example 4: Provide Custom Datapath Configuration
        Example 5: Use the Linux Volume Manager (LVM) Partition for etcd Data
        Example 6: Profile-Generator Overrides
        Example 7: Dual Instance Configuration
        Example 8: Single IP Dual NUMA Configuration
    Performing an Offline Installation
        Offline Ansible Controller Installation and Self-Signed Certificates
    Configuring the Data Network
        Automatic Data Network Configuration (Recommended)
        Manual Data Network Configuration
    etcd Partitioning
    Using SSH-Key Authentication
    Network Time Protocol Configuration
    Automated Client Connectivity Verification
    Configuring Grafana and Prometheus
        Prerequisite
        Installing Grafana and Prometheus
        Usage
        Outcome
        Lightbits Monitoring Integration with Existing Grafana and Prometheus
    Using Grafana and Prometheus
    Open TCP Ports and Verify
    Installation Behind HTTP-Proxy
    Single-IP-Dual-NUMA Configuration
    Adding a JWT Token To a Configuration File
        Running lbcli From a Non-Lightbits Server
About - Legal
Note: There are two installation methods: the online installation method, which connects to online
repositories to download the Lightbits software, and the offline installation method, which uses Lightbits
software files stored locally on the Ansible host. For the offline method, Lightbits provides the software
files in advance. This guide mainly covers the online installation method, with notes on where the offline
installation method differs.
The following table describes the components and resources in the Lightbits cluster topology diagram, based on
the numbers next to each component or resource in the diagram.
Lightbits Cluster Topology Components Table
Lightbits also protects the storage cluster from additional failures not related to the SSDs (e.g., CPU, memory, NICs),
software failures, network failures, or rack power failures. It provides additional data security through in-server Erasure
Coding (EC) that protects servers from SSD failures and enables non-disruptive maintenance routines that temporarily
disable access to storage servers (e.g., TOR firmware upgrades).
The following sections describe the failure domain and volume components used in the Lightbits cluster architecture.
Note: For more information about Lightbits cluster architecture, see the Deploying Reliable High-Performance
Storage with Lightbits Whitepaper.
Nodes
Each server can be split into multiple logical nodes. Each logical node owns a specific set of SSDs and CPUs, and a
portion of the RAM and NVRAM. The physical network can be shared or exclusive per node.
Nodes can be across NUMAs or per NUMA. There is no relation or limitation between a logical node and the NUMA of
the resources used by the logical node.
Each storage server runs a single Node Manager service. The service controls all the logical nodes of the storage server.
Note: The current Lightbits release only supports up to two logical nodes per server. The single logical node
deployment is commonly referred to as a “single instance, node, or NUMA deployment”. Dual logical node deployment
is referred to as a “dual instance, node, or NUMA deployment”.
Failure Domains
Users define Failure Domains (FDs) based on the data center topology and the level of protection they strive to achieve.
Each server in the cluster can be assigned to a set of FDs.
An example of an FD definition is separating racks of servers by FD labels. In this case, all servers in the same rack are
assigned the same FD label, while servers in different racks are assigned distinct labels (e.g., FD label = rack ID). Two
replicas of the same volume will not be located on two nodes in the same rack.
The system stores different replicas of the data on separate FDs to keep data protected from failures.
The definition of an FD is expressed by assigning FD labels to the storage nodes. Single or multiple FD labels can be
assigned to every node.
Another example of an FD definition is grid topology, in which every node is assigned a label of a row and a label of a
column. In this case, the volume is not stored on two servers that are placed on the same row or on the same column.
Note: Per the previous section, servers can be configured using a single or dual instance. The same Failure Domain
rules apply to dual instance deployments; in addition, replicas of the same volume will never be placed on different
nodes of the same server, because any server failure will usually affect both nodes.
For more information on Failure Domain configuration, see the Lightbits Administration Guide.
Lightbits recommends you complete each of these steps in the order that they are written to ensure a successful software
installation and connection between the Lightbits Storage Server and the clients.
Note: To complete the installation process, you must have the Lightbits Installation - Customer Addendum
that was sent to you by Lightbits. The customer addendum contains customer-specific information and is referred
to throughout the installation procedure.
Installation Preparation
Before you begin the installation, Lightbits recommends that you create a reference table to list the networking and server
names you will use for your Lightbits cluster. The following is an example of a table you can use with the Configuring
the Ansible Environment section.
Installation Planning Table
Note: The following represents a cluster with three Lightbits servers with a single client.
Note that during this installation, client00 will function as the Ansible installation host. However, any server
that has SSH connectivity to the management IP of each server will work.
This table appears throughout this installation guide to help you follow the Lightbits installation process, to show the
progress you have made to complete the installation, and to successfully configure a cluster of servers.
Additional relevant information about this cluster:
* The servers run CentOS 7.9; however, other OSs are supported with different build releases. For additional information, see General System Requirements.
* The Lightbits GA release will be installed. The Red Hat build supports additional OSs. For additional information, see General System Requirements.
* We will install in single instance/NUMA/node mode; however, if the servers have two or more NUMA nodes, we could also do a dual instance/NUMA/node installation. Examples for this will be provided.
Lightbits Cluster Installation Process
#   Installation Steps
1   Connecting your installation workstation to Lightbits’ software repository
2   Verifying the network connectivity of the servers used in the cluster
3   Setting up an Ansible environment on your installation workstation
4   Installing a Lightbits cluster by running the Ansible installation playbook
5   Updating clients (if required)
6   Provisioning storage, connecting clients, and performing IO tests
• The Linux distribution that your clients use must have the NVMe/TCP client-side drivers. These drivers are included in Linux kernel v5.3.5 and later.
  – If your system’s Linux distribution does not include this kernel version or a later version, download back-ported NVMe/TCP client-side drivers for specific kernels and distributions from the Lightbits drivers webpage.
• The Lightbits software kernel requires a boot partition with at least 512 MB available.
• To complete the installation process, you will need information from your version of the Lightbits Installation - Customer Addendum. If you do not have the customer addendum, contact a Lightbits representative to receive a copy.
• For more information about which Python version supports Ansible, see the Ansible Installation Guide.
• Lightbits comes in two kinds of releases: GA or Red Hat. The GA releases support CentOS and come with a Lightbits-customized kernel. The Red Hat releases support Red Hat-based OSs (including Alma and Rocky) and require a specific system kernel to be installed. For more, see Red Hat Linux Installation.
The following table details the supported Lightbits operating systems and kernels.
Note: For Lightbits GA releases, the kernel version shown is installed on the servers by the Ansible installation. For
Lightbits RHEL releases, the kernel version shown must be pre-installed on the servers in order for the Ansible
installation of Lightbits to work.
Note: If using the Single IP Dual NUMA configuration (see Single-IP-Dual-NUMA Configuration), open the above
ports plus two additional ports: 4421 and 22227. Duroslight will use 4420 and the additional 4421 port. Replicator
will use 22226 and the additional 22227 port.
See Open TCP Ports and Verify for examples of how to open and test TCP ports.
If you need to check a port’s accessibility, you can use the following procedure with the open-source nmap program (a concrete example follows the steps):
1. Install the open-source nmap program with the following command:
   $ yum install -y nmap
   Note: If testing port accessibility from a non-rpm/yum based operating system, the installation will differ, but the commands below should still work, as nmap installs and relies on nc (netcat).
2. Check a port’s accessibility with either of the following commands:
   $ nc -v -z <ip> <start port>-<end port>
   or
   $ nc -v -u <ip> <start port>-<end port>
3. The netcat program must be running in listen mode on the server you are testing, started with the following command:
   $ nc -l -p <port>
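For example, to check the Duroslight port from another machine before installation (the data IP and port below are illustrative; substitute your own values from Required Ports):

# On the Lightbits server under test, listen on the port (example: Duroslight, 4420):
$ nc -l -p 4420

# From the client or another server, probe that port (illustrative data IP):
$ nc -v -z 10.0.10.101 4420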
Document Descriptions
Lightbits Installation Guide (this document): Contains the instructions to install the Lightbits cluster software, install the Linux cluster client software, and then connect the cluster client to Lightbits.
Installation Guide - Customer Addendum: Includes customer-specific passwords to access installation files.
Lightbits Administration Guide: Provides detailed information about the operations you can perform using the Lightbits lbcli CLI command and REST API. Note: After you complete the installation process in this document, refer to the Administration Guide for important management and automation instructions.
User’s Manual: Lightbits REST and CLI API: Lists the low-level details of REST API and CLI command usage. This document is typically used as a reference manual when building and administering the system. Note: See the Administration Guide for detailed examples of using the REST API and CLI commands.
The following diagram shows how to use the documents to install, test, and maintain Lightbits products, and how the
above referenced documents can be used to support the typical user experience.
Note that:
* The servers run CentOS 7.9; however, other OSs are supported with different build releases. For additional information, see General System Requirements.
* The Lightbits GA release will be installed. The Red Hat build supports additional OSs. For additional information, see General System Requirements.
* We will install using single instance/NUMA/node mode; however, if the servers have two or more NUMA nodes, we could also do a dual instance/NUMA/node installation. Example Ansible configurations are provided in Host Configuration File Examples.
Also, review the data networking and NVMe drive placement of the servers. This will be important during the installation configuration phase.
The online installation requires an internet connection and configuring several files on your system. The file
repository URL must be accessible and the RPMs up to date. The data interfaces of each server must be pre-configured on
the same subnet. In our example, the subnet is 10.0.10.0/24 (if the data interfaces are not pre-configured, they can be
configured later using Ansible).
Additionally, check the NUMA placement of the NVMe drives. The command below shows which NUMA each NVMe
drive belongs to (you can update your table with this information). Note that the main example of this installation section
assumes that each server has six NVMe drives in NUMA 0.
$ lspci -mm | grep -Ei "nvme|SSD|Non-Volatile memory controller" | awk '{print $1}' | \
  xargs -I{} bash -c 'D=/sys/bus/pci/devices/0000:{}/; echo -n "$D: "; \
  echo $(cat $D/numa_node 2>/dev/null), $(cat $D/label 2>/dev/null)' | nl

# The example output is unrelated to the cluster we are installing. However, it shows
# how to interpret the command output. This shows 8 drives in NUMA 0 and 8 drives in
# NUMA 1. The column after numa_node shows the NUMA ID.
 1  /sys/bus/pci/devices/0000:62:00.0/ numa_node 0
 2  /sys/bus/pci/devices/0000:63:00.0/ numa_node 0
 3  /sys/bus/pci/devices/0000:64:00.0/ numa_node 0
 4  /sys/bus/pci/devices/0000:65:00.0/ numa_node 0
 5  /sys/bus/pci/devices/0000:66:00.0/ numa_node 0
 6  /sys/bus/pci/devices/0000:67:00.0/ numa_node 0
 7  /sys/bus/pci/devices/0000:68:00.0/ numa_node 0
 8  /sys/bus/pci/devices/0000:69:00.0/ numa_node 0
 9  /sys/bus/pci/devices/0000:b3:00.0/ numa_node 1
10  /sys/bus/pci/devices/0000:b4:00.0/ numa_node 1
11  /sys/bus/pci/devices/0000:b5:00.0/ numa_node 1
12  /sys/bus/pci/devices/0000:b6:00.0/ numa_node 1
13  /sys/bus/pci/devices/0000:b7:00.0/ numa_node 1
14  /sys/bus/pci/devices/0000:b8:00.0/ numa_node 1
15  /sys/bus/pci/devices/0000:b9:00.0/ numa_node 1
16  /sys/bus/pci/devices/0000:ba:00.0/ numa_node 1
The online installation requires an internet connection on the Lightbits servers, and several files must be configured on
your system.
Note: An offline installation method is available that does not require an internet connection to access the file
repository URL. For more information, see Performing an Offline Installation.
#   Installation Steps
1   Connecting your installation workstation to Lightbits’ software repository
2   Verifying the network connectivity of the servers used in the cluster
3   Setting up an Ansible environment on your installation workstation
4   Installing a Lightbits cluster by running the Ansible installation playbook
5   Updating clients (if required)
6   Provisioning storage, connecting clients, and performing IO tests
Notes:
- To proceed, see the Linux Repo File Customer TOKEN section in your Lightbits Installation Customer Addendum for the token that is required to access the yum repository. Access to this repository is required to install the Lightbits cluster software.
- Contact Lightbits Support if you do not have this addendum document.
- If you are using the offline installation method, you can skip this step and proceed to Verifying Network Connectivity for the Servers in the Cluster.
- For information on installing Red Hat, see Red Hat Linux Installation. Note that Red Hat releases have a slightly different baseurl, which is visible in the Lightbits Installation Customer Addendum.
Verify that you have the TOKEN & baseurl for the Lightbits RPM Repository. Log in to any of the future Lightbits
servers and test the connection to the repository.
Note: Ideally you will want to test from each Lightbits server. However, testing on one and verifying that the rest
have internet connectivity should be sufficient. If one of the servers is not able to reach the repository, there will
be clear error messages during the install, which can be resolved later.
1. In your preferred text editor, create the following new file on the workstation: /etc/yum.repos.d/lightos.repo
2. Copy the following template into the file.
# Lightbits repository
[lightos]
name=lightos
baseurl=https://dl.lightbitslabs.com/<YOUR_TOKEN>/lightos-3-<Minor Ver>-x-ga/rpm/el/7/$basearch
repo_gpgcheck=0
enabled=1
gpgcheck=0
autorefresh=1
type=rpm-md
For the <YOUR_TOKEN>, enter the Lightbits token that was included in your copy of the Lightbits Installation Customer
Addendum.
Verify that the baseurl path matches the one in the Lightbits Installation Customer Addendum, specifically the parts after
the <YOUR_TOKEN>.
3. Save the lightos.repo file.
4. Verify your system’s connectivity to the repository by entering the yum repolist command. This command displays
the enabled software repositories. For example:
$ yum repolist
Make sure that the command exits successfully. If it shows any errors, address them before continuing.
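For example, a quick check that the lightos repository is listed (exact output varies by system):

$ yum repolist | grep -i lightos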
Note: For information on installing Red Hat, see Red Hat Linux Installation.
#   Installation Steps
1   Connecting your installation workstation to Lightbits’ software repository
2   Verifying the network connectivity of the servers used in the cluster
3   Setting up an Ansible environment on your installation workstation
4   Installing a Lightbits cluster by running the Ansible installation playbook
5   Updating clients (if required)
6   Provisioning storage, connecting clients, and performing IO tests
Lightbits recommends that you verify the network connectivity for the servers you plan to use in the Lightbits cluster
before you run the Ansible playbook. To simply confirm the connectivity status, use a ping command for each of the
management NIC IPs and data NIC IPs in the servers.
Referring back to the Installation Planning Table, the example uses three Lightbits servers. Each server has a management
IP.
Before proceeding with the installation, enter the following ping command from the Ansible installation host, to confirm
that each Lightbits server is accessible via the Management Network IP.
$ ping -c 4 192.168.16.22
PING 192.168.16.22 (192.168.16.22) 56(84) bytes of data.
64 bytes from 192.168.16.22: icmp_seq=1 ttl=64 time=0.208 ms

--- 192.168.16.22 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.208/0.208/0.208/0.000 ms
Continuing with the example, a ping command is sent to each of the management network IPs and data network IPs.
Before continuing, confirm the following connections (a short loop for checking connectivity from the Ansible installation host is sketched after this list):
* The Ansible installation host has ping connectivity to each Lightbits storage server’s management network IP. It does not need to have data network connectivity.
* Each Lightbits storage server has connectivity to all of the management network IPs.
* Each Lightbits storage server has connectivity to all of the data network IPs.
* Additionally, review the section on Required Ports, and make sure all of those ports are open and accessible on the Lightbits storage servers.
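For example, from the Ansible installation host you can loop over the management IPs from the Installation Planning Table (the addresses below are illustrative; substitute your own):

$ for ip in 192.168.16.22 192.168.16.23 192.168.16.24; do
      ping -c 2 -W 2 "$ip" > /dev/null && echo "$ip reachable" || echo "$ip NOT reachable"
  done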
#   Installation Steps
1   Connecting your installation workstation to Lightbits’ software repository
2   Verifying the network connectivity of the servers used in the cluster
3   Setting up an Ansible environment on your installation workstation
4   Installing a Lightbits cluster by running the Ansible installation playbook
5   Updating clients (if required)
6   Provisioning storage, connecting clients, and performing IO tests
Ansible Method
For the Ansible method, follow these instructions to install the dependencies.
If not found, you can also install Ansible using pip for Python3:
$ pip3 install ansible
Note: If the installation fails due to a UnicodeEncodeError, the locale is not fully configured on the
Ansible host. Set the LC_ALL environment variable and run the pip3 install ansible command again. For
example, on UTF-8 systems, set the locale to: export LC_ALL=en_US.UTF-8
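For example, on a UTF-8 system:

$ export LC_ALL=en_US.UTF-8
$ pip3 install ansible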
Docker Method
Rather than installing and preparing Ansible and its dependencies yourself, Lightbits also provides a custom Ansible
container image for deploying the Lightbits cluster that contains all dependencies.
[root@client1 ~]# getenforce
Permissive
Note: The Docker username and password can be extracted from the repository baseurl. The username is the
path bit after the TOKEN in the baseurl. The password is the token.
Note: The directory does not have to be called light-app and does not have to be in $HOME. We only suggest
doing so because the examples in the following sections assume that the installation package was extracted
into $HOME/light-app/.
Unpacking this tarball creates the following Ansible directory structure inside of the light-app directory, which contains
the Ansible environment where the “ansible-playbook” command runs.
.
├── ansible
│   └── inventories
│       └── ...
├── ansible.cfg
├── playbooks
│   └── ...
├── plugins
│   └── ...
└── roles
    └── ...
Notes:
- The servers’ hostnames and Ansible names do not have to match. We usually refer to the servers’ Ansible names as server00, server01, server02, etc. These will become the servers’ identifying names within the Lightbits software; going forward, the servers are therefore referred to as serverXX.
- For the ansible_host field, provide the management IP. However, if the servers are configured only with data IPs and no management IP, then provide the data IP of the server (in this case the data IP doubles as both the management and data IP).
1. Open a text editor and edit the copied hosts example file, which is now found in the new ~/light-app/ansible/inventories/cluster_example/hosts path. Replace the ansible_host, ansible_ssh_pass, and ansible_become_user values with your environment’s relevant values, for each server that will be in your cluster. Refer to the following example.
[duros_nodes]
server00
server01
server02

[duros_nodes:vars]
local_repo_base_url=https://dl.lightbitslabs.com/<YOUR_TOKEN>/lightos-3-<Minor Ver>-x-ga/rpm/el/7/$basearch
auto_reboot=true
cluster_identifier=ae7bdeef-897e-4c5b-abef-20234abf21bf

[etcd]
server00
server01
server02

[initiators]
client00
• You can replace the ansible_host flag’s value with the interface DNS name or IP address. In this example, the management network IP addresses from the cluster details table are used, not the data network IPs.
• Also in this example hosts file, there is a local_repo_base_url entry that includes <YOUR_TOKEN>. This information was provided to you in the Customer Addendum. You will need to enter this value here before proceeding.
2. Remove the client00 line in the top section and the “[initiators]” section.
Note: It is possible to set up the Ansible files to install and configure clients. However, this section only describes
how to install the Lightbits storage servers. The next section details how to configure and connect clients.
3. Take into account the following information when filling out the “hosts” file.
More information about the hosts file:
- The top section of the “hosts” file describes how Ansible will connect to the servers to install Lightbits. It also provides a friendly name for each Lightbits server; for example, “server00”. These names will be used going forward by the Lightbits software as the identifying names of the servers.
- The “duros_nodes” section describes where Lightbits will be installed.
- The “duros_nodes:vars” section describes from where Lightbits will be installed. In this example the repo URL is provided, as this uses the online method. For the offline installation method this section is different; for more, see Performing an Offline Installation.
- The “local_repo_base_url” field must be filled in with the TOKEN and the remainder of the URI to properly direct the Ansible installation to the correct Lightbits repository. Different release versions have slightly different paths. The TOKEN is provided in the Customer Installation Addendum. The “local_repo_base_url” must be correct or the installation will fail. Its value should be the same as the “baseurl” value used in Connecting to the Lightbits Software Repository; if that step worked successfully, the value is correct.
- The “auto_reboot” and “cluster_identifier” fields should be left as is.
- The “auto_reboot” field instructs the servers to reboot during the installation. This is an important part of the installation.
- The “cluster_identifier” can be left as is, as it is unused. Note that the cluster will end up getting an auto-generated ID called the “clusterName” key, which can be changed after installation.
- Lightbits uses “etcd” for the cluster’s key/value database. The “etcd” section describes where etcd will be installed and which servers will become members of etcd. Every server appearing in “duros_nodes” must appear in the “etcd” section.
Host File Server Variables
Note: For information on installing Red Hat, see Red Hat Linux Installation.
4. The final “hosts” file will look similar to the above output. Save and exit the ~/light-app/ansible/inventories/cluster_example/hosts file.
Multi-Tenancy
Lightbits v2.2.1 and above enforces tenant isolation on the control plane (“multi-tenancy”). With multi-tenancy, multiple
tenants can share a Lightbits cluster without being able to see or affect each other’s resources when accessing the Lightbits
API or using the Lightbits command line tools.
Command line tools and all other API users must use the v2 Lightbits API. The v2 API includes provisions for authenti-
cation and authorization via standard JSON Web Tokens (“JWTs”), as well as transport security for all API operations.
The following three predefined roles are created by default:
• cluster-admin (system scope)
• admin (project scope)
• viewer (project scope)
Currently, roles cannot be added.
At installation, the user can provide their own certificate and CA to be used by the peers. If these files are not provided,
the installation will generate self-signed certificates.
Certificates Directory
By default, certificates are stored at certificates_directory=~/lightos-certificates on the Ansible controller machine.
certificates_directory can be overridden via the command line:

ansible-playbook playbooks/deploy-lightos.yml \
  -e 'certificates_directory=/path/to/certs' ...

Or via group_vars/all.yml:

certificates_directory: /path/to/certs
Certificate Types
Implementing multi-tenancy involves three sets of certificates:
• etcd Certificates For mTLS Peer Communication
• API Service Certificates For TLS
• System Scope Cluster Admin Certificates
• etcd-ca: Certificate authority (CA) parameters for etcd certificates. This CA is used to sign certificates used by
etcd (such as peer and server certificates).
• {ansible_hostname}-cert-etcd-peer: The peer certificate is used by etcd for peer communication.
These files are passed to the following etcd parameters: --peer-cert-file and --peer-key-file.
Note: {ansible_hostname} is the name we gave the etcd node in the hosts file.
Example
A 3-node cluster with server00-02 will result in:
etcd-ca-key.pem
etcd-ca.pem
server00-cert-etcd-peer-key.pem
server00-cert-etcd-peer.pem
server01-cert-etcd-peer-key.pem
server01-cert-etcd-peer.pem
server02-cert-etcd-peer-key.pem
server02-cert-etcd-peer.pem
Notes:
- These names are hard-coded in the installation script. Only the source directory can change.
- If these files are not provided, the installation will generate self-signed certificates and place them in certificates_directory on the Ansible controller machine.
Notes:
- Certificate file names are hard-coded in the installation script. Only the source directory can change. These files are pairs and go together.
- If you want to regenerate the self-signed certificates, delete the certificates_directory and all of its contents.
Note: If the latest Lightbits release supports a kernel or distribution newer than your OS, upgrade your OS and
kernel to match the supported OS and kernel before continuing with the Lightbits installation.
1. (Applies only to Red Hat) Make sure Red Hat subscription manager is registered and attached.
2. Edit the hosts file with the required target details. Consult with Lightbits Support for the Red Hat repository
baseurl value.
3. To ensure that the kernel does not get overwritten by another kernel, add use_lightos_kernel: false to group_vars/all.yml.
4. From Red Hat 8 based releases and onward, Chrony replaced NTP as the default network time protocol. Edit
all.yml to ensure that NTP is not installed and Chrony is configured. Comment out the NTP sections and set the
following variables:
ntp_enabled: false
chrony_enabled: true
ntp_manage_config: false
use_lightos_kernel: false
# ntp_servers:
#   - "0{{ '.' + ntp_area if ntp_area else '' }}.pool.ntp.org iburst"
#   - "1{{ '.' + ntp_area if ntp_area else '' }}.pool.ntp.org iburst"
#   - "2{{ '.' + ntp_area if ntp_area else '' }}.pool.ntp.org iburst"
#   - "3{{ '.' + ntp_area if ntp_area else '' }}.pool.ntp.org iburst"
# ntp_packages:
#   - "autogen-libopts*.rpm"
#   - "ntpdate*.rpm"
#   - "ntp*.rpm"
5. Install the Lightbits software as described in the Lightbits Cluster Software Installation Process section.
If all of the machines in the cluster have PMEM (NVDIMM or Intel Optane) installed, the persistent_memory flag
must be set as follows:
persistent_memory: true
If there are machines in the cluster that do not have PMEM installed, then set this flag to false.
Note: Since the persistent_memory flag is a global property for the whole cluster, it is important to declare this
flag only once, under the all.yml file, and not in host_vars files with different values.
IP ACL allows restricted or non-restricted access to a cluster. This feature must be enabled during installation
by setting the enable_iptables flag; otherwise it cannot be used.
When the enable_iptables flag is set to true, access to the cluster nodes is allowed only from client IPs that are defined
per volume using the ip_acl setting of each volume. By default, it is set to false. In order to use this mode, add the
following to all.yml:
enable_iptables: true
NTP vs Chrony
Check if your OS prefers NTP or Chrony, and proceed using that option.
NTP:
For NTP configurations, the default settings in the all.yml file can be used (note that the defaults are commented out). You
can uncomment and edit the preferred NTP servers, NTP version, and dependency packages. For more information
on these parameters, see Network Time Protocol Configuration.
Note: When configuring NTP, make sure chrony_enabled is either commented out as above or set to false.
Chrony:
For Chrony configurations - which are the default for Red Hat 8 based releases and onward - configure the all.yml settings
as follows:
Disable NTP and enable Chrony. Additionally, make sure that the Chrony service is configured on your servers.
ntp_enabled: false
chrony_enabled: true
ntp_manage_config: false
use_lightos_kernel: false
# ntp_servers:
#   - "0{{ '.' + ntp_area if ntp_area else '' }}.pool.ntp.org iburst"
#   - "1{{ '.' + ntp_area if ntp_area else '' }}.pool.ntp.org iburst"
#   - "2{{ '.' + ntp_area if ntp_area else '' }}.pool.ntp.org iburst"
#   - "3{{ '.' + ntp_area if ntp_area else '' }}.pool.ntp.org iburst"
# ntp_version: "ntp-4.2.6p5-29.el7.centos.x86_64"
# ntp_packages:
#   - "autogen-libopts*.rpm"
#   - "ntpdate*.rpm"
#   - "ntp*.rpm"
Note: Verify that the date and time are in sync on the Lightbits storage servers and the Ansible installation host.
You can run date simultaneously on all servers to check.
Setting use_lightos_kernel to false ensures that the kernel already on the servers remains as is.
If a GA-based release is used, remove that line or set it to true:
use_lightos_kernel: true
Setting use_lightos_kernel to true installs the Lightbits-supplied kernel, which is a requirement for Lightbits GA
releases.
$ cd ~/light-app
$ ansible all -i ansible/inventories/cluster_example/hosts -m command -a id
The expected output for each machine is its “id” output, which should return the username and groups.
The following is an example of a good output:
server00 | CHANGED | rc=0 >>
uid=0(root) gid=0(root) groups=0(root)
server01 | CHANGED | rc=0 >>
uid=0(root) gid=0(root) groups=0(root)
server02 | CHANGED | rc=0 >>
uid=0(root) gid=0(root) groups=0(root)
Any output other than an error is a good output. If there are connection issues, verify that SSH is properly set up and
make sure that the ansible_ssh_user and ansible_ssh_pass are properly configured.
After the above command is successful, test that the privileged user can access the machines over SSH. Note that if the
ansible_ssh_user is root, you can skip this final verification.
$ cd ~/light-app
$ ansible all -i ansible/inventories/cluster_example/hosts -m command -a id -b
Notes:
- The last test uses the ansible_become_user from the hosts file, which is usually root. It tests connecting to each machine via ansible_ssh_user and ansible_ssh_pass, and raising the privilege to the ansible_become_user using the sudo password, which is configured via ansible_become_pass.
- If key-based authentication is used instead and you get a connectivity error, make sure that it is properly configured; a minimal example follows, and Using SSH-Key Authentication has further details.
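A minimal key-based setup from the Ansible installation host might look like the following sketch (assuming the root user and the illustrative management IPs from the planning table):

$ ssh-keygen -t rsa -b 4096          # accept the defaults if you have no existing key
$ for ip in 192.168.16.22 192.168.16.23 192.168.16.24; do ssh-copy-id root@"$ip"; done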
Defining Configuration Files for Each “Ansible Host” (Server) in the Cluster
Return to the ~/light-app/ansible/inventories/cluster_example directory you created in Inventory Structure and
Adding the Ansible Hosts File.
~/light-app/ansible/inventories/
|-- cluster_example
    |-- group_vars
    |   |-- all.yml
    |-- hosts
    |-- host_vars
        |-- client00.yml   <- This file can be ignored or deleted.
        |-- server00.yml
        |-- server01.yml
        |-- server02.yml
From this path we will edit each of the .yml files found in the ~/light-app/ansible/inventories/cluster_example/host_vars
subdirectory. In our example cluster, we have three Lightbits storage nodes that are defined by the files:
• host_vars/server00.yml
• host_vars/server01.yml
• host_vars/server02.yml
1. In each of the host variable files, update the following required variables:
Required Variables for the Host Variable File
name: The cluster server’s name, for example serverXX. Must match the filename (without the extension) and the server names configured in the “hosts” file.

instanceID: The configuration parameters for the logical node in this server. Currently, Lightbits supports up to two logical nodes per server.

ec_enabled: (per logical node) Enables Erasure Coding (EC), which protects against SSD failure within the storage server by preventing IO interruption. Normal operation continues during reconstruction when a drive is removed.

failure_domains: (per logical node) The servers sharing a network, power supply, or physical location that are negatively affected together when network, power, cooling, or other critical services experience problems. Different copies of the data are stored in different FDs to keep data protected from various failures. To specify the servers in the FD, you must add the server names. For further information, see Defining Failure Domains.

data_ip: (per logical node) The data IP used to connect to other servers. Can be IPv4 or IPv6.

storageDeviceLayout: (per logical node) Sets the SSD configuration for a node. This includes the number of initial SSD devices, the maximum number of SSDs allowed, allowance for NUMA across devices, and memory partitioning and total capacity. For further information, see Setting the SSD Configuration.

initialDeviceCount: The number of NVMe drives this instance will initially use.

maxDeviceCount: The maximum number of NVMe drives the instance can support. Commonly configured equal to initialDeviceCount or higher.

allowCrossNumaDevices: Leave this set to “false” if all of the NVMe drives allocated to this instance are in the same NUMA. Set it to “true” if this instance needs cross-NUMA communication to access its NVMe drives.

deviceMatchers: Determines which NVMe drives will be considered for data and which will be ignored. For example, if the OS drive is an NVMe drive, it can be ignored using the name option. The default settings work well by only selecting NVMe drives larger than 300 GiB and without partitions as data drives.
Note: The following is an example for three Lightbits servers in a cluster with a single client.
server01.yml
name: server01
nodes:
- instanceID: 0
  data_ip: 10.10.10.101
  failure_domains:
  - server01
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 6
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
server02.yml
name: server02
nodes:
- instanceID: 0
  data_ip: 10.10.10.102
  failure_domains:
  - server02
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 6
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
Notes:
- See Host Configuration File Variables for the entire list of variables available for the host variable files.
- You can also reference additional host configuration file examples.
- Typically the servers should already be configured with the data_ip. However, the Ansible playbook can configure the data NIC IP; for that you will need to add a data_ifaces section with the data interface name. For further information, see Configuring the Data Network. Section 4.4.9 also shows an example of this configuration.
- If you need to create a separate partition for etcd data on the boot device, see etcd Partitioning.
- Based on the placement of SSDs in the server, check if you need to make a change in the client profile to permit cross-NUMA devices.
- Starting from version 3.1.1, the data IP can be IPv6. For example: data_ip: 2600:80b:210:440:ac0:ebff:fe8b:ebc0
name: server00
data_ifaces:
- bootproto: static
  conn_name: ens1
  ifname: ens1
  ip4: 10.10.10.100/24
nodes:
- instanceID: 0
  data_ip: 10.10.10.100
  failure_domains:
  - server00
  - rack00
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 6
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
Server01’s failure_domains array is configured with its own server name and the rack it is placed in, “rack00”.
name: server01
data_ifaces:
- bootproto: static
  conn_name: ens1
  ifname: ens1
  ip4: 10.10.10.101/24
nodes:
- instanceID: 0
  data_ip: 10.10.10.101
  failure_domains:
  - server01
  - rack00
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 6
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
Make a note of the items in both the server00 and server01 failure_domains arrays.
Since both servers share the same “rack00” label, volume replicas will not be shared between these two servers (and their
nodes).
If the lists were left at the default, volume replicas could be placed across the two servers. Default means that server00’s
failure_domains array contains only “server00”, and server01’s failure_domains array contains only “server01”.
Notes:
- As a minimum (and a good default configuration), configure failure_domains with the server names. Add other items, such as “rack00” above, to help control the placement of volume replicas.
- In a dual instance/node setup, volume replicas will not land on the other node of the same server.
- See Host Configuration File Variables for the entire list of variables available for the host variable files.
Notes:
- The configurations above have a data_ifaces section for each server configuration. Typically this section is not included, as the servers should be preconfigured with their data IPs. However, we can instruct Ansible to configure the data IPs during the Lightbits installation; the data_ifaces section tells Ansible to configure the IP and subnet on the specified interface.
- For IPv6 addresses, use the ‘ip6: ip/prefix’ format. For example: ip6: 2001:0db8:0:f101::1/64.
- The address used in the ip4 or ip6 field must match the address used in data_ip. The only difference is that ip4 and ip6 also include the subnet or prefix, while data_ip contains only the address without the subnet or prefix.
Note: If Erasure Coding is enabled (ec_enabled: true), you must have a minimum of six SSDs installed in that
node.
To specify the SSD configuration for a node, you must enter a value for the total drive slots available for your Lightbits
node in the host configuration file, as follows:
name: server00
data_ifaces:
- bootproto: static
  conn_name: ens1
  ifname: ens1
  ip4: 10.10.10.100/24
nodes:
- instanceID: 0
  data_ip: 10.10.10.100
  failure_domains:
  - server00
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 6
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
Note: See Host Configuration File Variables for the entire list of variables available for the host variable files.
Run ls and make sure you see the following files and folders.
ansible
ansible.cfg
playbooks
plugins
roles
$ tree ansible
ansible
└── inventories
    └── cluster_example
        ├── group_vars
        │   └── all.yml
        ├── host_vars
        │   ├── client_0.yml
        │   ├── server00.yml
        │   ├── server01.yml
        │   └── server02.yml
        └── hosts
#   Installation Steps
1   Connecting your installation workstation to Lightbits’ software repository
2   Verifying the network connectivity of the servers used in the cluster
3   Setting up an Ansible environment on your installation workstation
4   Installing a Lightbits cluster by running the Ansible installation playbook
5   Updating clients (if required)
6   Provisioning storage, connecting clients, and performing IO tests
As discussed in Prepare Installation Workstation (Ansible Controller), we support installing using Ansible or using a
prebuilt Docker image that contains Ansible. Pick the method that applies to your installation environment and follow
the commands to install the Lightbits cluster software on the storage servers. Afterwards, go to the bottom of the section
to confirm a successful installation.
Note: For both methods, we provide the simple default installation commands. Other, more advanced
installation configuration examples are provided in the Ansible Docker section. These same examples can be
adapted to the Ansible method; simply skip the Docker commands and use the ansible-playbook commands as
the template.
Note: The Ansible playbook operations below can take several minutes. The output will report the status of all
the tasks that succeeded/failed on the nodes.
To install the cluster software and configure the cluster, change into the light-app directory with cd ~/light-app, and enter
the following command to run the playbook:
$ ansible-playbook -i ansible/inventories/cluster_example/hosts playbooks/deploy-lightos.yml -vvv
Notes:
- This command must be run from the directory where light-app was extracted, so that all of the paths work as displayed.
- The inventory file points to a “hosts” file, which instructs Ansible where to deploy Lightbits.
- The selected playbook, “deploy-lightos.yml”, instructs Ansible how to install and configure the Lightbits cluster on the servers listed in the “hosts” file.
- Ansible logs to the default path specified by ansible.cfg, which is /var/log/ansible.log by default. The log path can be changed by prefixing the command: ANSIBLE_LOG_PATH=/var/log/ansible.log ansible-playbook ...
- The following files will be created in the home directory: lightos-system-jwt and lightos-default-admin-jwt.
- Certificates used by the cluster will be saved into a new directory, lightos-certificates, created in the home directory.
- It is recommended to make a secure backup of this content, or at a minimum, the JWT files and the lightos-certificates directory.
- Debug-level verbosity is enabled with -vvv. It helps diagnose any issues if they occur.
When the installation is done, the cluster will be bootstrapped with a system-scope project, and you will need access to the
JWT. By default, the cluster-admin JWT is placed in ~/lightos-system-jwt on the Ansible host. This path can be changed
by editing group_vars/all.yml before running the ansible-playbook and appending this variable: system_jwt_path: "{{ '~/lightos-system-jwt' | expanduser }}"
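For example, to make the JWT available to lbcli after the installation (a minimal sketch; adjust the path if you changed system_jwt_path):

$ export LIGHTOS_JWT=$(cat ~/lightos-system-jwt)
$ lbcli -J $LIGHTOS_JWT get cluster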
We will pre-create the /opt/lightos-certificates directory, so that our certificates get saved outside of the container.
Command breakdown (a consolidated example follows this list):
• Mount the host’s /opt/lightos-certificates to the container’s /lightos-certificates to store generated certificates on the host. Docker will create the /opt/lightos-certificates directory on the host if it is missing.
• Mount the current working directory ($PWD) to /ansible inside the container, to have access to the playbook and roles. The current working directory at this point will be where light-app was extracted.
• Set the WORKDIR to /ansible inside the container. This sets the current working directory within Docker to /ansible.
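Putting the above together, the invocation looks roughly like the following sketch. The image name and tag are placeholders; use the Lightbits-provided Ansible image and your own inventory path:

$ mkdir -p /opt/lightos-certificates
$ cd ~/light-app
$ docker run --rm -it \
    -v /opt/lightos-certificates:/lightos-certificates \
    -v "$PWD":/ansible \
    -w /ansible \
    <lightbits-ansible-image>:<tag> \
    ansible-playbook -i ansible/inventories/cluster_example/hosts playbooks/deploy-lightos.yml -vvv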
Note: For information on installing Red Hat, see Red Hat Linux Installation.
Notes: - The “failed=0” indicates that the installation finished without errors.
- If the installation process failed, see Recovering from Cluster Installation Failure.
The installation flow is now complete, and you can move on to the client configuration sections of the Installation Guide.
Note: You should also make sure you back up your installation files properly. For more, see Lightbits Software
Installation Planning.
Note: The contents of a JWT are long and all are on a single line.
2. Log in to any Lightbits server and paste the contents into the shell. The JWT will now be available via the
$LIGHTOS_JWT environment variable.
3. Check the state of the servers, nodes, and cluster.
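For example, these checks can be run with lbcli using the system-scope JWT (a minimal sketch; see the Administration Guide for the full set of status commands):

$ lbcli -J $LIGHTOS_JWT list nodes
$ lbcli -J $LIGHTOS_JWT get cluster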
The servers look healthy as they all state “NoRiskOfServiceLoss”:
At this point the cluster’s health has been confirmed at the node, server, and cluster level.
Note: The Linux Repo File Customer TOKEN section in your Lightbits Installation Customer Addendum has
the TOKEN that is required to install the Lightbits NVME-Client-DEBs.
Note: All of the required parameters for the curl command above will be in your Lightbits Installation
Customer Addendum. This includes the TOKEN, path, and GPG KEY fingerprint.
Notes: - Token and path are provided via the Customer Addendum.
- Replace “xenial” in the URL with the correct codename of your Ubuntu OS.
- Run lsb_release -a to verify your codename.
4. Edit the repository file to point at the correct GPG key, or force trust.
Correct GPG Key:
By default, the created lightos.list repo file points to an incorrect path for the GPG key, so running apt-get update
at this point will fail.
First, confirm the correct GPG key path by running apt-key list.
Locate the Lightbits key. It should sit in the /etc/apt/trusted.gpg file.
Edit the repo file /etc/apt/sources.list.d/lightos.list and replace [signed-by=/path/to/key] with the correct path,
[signed-by=/etc/apt/trusted.gpg].
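After the edit, apt-get update should complete without GPG errors:

$ apt-get update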
Force Trust:
If you want to bypass GPG verification instead, edit the /etc/apt/sources.list.d/lightos.list file and replace
[signed-by=/path/to/key] with [trusted=yes] after the deb and deb-src parts:
$ cat /etc/apt/sources.list.d/lightos.list
deb [trusted=yes] https://dl.lightbitslabs.com/<TOKEN>/lightos-3-<Minor Ver>-x-ga/deb/ubuntu xenial main
deb-src [trusted=yes] https://dl.lightbitslabs.com/<TOKEN>/lightos-3-<Minor Ver>-x-ga/deb/ubuntu xenial main
Note: You can copy the repo file content from an installed Lightbits server’s file, /etc/yum.repos.d/lightos.repo.
It will have the correct token and baseurl.
Note: An optional Ansible playbook is available to you that performs the following:
- Installs kernel v5.x, which includes the nvme-tcp upstream driver.
- Creates a small 4GB volume with a replication factor of 2.
- Runs the nvme connect command to connect the client machine to the cluster volume.
- Runs an fio read/write workload for 30 seconds.
- Performs a cleanup that disconnects the nvme client and removes the volume.
For more information about using this optional playbook, see Automated Client Connectivity Verification.
Notes:
- Before proceeding with the installation, you must have the GNU Wget software installed. You can download it from https://www.gnu.org/software/wget/.
- You can use any kernel version v5.3.5 or above, as noted in the following instructions.
- The instructions below are only for CentOS 7.9. Updating the kernel varies for different OSs. Refer to the official OS documentation for how to upgrade the kernel.
To install the latest kernel on CentOS 7.9, perform the following steps.
1. Update the yum repo:
yum update
2. Install elrepo for CentOS 7.9:
yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
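The steps that install the new kernel and list the available boot entries can be done with the standard elrepo and grubby tooling, sketched below (a hedged example; your exact commands may differ):

yum --enablerepo=elrepo-kernel install -y kernel-ml
grubby --info=ALL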
3. Identify the new kernel index in the output list of the command above. In the following example, the new kernel
has an index value of 0 because it is at the top of the list of available kernels.
index=0
kernel=/boot/kernel-ml-5.4.11-1.el7.elrepo.x86_64
args="ro crashkernel=auto rd.lvm.lv=CentOS_rack05-server67/root rd.lvm.lv=CentOS_rack05-server67/swap rhgb quiet LANG=en_US.UTF-8"
root=/dev/mapper/CentOS_rack05--server67-root
initrd=/boot/initramfs-kernel-ml-5.4.11-1.el7.elrepo.x86_64.img
title=CentOS Linux (kernel-ml-5.4.11-1.el7.elrepo.x86_64) 7 (Core)
index=1
kernel=/boot/vmlinuz-3.10.0-957.el7.x86_64
args="ro crashkernel=auto rd.lvm.lv=CentOS_rack05-server67/root rd.lvm.lv=CentOS_rack05-server67/swap rhgb quiet LANG=en_US.UTF-8"
root=/dev/mapper/CentOS_rack05--server67-root
initrd=/boot/initramfs-3.10.0-957.el7.x86_64.img
title=CentOS Linux (3.10.0-957.el7.x86_64) 7 (Core)
index=2
kernel=/boot/vmlinuz-0-rescue-9758554168974f5dbe0d6dac5a6ac621
args="ro crashkernel=auto rd.lvm.lv=CentOS_rack05-server67/root rd.lvm.lv=CentOS_rack05-server67/swap rhgb quiet"
root=/dev/mapper/CentOS_rack05--server67-root
initrd=/boot/initramfs-0-rescue-9758554168974f5dbe0d6dac5a6ac621.img
title=CentOS Linux (0-rescue-9758554168974f5dbe0d6dac5a6ac621) 7 (Core)
index=3
non linux entry
4. Use the following command to set the default kernel index value.
In this example, the new kernel's grub entry index is 0, so we set the default index to 0. This makes the OS boot from this
kernel on the next boot.
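For example, with grubby (a hedged sketch; index 0 corresponds to the new kernel in the listing above):

grubby --set-default-index=0
reboot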
7. After the client reboots, you must log in and verify that the client is now running from the new kernel using the
Linux command uname -r.
For example:
$ uname -r
5.4.11-1.el7.elrepo.x86_64
WARNING: apt does not have a stable CLI interface. Use it with caution in scripts.
3. (Optional) If the command returns this value, you need to delete the NVMe CLI package from your system with the following command:
$ apt-get remove nvme-cli
4. With the public NVMe CLI version deleted from the system, you can install the NVMe CLI from the Lightbits repository by entering the following in the system's command shell:
$ apt-get install nvme-cli
5. Enter the following command to verify that the NVMe CLI version is v1.9.1.
$ apt list --installed | grep nvme-cli
WARNING: apt does not have a stable CLI interface. Use it with caution in scripts.
Note: These instructions will work for any Lightbits client-side deb package that you want to install on your client.
1. (Optional) If a public NVMe CLI version is installed on your system, you can replace it with the NVMe CLI version supplied by Lightbits. Before installing the supplied NVMe CLI from the Lightbits repository, you will need to remove the public NVMe CLI from your system.
To check if you have an NVMe CLI package installed, enter the following in the system's command shell:
$ apt list --installed | grep nvme-cli
WARNING: apt does not have a stable CLI interface. Use it with caution in scripts.
2. (Optional) If the command returns this value, you need to delete the NVMe CLI package from your system with the following command:
$ apt-get remove nvme-cli
3. With the public NVMe CLI version deleted from the system, you can install the NVMe CLI from the Lightbits repository by entering the following in the system's command shell:
$ apt-get install nvme-cli
4. Enter the following command to verify that the NVMe CLI version is v1.9-1.
$ apt list --installed | grep nvme-cli
WARNING: apt does not have a stable CLI interface. Use it with caution in scripts.
The output for this command can include additional package names with the nvme string.
Make the setting boot persistent by loading the module on boot with this setting:
$ echo nvme_tcp > /etc/modules-load.d/nvme_tcp.conf
Multipath
By default, multipath should be enabled with the nvme_core module.
However, you can run the following command to check:
$ grep -r "" /sys/module/nvme_core/parameters
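To find the full path of the default kernel for the next step, grubby can be used (a hedged example; other tools work as well):

$ grubby --default-kernel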
The output should show the full path of the default kernel in the format /boot/vmlinuz-....
Now configure the kernel boot arguments to enable multipath. Make sure to put the full path of the default kernel
into the command below:
$ grubby --args=nvme_core.multipath=Y --update-kernel /boot/vmlinuz-...
Note: grubby is available out of the box on Red Hat and CentOS-based flavors. For distributions that do not have
grubby, use the next method.
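One common alternative for such distributions (an assumption here, not prescribed by this guide) is to set the module option through a modprobe configuration file, and then continue with the initramfs update described next:
$ echo "options nvme_core multipath=Y" > /etc/modprobe.d/nvme_core.conf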
Then, update the initramfs, which the OS uses to load and configure modules on boot. Use the appropriate tool for the OS:
* On Red Hat/CentOS, run dracut -f.
* On Debian/Ubuntu systems, run update-initramfs -u.
Reboot
It is recommended to reboot the client to make sure that all of the settings are loaded properly. Make sure that the
nvme_tcp modules are loaded on boot and that multipath is enabled.
$ lsmod | grep nvme; grep -r "" /sys/module/nvme_core/parameters
nvme_tcp 24576 0
nvme_fabrics 20480 1 nvme_tcp
nvme_core 49152 4 nvme_fabrics,nvme_tcp
...
/sys/module/nvme_core/parameters/multipath:Y
...
#  Installation Step
1  Connecting your installation workstation to Lightbits’ software repository
2  Verifying the network connectivity of the servers used in the cluster
3  Setting up an Ansible environment on your installation workstation
4  Installing a Lightbits cluster by running the Ansible installation playbook
5  Updating clients (if required)
6  Provisioning storage, connecting clients, and performing IO tests
With the Lightbits software installed and the Lightbits management service running, you can create a volume and connect
that volume to your application clients.
Sample Command
$ lbcli -J $LIGHTOS_JWT create volume --size="2 GiB" --name=vol1 --acl="acl3" --replica-count=3 --project-name=default
Note: By default, the LIGHTOS_JWT is generated during the Lightbits installation on the Ansible installa-
tion host, and is saved to ~/lightos-system-jwt. See Post Installation Steps for an example of how to get
LIGHTOS_JWT.
Sample Output
Name  UUID                                  State     Size     Replicas  ACL
vol1  76c3eae8-7ade-4394-82e5-056d05a92b5e  Creating  2.0 GiB  3         values:"acl3"
This example command creates a volume with 2 GiB of capacity, an Access Control List (ACL) string “acl3”, and a
replication factor of 3.
Note: Only clients that mention the ACL value of “acl3” during connect can connect to this volume. This is
detailed in Connecting the Cluster Client to Lightbits.
Sample Command
$ ping -c 1 10.10.10.100
Sample Output
PING 10.10.10.100 (10.10.10.100) 56(84) bytes of data.
64 bytes from 10.10.10.100: icmp_seq=1 ttl=255 time=0.032 ms
This output indicates that this application client has a connection to the data NIC IP address on the Lightbits storage
server where volumes were created.
Repeat this ping check for the other Lightbits cluster servers: 10.10.10.101 and 10.10.10.102.
After you have checked the TCP/IP connectivity between your application client and the Lightbits storage servers, use
the nvme CLI utility to connect the application client via NVMe/TCP to the Lightbits storage server.
To use the nvme CLI utility on your application client, you will need the following details.
Enter the lbcli get cluster command on any Lightbits storage server to identify the subsystem NQN.
Sample Command
$ lbcli -J $LIGHTOS_JWT get cluster -o yaml
Note: By default the LIGHTOS_JWT is generated during the Lightbits installation on the ansible installation
host and is saved to ~/lightos-system-jwt. Post Installation Steps shows one way to get LIGHTOS_JWT.
Sample Output
UUID: 95a251b6-0885-4f5b-a0eb-90e90a2009a3
currentMaxReplicas: 3
...
subsystemNQN: nqn.2014-08.org.nvmexpress:NVMf:uuid:b5fe744a-b919-465a-953a-a8a0df7b9d31   <--- subsystem NQN
supportedMaxReplicas: 3
Enter the lbcli list nodes command to identify the NIC IP address and TCP port.
Sample Command
$ lbcli -J $LIGHTOS_JWT list nodes
Sample Output
With the IP, port, subsystem NQN and ACL values for the volume, you can execute the nvme connect command to
connect to all of the nodes in the cluster.
Notes: - We are using an ACL value/hostnqn of “acl3”, so that we can connect to the volume created, as detailed
in Creating a Volume on the Lightbits Storage Server.
- Use the client procedure for each node in the cluster. Remember to use the correct NVME-Endpoint for each
node.
- Using the --ctrl-loss-tmo -1 flag allows for infinite attempts to reconnect nodes, and prevents a timeout from
occurring when attempting to connect with a node in a failure state.
- Starting from version 3.1.1, the data IP can be IPv6.
- See the discovery-client documentation in the Lightbits Administration Guide. Like nvme connect, the discovery-client
can connect to NVMe over TCP volumes; it can also monitor the nodes and, if new nodes or paths are created
or removed, maintain those connections automatically.
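As a hedged example only, using the subsystem NQN, data IP, and port shown in this guide together with the "acl3" hostnqn (substitute the values for your cluster, and repeat the command once per node's NVME-Endpoint):
$ nvme connect -t tcp -a 10.10.10.100 -s 4420 -n nqn.2014-08.org.nvmexpress:NVMf:uuid:b5fe744a-b919-465a-953a-a8a0df7b9d31 -q acl3 --ctrl-loss-tmo -1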
#  Installation Step
1  Connecting your installation workstation to Lightbits’ software repository
2  Verifying the network connectivity of the servers used in the cluster
3  Setting up an Ansible environment on your installation workstation
4  Installing a Lightbits cluster by running the Ansible installation playbook
5  Updating clients (if required)
6  Provisioning storage, connecting clients, and performing IO tests
Each /dev/nvmeX is a successful NVMe over TCP connection to a server in the cluster. When the optimized path is
connected, a block device is created with the name /dev/nvmeXnY, which can then be used as any block device (create
fs on top of it and mount it).
If you see a multipath error (with the nvme block devices showing up as 0 byte, or each replica/nvme connection showing
up as a separate nvme block device), refer to the Lightbits Troubleshooting Guide, or contact Lightbits Support.
After you have entered the nvme connect command, you can confirm the client’s connection to Lightbits by entering
the nvme list command. This will list all of the NVMe block devices. For more information on each connection’s
multipathing, you can use nvme list-subsys, which will list all of the NVMe character devices.
Note: The nvme list and lsblk commands will show the NVMe block device that is created upon a successful
connection. It will be of the format nvme0n1. The nvme list-subsys command will list all of the paths that make up these
block devices; these paths appear as character devices. So from the output below we can conclude that block device
nvme0n1 is made of three character devices: nvme0, nvme1, and nvme2. When we need to interact with the block device
- for example, to create a filesystem and mount it - we interact with the block device, nvme0n1, and not the
character devices (nvme0, nvme1, and nvme2).
Sample Command
$ nvme list-subsys
Sample Output
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress:NVMf:uuid:b5fe744a-b919-465a-953a-a8a0df7b9d31
\
+- nvme0 tcp traddr=10.10.10.100 trsvcid=4420 live
+- nvme1 tcp traddr=10.10.10.101 trsvcid=4420 live
+- nvme2 tcp traddr=10.10.10.102 trsvcid=4420 live
Next, review your connected block devices to see the newly connected NVMe/TCP block device using the Linux lsblk
command.
Sample Command
$ lsblk
Sample Output
NAME              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1           259:1    0     2G  0 disk
sdb                 8:16   0 223.6G  0 disk
|-sdb2              8:18   0 222.6G  0 part
| |-CentOS00-swap 253:1    0  22.4G  0 lvm  [SWAP]
| |-CentOS00-home 253:2    0 150.2G  0 lvm  /home
| |-CentOS00-root 253:0    0    50G  0 lvm  /
|-sdb1              8:17   0     1G  0 part /boot
sda                 8:0    0 111.8G  0 disk
A new nvme0n1 block device with 2GB of storage is identified and available.
To determine which node in the cluster is the primary and which is the secondary for this block device, enter the nvme
list-subsys command with the block device name.
Sample Command
$ nvme list-subsys /dev/nvme0n1
Sample Output
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress:NVMf:uuid:b5fe744a-b919-465a-953a-a8a0df7b9d31
\
+- nvme0 tcp traddr=10.10.10.100 trsvcid=4420 live optimized
+- nvme1 tcp traddr=10.10.10.101 trsvcid=4420 live inaccessible
+- nvme2 tcp traddr=10.10.10.102 trsvcid=4420 live
In the output, an optimized status identifies the primary node, and an inaccessible status identifies a secondary node. In this
case we can see that server 10.10.10.100 is the primary node with the optimized path. All of the IO from the client will
go to 10.10.10.100. The cluster then replicates the data to the other nodes.
Troubleshooting
Note: For additional troubleshooting-related information, see the Lightbits Troubleshooting Guide, or contact
Lightbits Support.
To avoid this error, you need to disable StrictHostKeyChecking in the /etc/ssh/ssh_config, or log into each node
from your installation workstation at least once.
By default, StrictHostKeyChecking is enabled in the file /etc/ssh/ssh_config. You can disable it by uncommenting the
entry in ssh_config and setting it to:
StrictHostKeyChecking no
Or, you can leave StrictHostKeyChecking enabled, log into each node from the installation workstation, and
answer "yes" to permanently add each host to the known hosts file.
The first time you SSH from one server to another the following SSH exchange occurs:
$ ssh [email protected]
The authenticity of host '192.168.16.22 (192.168.16.22)' can't be established.
ECDSA key fingerprint is SHA256:zouTZEZF2oUXfIGpnvWutrOR4/fBnd5ARqXNJj0iqD0.
ECDSA key fingerprint is MD5:7d:0f:0a:3f:27:08:2e:66:93:ae:f5:08:c8:13:23:af.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.16.22' (ECDSA) to the list of known hosts.
[email protected]'s password:
Last login: Wed Nov 13 19:06:13 2019 from cluster-manager
[root@node00 ~]#
So, by logging into all the servers at least once from your installation workstation before you run the Ansible playbook,
there will be no issues using the sshpass method.
Note: If the LVM lvs command reports anything other than "CentOS" for the Volume
Group name used for the Linux OS file system, you will need to specify the exact name in the ~/light-
app/ansible/inventories/cluster_example/host_vars file for that node. For more information, see the
etcd_vg_name variable description in the Host Configuration File Variables list.
In this example, the LinuxOS was installed onto a 118 GB drive and the entire amount is allocated. You can resize the
home LVM by 20 GB to free up some space.
To resize this file system, you need to:
1. Move any files you have in the /home file system to a safe location.
2. Unmount, resize, and recreate the file system.
In this example, the LinuxOS is installed on device “sda” and on partition sda1 with 119.2 GB of space available. It is
possible to take 20 GB away from home to free up some space and still have over 44 GB remaining.
1. Run the mount command and record the current mount path for home.
$ mount
/dev/mapper/CentOS_lightos--c3-home on /home type xfs (rw,relatime,attr2,inode64,noquota)
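The intermediate unmount/resize/recreate steps are not shown above. A minimal sketch, assuming an XFS /home on the CentOS_lightos-c3 volume group and a 20 GB reduction (XFS cannot be shrunk in place, so the file system is recreated; lvreduce will prompt for confirmation):
$ umount /home
$ lvreduce -L -20G /dev/mapper/CentOS_lightos--c3-home
$ mkfs.xfs -f /dev/mapper/CentOS_lightos--c3-home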
4. Remount home.
$ mount /dev/mapper/CentOS_lightos--c3-home /home
Note: For information on installing Red Hat, see Red Hat Linux Installation.
Note: Replace <server_name> with the name of the server to be removed, as listed in the hosts file. It will typically be of the
form server00, server01, etc.
Reconfigure command:
ansible-playbook -i ansible/inventories/cluster_example/hosts playbooks/configure-lightos-playbook.yml
The following is important for better understanding cleanup and configure:
* When a Lightbits installation is done via the deploy-lightos playbook - as described in Running the Ansible Installation Playbook to Install Lightbits Cluster Software - it runs two playbooks in order. First it runs the install playbook, which installs all of the Lightbits dependencies and packages and does a reboot. Then it runs the configure playbook, which sets up all of the Lightbits services.
* The cleanup playbook removes all of the Lightbits configurations. It does not uninstall any of the packages that were installed.
* The configure playbook does not install any Lightbits packages; it simply reconfigures them. Only run this if you are certain that the deploy-lightos playbook ran through the install playbook on all servers; otherwise, use the deploy-lightos playbook as described in Running the Ansible Installation Playbook to Install Lightbits Cluster Software.
The output of each Lightbits server is saved into a dated directory inside /tmp/ on the Ansible host.
Notes: - This can be run against servers that have Lightbits installed, as well as servers that do not have Lightbits
installed.
- To properly gather logs from Lightbits servers, the playbook will depend on the jwt being in the ~/lightos-system-jwt file.
Note/Caution: Do not run this on servers that are active or show up in a Lightbits cluster. Only run this against
servers that need Lightbits removed; otherwise, the cluster state will be in danger. All of the commands run on
the servers must be run with the highest privilege (root).
1. From the Ansible host, run the cleanup playbook to unconfigure the server.
ansible-playbook -i ansible/inventories/cluster_example/hosts playbooks/cleanup-lightos-playbook.yml -t cleanup --limit <server_name>
Note: Make sure <server_name> matches the name of the server from which Lightbits must be removed. Usually it will be of the
form "server00", "server01", etc.
3. Uninstall Lightbits packages with these steps from the Lightbits server.
First, find out the version of Lightbits that is installed.
lbcli version | awk '{print $NF}'
The output will be like “3.0.1~b1004”. This is the format used for the next steps.
You can also see the latest installed rpms using rpm -qa --last, which will list the latest installed packages at the top
of the list.
Extract the version number in a format similar to this: “A.B.C~bD”. In this example, it is 3.0.1~b1004. But each case
could be different.
rpm -qa | grep 3.0.1~b1004
Now you can manually remove each package with yum remove PKG -y or rpm -e PKG. However, if all of the packages
look Lightbits-related, then run the next command and it will uninstall them.
Important: Make sure to replace 3.0.1~b1004 with your version.
bash <(echo "(set -xeu"; rpm -qa | grep 3.0.1~b1004 | xargs -I % echo yum remove % -y; echo ")")
4. To fully uninstall Lightbits from a server, note that some releases (GA releases) install a specific kernel during deploy,
so it is recommended to uninstall it and set another kernel as the default. Refer to your OS documentation on how
to uninstall a kernel and set another kernel as default.
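As a hedged illustration only (the exact kernel package and version names differ per release), the general pattern on a RHEL/CentOS-based system is to pick another installed kernel as the default and then remove the unwanted one:
$ rpm -qa 'kernel*'                              # list the installed kernels
$ grubby --set-default /boot/vmlinuz-<desired-version>
$ yum remove kernel-<unwanted-version> -y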
Note: The Lightbits Red Hat releases do not need to run this step. Instead, edit /etc/yum.conf and remove or
comment out the line that shows exclude=redhat-release* kernel* kmod-kvdo*.
Appendixes
The following sections provide additional information to help you complete the Lightbits installation.
Notes: - The user must provide the etcd volume group name in the etcd_vg_name variable, and confirm that
there is enough server space to create a new logical volume. The default logical volume name (etcd_lv_name) is
"etcd" and the default volume size (etcd_lv_size) is 10GB.
- If there is not enough space on the server, the user must reduce the other logical volume sizes before the cluster
software installation to allocate the required space. For more details, see https://www.rootusers.com/lvm-resize-how-to-decrease-an-lvm-partition.
---
data_ifaces:
- bootproto: static
  conn_name: ens1
  ifname: ens1
  ip4: 10.10.10.100/24
nodes:
- instanceID: 0
  data_ip: 10.10.10.100
  failure_domains:
  - server00
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 6
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
Note: The example above shows the format for setting IPv4 addresses using ip4: ip/subnet. IPv6 addresses
can be set using ip6: ip/prefix.
By default, the playbook inspects the remote machine and determines the directory containing the specific configuration for
Duroslight and the backend services (the datapath configuration, excluding the node-manager configuration), using the following
naming logic:
<system_vendor>-<processor_count>-processor-<processor_cores>-cores
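For instance, a hypothetical dual-socket server with 24-core processors and the system vendor "Dell Inc." would map to a directory name along these lines (illustrative only; the actual name is whatever the playbook derives from the facts it gathers):
Dell Inc.-2-processor-24-cores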
---
nodes:
- instanceID: 0
  data_ip: 10.10.10.100
  failure_domains:
  - server00
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 6
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
  datapath_config: custom-datapath-config
Example 5: Use the Linux Volume Manager (LVM) Partition for etcd Data
Host configuration with custom lvm partition for etcd data.
---
nodes:
- instanceID: 0
  data_ip: 10.10.10.100
  failure_domains:
  - server00
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 6
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
use_lvm_for_etcd: true
etcd_lv_name: etcd
# etcd_settings_user:
etcd_lv_size: 15GiB
etcd_vg_name: centos
Note: For information on installing Red Hat, see Red Hat Linux Installation.
In case the cluster is homogeneous and we want to apply the same override to all nodes, we can provide a single setting
in the groups/all.yml file or via the command line with:
ansible-playbook -i ansible/inventories/cluster_example/hosts playbooks/deploy-lightos.yml -e profile_generator_overrides_dir=/tmp/overrides.d
In the above example, we specify profile_generator_overrides_dir which is a directory on the Ansible Controller that
will be copied to the target machine.
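Assuming the same override directory, the equivalent single setting in the groups/all.yml file would be something like:
profile_generator_overrides_dir: /tmp/overrides.d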
server00.yml
---
name: server00
nodes:
- instanceID: 0
  data_ip: 172.16.10.10
  failure_domains:
  - server00
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 12
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
- instanceID: 1
  data_ip: 172.16.20.10
  failure_domains:
  - server00
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 12
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
server01.yml
---
name: server01
nodes:
- instanceID: 0
  data_ip: 172.16.10.11
  failure_domains:
  - server01
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 12
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
- instanceID: 1
  data_ip: 172.16.20.11
  failure_domains:
  - server01
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 12
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
Note: Unlike the dual instance configuration in the example above, which had a unique data IP per instance, here each
instance on a server uses the same IP.
---
name: server00
nodes:
- instanceID: 0
  data_ip: 10.10.10.100
  failure_domains:
  - server00
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 12
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
- instanceID: 1
  data_ip: 10.10.10.100
  failure_domains:
  - server00
  ec_enabled: true
  lightfieldMode: SW_LF
  storageDeviceLayout:
    initialDeviceCount: 12
    maxDeviceCount: 12
    allowCrossNumaDevices: false
    deviceMatchers:
    # - model =~ ".*"
    - partition == false
    - size >= gib(300)
    # - name =~ "nvme0n1"
For example:
[duros_nodes:vars]
source_type=offline
source_etcd_binary="/root/lightos_release/deps/etcd-v3.4.1-linux-amd64.tar.gz"
source_rpms_dir="/root/lightos_release/target_rpms"
source_dependencies_rpms_dir="/root/lightos_release/deps"
dest_dir="/tmp/rpms"
In this example, we have set the playbook to permanently configure interface ens1 with static IP 10.20.20.10.
3. Set a new data interface IP and netmask for the data NIC. In the following example, the card is ens1f0:
$ cat > /etc/sysconfig/network-scripts/ifcfg-ens1f0 <<EOL
DEVICE=ens1f0
NM_CONTROLLED=no
IPADDR=10.20.20.10
NETMASK=255.255.255.224
ONBOOT=yes
BOOTPROTO=static
EOL
4. Toggle the NIC down and then up again by entering the ifdown command, waiting at least 30 seconds, and then
entering the ifup command.
$ ifdown ens1f0
$ ifup ens1f0
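The interface can then be checked, for example with the ip command (an assumed step here; the note below mentions ip -br a as a more compact alternative), producing output like the following:
$ ip addr show ens1f0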
4: ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet 10.20.20.10/27 brd 10.20.20.31 scope global ens1f0
       valid_lft forever preferred_lft forever
Note: Use nmtui (NetworkManager-tui) if NetworkManager is installed, and ip -4 -br a or ip -br a to verify
the ip (for a cleaner view).
etcd Partitioning
Based on your boot device’s write latency performance, you might need to create a separate partition for etcd data on
the boot device. If you have questions about the need to use etcd partitioning, contact Lightbits.
To use etcd partitioning:
1. Confirm that a partition pre-allocated for etcd exists on the node and has at least 10 GB of space.
With the default configuration, the top section of the hosts file lines are configured as below:
server00 ansible_host=rack11-server92 ansible_connection=ssh ansible_ssh_user=root ansible_ssh_pass=light ansible_become_user=root ansible_become_pass=light
As an example, assume that the SSH key for the servers is located at /root/mykey.txt. If so, change the configuration line
to this:
server00 ansible_host=rack11-server92 ansible_connection=ssh ansible_ssh_user=root ansible_ssh_private_key_file=/root/mykey.txt ansible_become_user=root ansible_become_pass=light
Method 1
The latest RPMs are retrieved from the OS repository and installed on the cluster nodes.
Method 2
The specific NTP version required by the customer is installed on the cluster nodes. To use this method:
1. Edit the all.yml file: ~/light-app/ansible/inventories/cluster_example/group_vars/all.yml.
2. Edit or append the following to the all.yml file, using the specific version that you want to install. For example:
ntp_version: ntp-4.2.6p5-29.el7.centos.x86_64
Method 3
The NTP is installed using an offline method.
1. Edit the all.yml file: ~/light-app/ansible/inventories/cluster_example/group_vars/all.yml.
4. The desired NTP packages must be copied to the dest_dir. For more, see Performing an Offline Installation.
Additional Note
In order to ensure NTP client consistency and synchronization with the NTP servers, it is highly recommended to prevent
NetworkManager from updating /etc/resolv.conf. An incorrect configuration of this file could prevent the NTP client from
communicating with the NTP servers, and therefore create time drift between the cluster nodes.
To do this, as the root user, use a text editor to create the /etc/NetworkManager/conf.d/90-dns-none.conf file with the
following content:
[main]
dns=none
Reload the NetworkManager service:
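For example (assuming systemd is managing NetworkManager):
$ systemctl reload NetworkManager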
Optionally, remove the "Generated by NetworkManager" comment from /etc/resolv.conf to avoid confusion.
Note: For information on installing Red Hat, see Red Hat Linux Installation.
Note: It is important that the inventory folder is shared with the cluster inventory folder so that you can
fetch all cluster IPs.
Prerequisite
• docker-ce
Note: These monitoring packages should be installed on host machines, not on the Lightbits target servers.
Note: See Connecting to the Lightbits Software Repository for additional information.
Usage
After installing the Lightbits monitoring RPMs (lightos-monitoring-clustering, lightos-monitoring-images), run the following:
/var/lib/monitoring-images/deploy.sh deploy-clustering
In the Clusters section, change the instance names for your cluster hosts (remove the extra lines in case of a single cluster).
clusters:
  cluster_1:
  - rack01-server01
  - rack02-server02
  - rack03-server03
  cluster_2:
  - rack04-server04
  - rack05-server05
Then run:
/var/lib/monitoring-images/deploy.sh configure-monitor
Outcome
Running the following command lists the containers deployed by the monitoring packages:
docker ps
Integrating Grafana
There are two options for integrating the Grafana reference metrics:
* Manually create the data source for Lightbits Prometheus with the Grafana GUI, and then manually create a dashboard by importing reference metrics.
* Integrate the reference files directly, as shown in the example below:
Merge the data source configuration in monitoring-clustering/configure_grafana/roles/grafana/defaults/main.yml with
the existing data source.
Note that a different version of Grafana may have a different format for the configuration.
You can also easily create a data source manually with the GUI.
[root@localhost ~]# vim /usr/share/grafana/conf/provisioning/datasources/sample.yaml
...
datasources:
- name: Prometheus
  type: prometheus
  url: http://localhost:9090
[root@localhost ~]# vim /usr/share/grafana/conf/provisioning/dashboards/sample.yaml
...
providers:
- name: 'default'
  orgId: 1
  folder: ''
  folderUid: ''
  type: file
  options:
    path: /var/lib/grafana/dashboards
Use the GUI to verify the result. Access the Prometheus GUI using the instructions above, for example:
http://localhost:9090/ or http://monitoring-server:9090/. Note that with the installation above, Grafana and
Prometheus run on the same host.
Integrating Prometheus
To integrate Prometheus, merge the configuration inside the Lightbits reference Prometheus configuration files and the
Lightbits reference configure files, as shown below:
[root@localhost ~]# tree monitoring-clustering/prometheus/
monitoring-clustering/prometheus/
├── alert.rules.yaml
├── prometheus.yml
└── record.rules.yaml
You will need to manually merge contents inside of prometheus.yml with your existing prometheus.yml.
A scrape configuration containing exactly one scrape job. Here, it is the Lightbits (lightos) job:
scrape_configs:
- job_name: lightos
  scheme: http
  scrape_timeout: 25s
  scrape_interval: 30s
  metrics_path: /metrics
  honor_timestamps: true
  params:
    collect[]:
    - clustering
    - datapath
    - meminfo
    - textfile
    - lightfield
    - netstat
    - netdev
    - cpufreq
  file_sd_configs:
  - refresh_interval: 10s
    files:
    - 'file_sd_configs/lightbox-exporter/*.yaml'
And copy the other two associated rule configuration files to the Prometheus configuration file folder.
[root@localhost prometheus]# cp alert.rules.yaml record.rules.yaml /usr/local/prometheus/
Copy the Lightbits reference target files to the Prometheus configuration file folder, and update the IP address of the
Lightbits cluster.
[root@localhost ~]# tree monitoring-clustering/file_sd_configs
monitoring-clustering/file_sd_configs
├── api
│   └── targets.yaml
└── lightbox-exporter
    └── targets.yaml
[root@localhost ~]# cp monitoring-clustering/file_sd_configs /usr/local/prometheus/ -a
Update the prometheus.yml with the new location of the target files.
[root@localhost prometheus]# vim /usr/local/prometheus/prometheus.yml
...
files:
- 'file_sd_configs/lightbox-exporter/*.yaml'
...
Below is a screenshot of the cluster_tab dashboard. This is composed of multiple sections of graphs, statistics, and
tables.
Click on each artifact’s arrow button. Clicking the View option will expand the window to full screen.
Using Prometheus
Log in to Prometheus using the access instructions in Configuring Grafana and Prometheus.
Prometheus can be used to query any of the time series metrics received from a Lightbits cluster. The metrics come in at
the cluster level and node level. This means that most metrics can be viewed for each node and also for the cluster as a
whole. Prometheus is also helpful in figuring out the full names of metrics, which then can be used for creating dashboards
in Grafana.
As an example, let’s look at the write bandwidth for the whole cluster. The values will be shown in their raw format.
We can assume that this will be in “bytes/seconds”; however, if this is not the case, we could compare with other known
values.
Step 1
Make sure Use Local Time and Enable Autocomplete are enabled. Local time will help in lining up the times to
your timezone, regardless of the server’s timezone. Autocomplete will help explore all of the different metrics.
Step 2
Start by writing “instance:cluster” into the expression field. As characters are entered, it will show available metrics in
the drop-down. As more characters are entered, the drop-down menu converges on specific metrics.
With Enable Autocomplete, as text is typed into the expression field, Prometheus will then show metrics that have
matching text as a drop-down.
As you enter more text, you will see fewer, more specific metrics.
Here we can see that we have “write_iops” and “write_throughput” as options. Since we want to know about write
bandwidth, the suitable metric would be “instance:cluster:write_throughput”.
Tip
One good way to know what to type into the Expression field is to study the drop-down. Another is to simply view all
of the available metrics.
To view all possible Prometheus metrics, curl, wget or open your browser to Prometheus Metrics.
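For example, assuming Prometheus is reachable at localhost:9090, the full list of metric names can be pulled from the Prometheus HTTP API (a hedged example; adjust the host and port for your environment):
$ curl -s http://localhost:9090/api/v1/label/__name__/values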
The output will be large, but it will have all of the metrics. Here are example snippets of the output (searching for the
word “throughput”):
…
"instance:cluster:read_throughput","instance:cluster:total_read_bytes","instance:cluster:total_reads","instance:cluster:total_write_bytes","instance:cluster:total_writes","instance:cluster:write_iops","instance:cluster:write_throughput"…
"instance:node:read_throughput","instance:node:receive_bytes_total","instance:node:receive_drop_total","instance:node:receive_errs_total","instance:node:receive_packets_total","instance:node:replication_write_iops_rx","instance:node:replication_write_iops_tx","instance:node:replications_write_throughput_rx","instance:node:replications_write_throughput_tx"
Step 3
Finish typing “instance:cluster:write_throughput” into the Expression field, or select it from the drop-down menu, and
enter Execute.
Here we can see the raw value of the cluster write_throughput expressed in bytes. We can see that the current write
throughput is 98709485 bytes per second. This matches the fio job running in the background.
The following is the fio command that was launched from the same client.
root@rack02-server65 [client_0]:~# fio --direct=1 --rw=write --numjobs=8 --iodepth=1 --ioengine=posixaio --bs=4k --group_reporting=1 --filesize=1G --directory=/test/ --time_based=1 --runtime=3600s --name=test
Step 4
Click Graph to view the graph output. The duration, end time, and shading of the graph are adjustable with the buttons.
Here the graph shows the last 1 hour’s worth of data. However, any time period can be viewed by adjusting the values in
the boxes.
Note that there was a period of no throughput when the fio job was cancelled temporarily.
Step 5
In Prometheus, you can also:
• Create alerts (this can also be done in Grafana).
• Stack other metrics to compare. Click Add Panel and then follow the same steps above to add another expression.
As an example, in the screenshot below, another panel was added to the bottom showing the write IOPS metric of
the entire cluster, by using the expression "instance:cluster:write_iops".
3. Re-enter the iptables -nL command to see if the port is now open.
$ iptables -nL | grep 80
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:80
7. Run netcat against the server on which you are running iperf3 to verify that port 80 is accepting connections.
$ nc -z -v 192.168.16.7 80
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.16.7:80.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
These settings will be passed to all tasks accessing the web for the installation of RPMs and other binaries, through the
proxy settings provided.
Note: You will need to ensure that the formatting is correct (yaml formatting). This can be in a separate page.
Single-IP-Dual-NUMA Configuration
Starting from version 3.1.2, Lightbits supports dual NUMA configuration with single data IP (network interface) for the
server.
The typical installation uses one data IP for instance ID 0, and another data IP for instance ID 1, but with this feature
both instance IDs use the same IP.
Therefore, instead of using two network interfaces in a dual NUMA server, only one network interface will be utilized for
the data network in both NUMAs.
To support a single IP for both NUMA nodes, Duroslight and the replicator use different ports for each NUMA.
Example
Configure this by appending the following lines into all.yml:
duroslight_ports:
  0: "4420"
  1: "4421"
replicator_ports:
  0: "22226"
  1: "22227"
The above settings allow Duroslight and replication to run off of the same IP (single IP) but with different ports for each
instance, so that two Lightbits instances can share a single data IP.
Note: Having the system JWT preconfigured introduces security concerns, because any lbcli command can now
be run. Therefore it’s important to ensure that the server is secured.
1. After deploying the cluster, grab the system jwt. On the Ansible installation host, the file is located at
~/lightos-system-jwt. Show the content of the file with cat:
cat ~/lightos-system-jwt
The output should show the token, as below. Note that the token has been cut for brevity.
export LIGHTOS_JWT=eyJhbGciOi<remaining jwt content>BaFEuMsT9gQNQA
Copy the jwt token portion (everything after “LIGHTOS_JWT=”). Note that its long output will span multiple lines of
terminal output; however, it should only take up one line in a file.
2. On a Lightbits server, edit /etc/lbcli/lbcli.yaml and append the jwt to the bottom.
jwt: <jwt>
The full content of /etc/lbcli/lbcli.yaml will be similar to this:
output-format: human-readable
dial-timeout: 5s
command-timeout: 60s
insecure-skip-tls-verify: true
debug: false
api-version: 2
insecure-transport: false
endpoint: https://127.0.0.1:443
jwt: eyJhbGciOi<remaining jwt content>BaFEuMsT9gQNQA
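With the jwt appended, lbcli commands on that server should no longer require the -J flag; for example (the list nodes command shown earlier in this guide):
$ lbcli list nodes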
About - Legal
Lightbits Labs (Lightbits) is leading the digital data center transformation by making high-performance elastic block
storage available to any cloud. Creators of the NVMe® over TCP (NVMe/TCP) protocol, Lightbits software-defined
storage is easy to deploy at scale and delivers performance equivalent to local flash to accelerate cloud-native applications
in bare metal, virtual, or containerized environments. Backed by leading enterprise investors including Cisco Investments,
Dell Technologies Capital, Intel Capital, JP Morgan Chase, Lenovo, and Micron, Lightbits is on a mission to make
high-performance elastic block storage simple, scalable and cost-efficient for any cloud.
www.lightbitslabs.com
[email protected]
US Offices
1830 The Alameda, San Jose, CA 95126
1412 Broadway 21st Floor, New York, NY 10018
Israel Office
17 Atir Yeda Street, Kfar Saba, Israel 4464313
The information in this document and any document referenced herein is provided for informational purposes only, is
provided as is and with all faults and cannot be understood as substituting for customized service and information that
might be developed by Lightbits Labs ltd for a particular user based upon that user’s particular environment. Reliance
upon this document and any document referenced herein is at the user’s own risk.
The software is provided “As is”, without warranty of any kind, express or implied, including but not limited to the
warranties of merchantability, fitness for a particular purpose and non-infringement. In no event shall the contributors or
copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise,
arising from, out of or in connection with the software or the use or other dealings with the software.
Unauthorized copying or distributing of included software files, via any medium is strictly prohibited.
COPYRIGHT (C) 2023 LIGHTBITS LABS LTD. - ALL RIGHTS RESERVED