Configuring an NVMe/TCP Client on Linux for VAST Cluster Block Storage

Complete these steps to enable an NVMe/TCP Linux client to access VAST block storage:

  1. Install the NVMe CLI tool.

  2. Create transport rules to enable the client to connect to the VAST Cluster block controller.

  3. Verify that the host's maximum number of retries for NVMe allows for maintaining high availability.

  4. Obtain the host NQN that will identify your host for the VAST cluster.

  5. Connect to the NVMe subsystem on the VAST cluster using your host NQN.

    This step requires that block volumes have been created on the VAST cluster and mapped to your host. It includes the following substeps:

    1. Load kernel modules to enable NVMe over Fabrics.

    2. Discover available VAST NVMe subsystems over TCP.

    3. Connect to the VAST NVMe subsystem you need.

  6. Verify the configuration by listing connected NVMe subsystems and block volumes.

  7. If necessary, troubleshoot your configuration.

Installing the Client

To configure client block hosts for interacting with the cluster as a remote NVMe device, install the NVMe CLI tool on the host:

sudo yum install nvme-cli
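
The command above assumes a yum-based distribution such as RHEL or CentOS. On Ubuntu or Debian hosts, the equivalent package is typically installed with apt:

sudo apt install nvme-cli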

Creating Transport Rules

Create transport rules to ensure that the client automatically discovers and connects or reconnects to the cluster's subsystems and volumes after reboot or new volume mappings:

  1. Create the following file:

    sudo vi /lib/udev/rules.d/71-nvmf-vastdata.rules 
  2. Add this content to the file:

    # Enable round-robin for Vast Data Block Controller
    ACTION=="add|change", SUBSYSTEM=="nvme-subsystem", ATTR{subsystype}=="nvm", ATTR{model}=="VASTData", RUN+="/bin/sh -c 'echo round-robin > /sys/class/nvme-subsystem/%k/iopolicy'"
    ACTION=="add|change", SUBSYSTEM=="nvme-subsystem", ATTR{subsystype}=="nvm", ATTR{model}=="VastData", RUN+="/bin/sh -c 'echo round-robin > /sys/class/nvme-subsystem/%k/iopolicy'"
  3. Run:

    sudo udevadm control --reload-rules
    sudo udevadm trigger
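
After the rules are reloaded and a VAST subsystem has been connected (see Connecting to Mapped Volumes below), you can confirm that the round-robin I/O policy took effect by reading it back from sysfs. The subsystem name (nvme-subsys0 in this sketch) varies per host; the expected output is round-robin:

cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy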

Verifying the Maximum Number of Retries for NVMe

To provide for high availability, the maximum number of retries configured on the host for NVMe commands must be set to a non-zero value. On most systems, the default parameter value is 5.

To check the maximum number of retries for NVMe, run one of the following:

  • cat /sys/module/nvme_core/parameters/max_retries
  • grep . /sys/module/nvme_core/parameters/*

To persistently set a maximum number of retries for NVMe to 5:

  • For hosts where /sys/module/nvme_core already exists:

    1. Create or edit this file:

      sudo nano /etc/modprobe.d/nvme_core.conf
    2. Add this line to the file:

      options nvme_core max_retries=5
    3. Run this command to have the new settings applied on boot:

      • On Ubuntu or Debian:

        sudo update-initramfs -u
      • On RHEL, CentOS or Fedora:

        sudo dracut -f
    4. Reboot the host:

      sudo reboot
    5. Verify that the maximum number of retries for NVMe is now set to 5:

      cat /sys/module/nvme_core/parameters/max_retries
  • For hosts where /sys/module/nvme_core does not exist:

    1. Edit GRUB:

      sudo nano /etc/default/grub
      
    2. Add the nvme_core.max_retries=5 string to the options in GRUB_CMDLINE_LINUX_DEFAULT, for example:

      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.max_retries=5"
    3. Apply the updates:

      • On Ubuntu or Debian:

        sudo update-grub
        
      • On RHEL, CentOS or Fedora:

        sudo grub2-mkconfig -o /boot/grub2/grub.cfg
        
    4. Reboot the host:

      sudo reboot
    5. Verify that the maximum number of retries for NVMe is now set to 5:

      cat /sys/module/nvme_core/parameters/max_retries
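
On hosts where /sys/module/nvme_core exists, the parameter is typically also writable at runtime through sysfs, which lets you adjust the value for the current session without rebooting. This is a temporary change and does not replace the persistent configuration above:

echo 5 | sudo tee /sys/module/nvme_core/parameters/max_retries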

Obtaining the Host NQN

The host NQN is generated automatically when you install nvme-cli. The host NQN must be specified in host properties maintained on the VAST cluster to allow for mapping VAST block volumes to the host.

To obtain your host NQN:

cat /etc/nvme/hostnqn
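
The output is a single NQN string, typically in the UUID-based format defined by the NVMe specification, for example (this matches the hostnqn value shown in the nvme list-subsys example later in this section):

nqn.2014-08.org.nvmexpress:uuid:27590942-6282-d720-eae9-fdd2d81355d4

If the file does not exist on your host, nvme-cli can generate one; this is a sketch, adjust for your environment:

sudo sh -c 'nvme gen-hostnqn > /etc/nvme/hostnqn'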

Connecting to Mapped Volumes

Note

This step requires that block volumes have been created on the VAST cluster and mapped to your host.

After volumes have been mapped to the host through VMS, you can do the following to connect to the mapped volumes:

  1. Load the necessary kernel modules to enable NVMe over Fabrics (NVMe-oF).

    • To load the modules once:

      sudo modprobe nvme
      sudo modprobe nvme-fabrics
    • To have the modules load automatically on reboot:

      1. Create a file /etc/modules-load.d/nvme.conf and list the modules to be loaded in it:

        nvme
        nvme-fabrics
        
      2. Run this command so that the new settings apply on boot:

        • On Ubuntu or Debian:

          sudo update-initramfs -u
        • On RHEL, CentOS or Fedora:

          sudo dracut -f
  2. Discover available VAST NVMe subsystems over TCP:

    sudo nvme discover -t tcp -a VIRTUAL_IP -s 8009

    For VIRTUAL_IP, provide a virtual IP from a virtual IP pool with the Protocol role that is accessible to the relevant block-enabled view. Note that the view policy may restrict which virtual IP pools can be used. Sample discovery output is shown after this procedure.

    Note

    For information about creating subsystems on the cluster, see Provisioning Block Storage with VMS and Creating a Block Storage Subsystem (View).

  3. Add the discovery parameters you used to /etc/nvme/discovery.conf so that the configuration persists across reboots:

    echo "-t tcp -a VIRTUAL_IP -s 8009" | sudo tee -a /etc/nvme/discovery.conf

    Replace VIRTUAL_IP with the actual virtual IP.

  4. Connect to a VAST NVMe subsystem:

    1. Obtain the subsystem NQN from VMS:

      1. In the VAST Web UI, open the Views tab in the Element Store page.

      2. Find the subsystem view.

      3. Right-click the view and select View to see its configuration.

        The NQN is displayed in the Subsystem NQN field.

      4. Click the copy-to-clipboard button to copy the NQN to your clipboard.

    2. Establish connection to the subsystem:

      sudo nvme connect -t tcp -n SUBSYSTEM_NQN -a VIRTUAL_IP

      where:

      • SUBSYSTEM_NQN is the subsystem NQN obtained in the previous step.

      • VIRTUAL_IP is a virtual IP configured on the cluster and accessible to the subsystem view.  

    3. Run the connect-all command to connect all paths:

      sudo nvme connect-all -t tcp -a VIRTUAL_IP -s 8009
    4. If volumes are later added to or removed from the subsystem, run connect-all again to update the volume mapping:

      sudo nvme connect-all
    5. Run the following command to ensure that your NVMe connection persists across reboots:

      sudo systemctl enable nvmf-autoconnect.service
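
For reference, a successful discovery (step 2 above) produces output similar to the following sketch. The record count, port details, and subsystem NQN depend on your cluster configuration; the NQN shown here matches the example in the next section:

Discovery Log Number of Records 1, Generation counter 1
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  1
trsvcid: 4420
subnqn:  nqn.2024-08.com.vastdata:ef992044-0c8e-557a-a629-4d3c9abd9f9d:default:subsystem-3
traddr:  172.27.133.1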

Listing Subsystems, Paths and Available Volumes

  • To display a list of connected NVMe subsystems and paths, use the nvme list-subsys command.

    For example:

    sudo nvme list-subsys
    nvme-subsys0 - NQN=nqn.2024-08.com.vastdata:ef992044-0c8e-557a-a629-4d3c9abd9f9d:default:subsystem-3
                   hostnqn=nqn.2014-08.org.nvmexpress:uuid:27590942-6282-d720-eae9-fdd2d81355d4
                   iopolicy=round-robin
    \
     +- nvme9 tcp traddr=172.27.133.9,trsvcid=4420 live
     +- nvme8 tcp traddr=172.27.133.8,trsvcid=4420 live
     +- nvme7 tcp traddr=172.27.133.7,trsvcid=4420 live
     +- nvme6 tcp traddr=172.27.133.6,trsvcid=4420 live
     +- nvme5 tcp traddr=172.27.133.5,trsvcid=4420 live
     +- nvme4 tcp traddr=172.27.133.4,trsvcid=4420 live
     +- nvme3 tcp traddr=172.27.133.3,trsvcid=4420 live
     +- nvme2 tcp traddr=172.27.133.2,trsvcid=4420 live
     +- nvme16 tcp traddr=172.27.133.16,trsvcid=4420 live
     +- nvme15 tcp traddr=172.27.133.15,trsvcid=4420 live
     +- nvme14 tcp traddr=172.27.133.14,trsvcid=4420 live
     +- nvme13 tcp traddr=172.27.133.13,trsvcid=4420 live
     +- nvme12 tcp traddr=172.27.133.12,trsvcid=4420 live
     +- nvme11 tcp traddr=172.27.133.11,trsvcid=4420 live
     +- nvme10 tcp traddr=172.27.133.10,trsvcid=4420 live
     +- nvme0 tcp traddr=172.27.133.1,trsvcid=4420 live
    
  • To display a list of connected NVMe volumes, use the nvme list command:

    sudo nvme list
    
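
Mapped volumes also appear as standard Linux block devices, so you can cross-check the nvme list output with lsblk. Device names such as nvme0n1 vary per host:

lsblk | grep nvme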

Disconnecting Existing Connections

The following commands disconnect the cluster's subsystems from the host:

  • To disconnect all connected subsystems:

    sudo nvme disconnect-all
  • To disconnect a specific subsystem:

    sudo nvme disconnect -n <NQN>

Troubleshooting

Issue: NVMe Subsystem Not Found

  • Cause: Incorrect IP address or network issue.

  • Solution: Verify that the virtual IP is correct and that the host has network connectivity to the VAST cluster.

Issue: NVMe Device Not Appearing

  • Cause: NVMe connection not established or missing kernel modules.

  • Solution: Ensure kernel modules are loaded using:

    sudo modprobe nvme
    sudo modprobe nvme-fabrics
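
    To confirm that the modules are loaded, check the kernel module list; both nvme and nvme_fabrics should appear:

    lsmod | grep nvme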

Logs and Diagnostics

Use dmesg to check kernel logs for errors related to NVMe:

dmesg | grep nvme
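
On systems with systemd-journald, the same kernel messages are also available through journalctl:

journalctl -k | grep -i nvme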

Issue: No Mapped Volumes

  • Symptoms:

    • sudo nvme discover output:

      Failed to write to /dev/nvme-fabrics: Connection refused
      Failed to add controller, error connection refused
    • dmesg error: nvme nvme0: failed to connect socket: -111

  • Root cause: There are no mapped volumes.

  • Fix: Verify that volumes are mapped to this client's host NQN.

Issues Connecting to the Target

  • Symptom: sudo nvme discover fails with:

      Failed to add controller, error cannot assign requested address

    Root cause: The source IP in the command line is incorrect.

  • Symptom: sudo nvme connect fails with:

      Failed to write to /dev/nvme-fabrics: Input/output error
      could not add new controller: failed to write to nvme-fabrics device

    Root cause: The IP used for connection is incorrect or does not belong to the relevant tenant.

  • Symptom: sudo nvme discover fails with:

      failed to get discovery log: Success

    Root cause: The IP used for connection is incorrect or does not belong to the relevant tenant.