Persisting NVMe-oF over TCP Across Reboots


Overview

This document describes a host-side, supported approach to ensure NVMe over TCP (NVMe/TCP) devices are reliably discovered and usable after reboot. It includes copy/paste snippets, recommended systemd ordering, and an explanation of key flags.

Scope

  • Linux hosts that connect to remote NVMe namespaces using NVMe-oF over TCP (nvme connect).

  • Goal: After reboot, the host consistently loads NVMe modules, connects to subsystems, and mounts filesystems.

Storage-side support required?

  • No. Everything in this document is host-side (kernel modules, initramfs, systemd ordering, and mounts). Storage is only expected to present NVMe subsystems as usual.

Boot chain

To guarantee NVMe/TCP is ready after boot, implement two layers of module loading plus a deterministic connect-and-mount sequence:

  1. Initramfs includes NVMe modules (the earliest practical point).

  2. systemd-modules-load loads them at boot (backup layer).

  3. A systemd connect service runs after the network is online.

  4. Filesystems are mounted via fstab using safe options.

Correct order

  • Modules loaded

  • Network online

  • NVMe Connect Service

  • mounts (fstab automount or normal mount)

A. Identify required kernel modules

For NVMe/TCP, you typically need these modules:

  • nvme

  • nvme-core (sometimes appears as nvme_core in tooling output)

  • nvme-tcp

  • nvme-fabrics (usually pulled in automatically as a dependency of nvme-tcp)

Check what is currently loaded

lsmod | grep '^nvme'

Confirm the module exists on the system

modinfo nvme-tcp

If modinfo nvme-tcp returns information, the module is available.
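Note: on some kernels the NVMe drivers are compiled into the kernel image rather than built as loadable modules, in which case modinfo and lsmod may come up empty even though the driver is present. As a sketch (assuming the standard module index location), you can check the built-in list:

```shell
# Built-in drivers are listed in modules.builtin instead of shipping as .ko files
grep -i nvme "/lib/modules/$(uname -r)/modules.builtin"
```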

B. Load NVMe modules early using initramfs (best practice)

Initramfs is loaded before most of the OS and services. Including NVMe modules here is the most reliable way to ensure the kernel can bring up NVMe/TCP early.

Rocky / RHEL / CentOS (dracut)

  1. Create a dracut config snippet:

    File: /etc/dracut.conf.d/nvme-tcp.conf

sudo mkdir -p /etc/dracut.conf.d
sudo tee /etc/dracut.conf.d/nvme-tcp.conf >/dev/null <<'EOF'
add_drivers+=" nvme nvme-core nvme-tcp "
EOF

  2. Rebuild initramfs:

    sudo dracut -f

  3. Verify modules are in initramfs:

    lsinitrd /boot/initramfs-$(uname -r).img | grep -i nvme

Ubuntu / Debian (initramfs-tools)

  1. Add modules to the initramfs list:

    File: /etc/initramfs-tools/modules

sudo tee -a /etc/initramfs-tools/modules >/dev/null <<'EOF'
nvme
nvme-core
nvme-tcp
EOF

  2. Update initramfs:

    sudo update-initramfs -u

  3. Verify contents:

    lsinitramfs /boot/initrd.img-$(uname -r) | grep -i nvme | head

C. Load NVMe modules at boot using modules-load.d (backup layer)

This is a safety net in case initramfs was not rebuilt correctly, or a kernel change occurs.

File: /etc/modules-load.d/nvme.conf

sudo tee /etc/modules-load.d/nvme.conf >/dev/null <<'EOF'
nvme
nvme-core
nvme-tcp
EOF

Apply without reboot:

sudo systemctl restart systemd-modules-load.service
sudo systemctl status systemd-modules-load.service --no-pager
lsmod | grep '^nvme'

D. Create a reliable NVMe/TCP connect service (systemd)

This ensures your NVMe connections come up automatically after the network is ready and after modules are available.

  1. Create a connect script

    File: /usr/local/sbin/vast-nvme-tcp-connect.sh

sudo tee /usr/local/sbin/vast-nvme-tcp-connect.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Connect to one or more subsystems.
# Replace with your values:
# - TRADDR: target IP/DNS
# - TRSVCID: target port (usually 4420)
# - NQN: subsystem NQN

CONNECTS=(
  # "traddr trsvcid nqn"
  "10.0.0.10 4420 nqn.2014-08.org.nvmexpress:uuid:aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
  # "10.0.0.11 4420 nqn.2014-08.org.nvmexpress:uuid:ffffffff-1111-2222-3333-444444444444"
)

for c in "${CONNECTS[@]}"; do
  read -r TRADDR TRSVCID NQN <<< "${c}"
  echo "Connecting: traddr=${TRADDR} trsvcid=${TRSVCID} nqn=${NQN}"

  # Idempotent connect: a duplicate "nvme connect" returns non-zero and would
  # abort the script (set -e), so skip subsystems that are already connected.
  if nvme list-subsys 2>/dev/null | grep -q "${NQN}"; then
    echo "Already connected to ${NQN}, skipping"
    continue
  fi

  nvme connect -t tcp -a "${TRADDR}" -s "${TRSVCID}" -n "${NQN}"
done

# Optional: wait briefly for /dev/nvme* device nodes to appear
udevadm settle
EOF

sudo chmod +x /usr/local/sbin/vast-nvme-tcp-connect.sh
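As an alternative to hard-coding NQNs, nvme-cli can query the target's discovery controller and connect to every subsystem it advertises. A sketch, reusing the example address above and assuming the target exposes a discovery service on the conventional port 8009:

```shell
# List subsystems advertised by the discovery controller (read-only)
sudo nvme discover -t tcp -a 10.0.0.10 -s 8009

# Connect to all advertised subsystems in one step
sudo nvme connect-all -t tcp -a 10.0.0.10 -s 8009
```

If you adopt this, check how your nvme-cli version handles already-established connections before substituting it for the loop in the script.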
  2. Create the systemd unit

File: /etc/systemd/system/vast-nvme-tcp.service

[Unit]
Description=NVMe/TCP connect (persist across reboot)
Wants=network-online.target
After=network-online.target systemd-modules-load.service

[Service]
Type=oneshot
RemainAfterExit=yes

# Guard rails: ensure modules exist even if initramfs/modules-load missed
ExecStartPre=/usr/sbin/modprobe nvme
ExecStartPre=/usr/sbin/modprobe nvme-tcp

ExecStart=/usr/local/sbin/vast-nvme-tcp-connect.sh

# Give enough time for network + connect
TimeoutStartSec=300

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable --now vast-nvme-tcp.service
sudo systemctl status vast-nvme-tcp.service --no-pager

Why these unit settings matter

  • Wants=network-online.target and After=network-online.target

    • Ensures the service runs only after the system considers networking “online”. This is important because NVMe/TCP requires an active network path to the target.

  • After=systemd-modules-load.service

    • Adds ordering so module loading happens before the connect attempt (backup layer).

  • ExecStartPre=modprobe ...

    • Hard guarantee that the kernel modules are present immediately before attempting a connect, even if earlier steps were skipped.

  • Type=oneshot + RemainAfterExit=yes

    • The connect is a one-time action at boot, but systemd treats the service as “active” once done, which is often convenient for dependency chains.

  • TimeoutStartSec=300

    • Prevents premature failure during slower boot/network conditions.
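Once the unit is installed, you can confirm that systemd registered these settings and dependencies as intended:

```shell
systemctl show vast-nvme-tcp.service -p Type -p RemainAfterExit -p Wants -p After
```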

E. Mounting: recommended fstab options and flag explanations

Once the NVMe namespace exists (example device: /dev/nvme0n1p1), you can mount it reliably using fstab. The key is to avoid boot hangs when networking or NVMe targets are temporarily unavailable.

Example fstab entry (using filesystem UUID)

Get the UUID first:

sudo blkid /dev/nvme0n1p1

Example /etc/fstab entry:

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /mnt/data xfs _netdev,nofail,x-systemd.automount,x-systemd.device-timeout=30 0 2

If using ext4, replace xfs with ext4.

What each flag does

_netdev

  • Tells the boot process this filesystem depends on networking. This helps systemd and mount logic avoid attempting the mount before the network stack is ready.

  • Host-side only. No storage-side support required.

nofail

  • If the mount fails at boot (for example, target not reachable yet), the system continues booting instead of dropping into emergency mode. This is critical for resilience.

x-systemd.automount

  • Creates an automount unit so the filesystem mounts on first access instead of during early boot. This prevents boot delays and reduces sensitivity to timing issues.

x-systemd.device-timeout=30

  • Limits how long systemd waits for the block device to appear before considering the mount attempt failed. This avoids long boot stalls.

Optional flag (use only if you specifically want auto-unmount behavior):

x-systemd.idle-timeout=300

  • If set alongside x-systemd.automount, systemd may unmount the filesystem after 300 seconds of inactivity.

  • If you want the filesystem to remain mounted once accessed, do not use this.
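Behind the scenes, systemd-fstab-generator converts the fstab line into mount and automount units named after the escaped mount path. The unit names can be derived with systemd-escape, which is useful when inspecting them:

```shell
# Derive the unit names systemd generates for the /mnt/data mount point
systemd-escape -p --suffix=mount /mnt/data      # prints: mnt-data.mount
systemd-escape -p --suffix=automount /mnt/data  # prints: mnt-data.automount
```

You can then check the generated units directly, e.g. systemctl status mnt-data.automount.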

F. Verification steps (copy/paste)

  1. After reboot, verify modules are loaded

    lsmod | grep '^nvme'
  2. Verify module load timing and boot messages

    sudo journalctl -b | grep -E 'systemd-modules-load|nvme' | head -n 200
  3. Verify the connect service ordering and critical chain

    systemd-analyze critical-chain vast-nvme-tcp.service

  4. Verify the service logs

    sudo journalctl -u vast-nvme-tcp.service -b --no-pager

  5. Verify NVMe connections and namespaces

    nvme list
    nvme list-subsys

  6. Validate mounts

    If using x-systemd.automount, trigger mount by accessing the path:

    ls -la /mnt/data
    mount | grep '/mnt/data'

Common pitfalls and how this approach avoids them

  • Mount attempted before the network is ready.

    • Avoided by network-online.target ordering + _netdev + (optionally) automount.

  • Module not loaded at connect time

    • Covered by initramfs + modules-load + modprobe.

  • Boot delays or hangs

    • Avoided via nofail, x-systemd.automount, x-systemd.device-timeout.
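The whole chain can be exercised without a reboot by tearing the connections down and letting the service re-establish them. A sketch (nvme disconnect-all requires a reasonably recent nvme-cli; unmount the filesystem first if it is busy):

```shell
sudo systemctl stop vast-nvme-tcp.service
sudo umount /mnt/data 2>/dev/null || true   # release the namespace if mounted
sudo nvme disconnect-all                    # drop all NVMe-oF connections
sudo systemctl start vast-nvme-tcp.service
nvme list-subsys
```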

G. Persisting a udev rule

A udev rule does not establish NVMe/TCP connections or ensure reconnection at boot; it only applies configuration after a device or subsystem appears. Persistent connectivity should therefore be handled separately (e.g., via the systemd boot-time connect service and fstab mount dependencies described above). The rule below persists the subsystem I/O policy setting.

  1. Confirm the NVMe subsystem exists in sysfs

    ls -l /sys/class/nvme-subsystem/

  2. Confirm iopolicy exists for that subsystem

    sudo cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy

    Expected: a policy name such as numa or round-robin.

  3. Show which controller(s) belong to the subsystem

    ls -l /sys/class/nvme-subsystem/nvme-subsys0/

    You should see symlinks like nvme0 under that directory.

  4. Apply your udev rule (exact snippet)

    1. Create the persistent rule

sudo tee /etc/udev/rules.d/71-nvmf-iopolicy.rules >/dev/null <<'EOF'
# Persistently set NVMe subsystem I/O policy to round-robin when the subsystem is added/changed
ACTION=="add|change", SUBSYSTEM=="nvme-subsystem", ATTR{iopolicy}="round-robin"
EOF

    2. Reload rules

      sudo udevadm control --reload-rules

    3. Trigger the rule for nvme-subsystem devices

      sudo udevadm trigger --subsystem-match=nvme-subsystem

    4. Verify it changed

      sudo cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy

      The value should now read round-robin, confirming the rule matched the subsystem device path and the attribute was applied.
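If the value did not change, udevadm test simulates event processing for the device and prints which rules matched, which helps debug path or attribute problems (using the example subsystem path above):

```shell
sudo udevadm test /sys/class/nvme-subsystem/nvme-subsys0 2>&1 | grep -iE 'iopolicy|71-nvmf'
```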