Persisting NVMe-oF over TCP Across Reboots


Overview

This document describes a host-side, supported approach to ensure NVMe over TCP (NVMe/TCP) devices are reliably discovered and usable after reboot. It includes copy/paste snippets, recommended systemd ordering, and an explanation of key flags.

Scope

  • Linux hosts that connect to remote NVMe namespaces using NVMe-oF over TCP (nvme connect).

  • Goal: After reboot, the host consistently loads NVMe modules, connects to subsystems, and mounts filesystems.

Storage-side support required?

  • No. Everything in this document is host-side (kernel modules, initramfs, systemd ordering, and mounts). Storage is only expected to present NVMe subsystems as usual.

Boot chain

To guarantee NVMe/TCP is ready after boot, implement two layers of module loading plus a deterministic connect-and-mount sequence:

  1. Initramfs includes NVMe modules (the earliest practical point).

  2. systemd-modules-load loads them at boot (backup layer).

  3. A systemd connect service runs after the network is online.

  4. Filesystems are mounted via fstab using safe options.

Correct order

  • Modules loaded

  • Network online

  • NVMe Connect Service

  • mounts (fstab automount or normal mount)

A. Identify required kernel modules

For NVMe/TCP, you typically need these modules:

  • nvme

  • nvme-core (sometimes appears as nvme_core in tooling output)

  • nvme-tcp

  • nvme-fabrics (usually pulled in automatically as a dependency of nvme-tcp)

Check what is currently loaded

lsmod | grep '^nvme'

Confirm the module exists on the system

modinfo nvme-tcp

If modinfo nvme-tcp returns information, the module is available.
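Note: on some kernels the NVMe drivers are compiled into the kernel image rather than built as loadable modules, in which case modinfo and lsmod may come up empty even though the driver is present. As a sketch (assuming the standard module index location), you can check the built-in list:

```shell
# Built-in drivers are listed in modules.builtin instead of shipping as .ko files
grep -i nvme "/lib/modules/$(uname -r)/modules.builtin"
```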

B. Load NVMe modules early using initramfs (best practice)

Initramfs is loaded before most of the OS and services. Including NVMe modules here is the most reliable way to ensure the kernel can bring up NVMe/TCP early.

Rocky / RHEL / CentOS (dracut)

  1. Create a dracut config snippet:

    File: /etc/dracut.conf.d/nvme-tcp.conf

sudo mkdir -p /etc/dracut.conf.d
sudo tee /etc/dracut.conf.d/nvme-tcp.conf >/dev/null <<'EOF'
add_drivers+=" nvme nvme-core nvme-tcp "
EOF

  2. Rebuild initramfs:

    sudo dracut -f

  3. Verify modules are in initramfs:

    lsinitrd /boot/initramfs-$(uname -r).img | grep -i nvme

Ubuntu / Debian (initramfs-tools)

  1. Add modules to the initramfs list:

    File: /etc/initramfs-tools/modules

sudo tee -a /etc/initramfs-tools/modules >/dev/null <<'EOF'
nvme
nvme-core
nvme-tcp
EOF

  2. Update initramfs:

    sudo update-initramfs -u

  3. Verify contents:

    lsinitramfs /boot/initrd.img-$(uname -r) | grep -i nvme | head

C. Load NVMe modules at boot using modules-load.d (backup layer)

This is a safety net in case initramfs was not rebuilt correctly, or a kernel change occurs.

File: /etc/modules-load.d/nvme.conf

sudo tee /etc/modules-load.d/nvme.conf >/dev/null <<'EOF'
nvme
nvme-core
nvme-tcp
EOF

Apply without reboot:

sudo systemctl restart systemd-modules-load.service
sudo systemctl status systemd-modules-load.service --no-pager
lsmod | grep '^nvme'

D. Create a reliable NVMe/TCP connect service (systemd)

This ensures your NVMe connections come up automatically after the network is ready and after modules are available.

  1. Create a connect script

    File: /usr/local/sbin/vast-nvme-tcp-connect.sh

sudo tee /usr/local/sbin/vast-nvme-tcp-connect.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Connect to one or more subsystems.
# Replace with your values:
# - TRADDR: target IP/DNS
# - TRSVCID: target port (usually 4420)
# - NQN: subsystem NQN

CONNECTS=(
  # "traddr trsvcid nqn"
  "10.0.0.10 4420 nqn.2014-08.org.nvmexpress:uuid:aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
  # "10.0.0.11 4420 nqn.2014-08.org.nvmexpress:uuid:ffffffff-1111-2222-3333-444444444444"
)

for c in "${CONNECTS[@]}"; do
  read -r TRADDR TRSVCID NQN <<< "${c}"
  echo "Connecting: traddr=${TRADDR} trsvcid=${TRSVCID} nqn=${NQN}"

  # Idempotent connect: a duplicate "nvme connect" returns non-zero and would
  # abort the script (set -e), so skip subsystems that are already connected.
  if nvme list-subsys 2>/dev/null | grep -q "${NQN}"; then
    echo "Already connected to ${NQN}, skipping"
    continue
  fi

  nvme connect -t tcp -a "${TRADDR}" -s "${TRSVCID}" -n "${NQN}"
done

# Optional: wait briefly for /dev/nvme* device nodes to appear
udevadm settle
EOF

sudo chmod +x /usr/local/sbin/vast-nvme-tcp-connect.sh
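As an alternative to hard-coding NQNs, nvme-cli can query the target's discovery controller and connect to every subsystem it advertises. A sketch, reusing the example address above and assuming the target exposes a discovery service on the conventional port 8009:

```shell
# List subsystems advertised by the discovery controller (read-only)
sudo nvme discover -t tcp -a 10.0.0.10 -s 8009

# Connect to all advertised subsystems in one step
sudo nvme connect-all -t tcp -a 10.0.0.10 -s 8009
```

If you adopt this, check how your nvme-cli version handles already-established connections before substituting it for the loop in the script.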
  2. Create the systemd unit

File: /etc/systemd/system/vast-nvme-tcp.service

[Unit]
Description=NVMe/TCP connect (persist across reboot)
Wants=network-online.target
After=network-online.target systemd-modules-load.service

[Service]
Type=oneshot
RemainAfterExit=yes

# Guard rails: ensure modules exist even if initramfs/modules-load missed
ExecStartPre=/usr/sbin/modprobe nvme
ExecStartPre=/usr/sbin/modprobe nvme-tcp

ExecStart=/usr/local/sbin/vast-nvme-tcp-connect.sh

# Give enough time for network + connect
TimeoutStartSec=300

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable --now vast-nvme-tcp.service
sudo systemctl status vast-nvme-tcp.service --no-pager

Why these unit settings matter

  • Wants=network-online.target and After=network-online.target

    • Ensures the service runs only after the system considers networking “online”. This is important because NVMe/TCP requires an active network path to the target.

  • After=systemd-modules-load.service

    • Adds ordering so module loading happens before the connect attempt (backup layer).

  • ExecStartPre=modprobe ...

    • Hard guarantee that the kernel modules are present immediately before attempting a connect, even if earlier steps were skipped.

  • Type=oneshot + RemainAfterExit=yes

    • The connect is a one-time action at boot, but systemd treats the service as “active” once done, which is often convenient for dependency chains.

  • TimeoutStartSec=300

    • Prevents premature failure during slower boot/network conditions.
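Once the unit is installed, you can confirm that systemd registered these settings and dependencies as intended:

```shell
systemctl show vast-nvme-tcp.service -p Type -p RemainAfterExit -p Wants -p After
```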

E. Mounting: recommended fstab options and flag explanations

Once the NVMe namespace exists (example device: /dev/nvme0n1p1), you can mount it reliably using fstab. The key is to avoid boot hangs when networking or NVMe targets are temporarily unavailable.

Example fstab entry (using filesystem UUID)

Get the UUID first:

sudo blkid /dev/nvme0n1p1

Example /etc/fstab entry:

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /mnt/data xfs _netdev,nofail,x-systemd.automount,x-systemd.device-timeout=30 0 2

If using ext4, replace xfs with ext4.

What each flag does

_netdev

  • Tells the boot process this filesystem depends on networking. This helps systemd and mount logic avoid attempting the mount before the network stack is ready.

  • Host-side only. No storage-side support required.

nofail

  • If the mount fails at boot (for example, target not reachable yet), the system continues booting instead of dropping into emergency mode. This is critical for resilience.

x-systemd.automount

  • Creates an automount unit so the filesystem mounts on first access instead of during early boot. This prevents boot delays and reduces sensitivity to timing issues.

x-systemd.device-timeout=30

  • Limits how long systemd waits for the block device to appear before considering the mount attempt failed. This avoids long boot stalls.

Optional flag (use only if you specifically want auto-unmount behavior):

x-systemd.idle-timeout=300

  • If set alongside x-systemd.automount, systemd may unmount the filesystem after 300 seconds of inactivity.

  • If you want the filesystem to remain mounted once accessed, do not use this.
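Behind the scenes, systemd-fstab-generator converts the fstab line into mount and automount units named after the escaped mount path. The unit names can be derived with systemd-escape, which is useful when inspecting them:

```shell
# Derive the unit names systemd generates for the /mnt/data mount point
systemd-escape -p --suffix=mount /mnt/data      # prints: mnt-data.mount
systemd-escape -p --suffix=automount /mnt/data  # prints: mnt-data.automount
```

You can then check the generated units directly, e.g. systemctl status mnt-data.automount.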

F. Verification steps (copy/paste)

  1. After reboot, verify modules are loaded

    lsmod | grep '^nvme'
  2. Verify module load timing and boot messages

    sudo journalctl -b | grep -E 'systemd-modules-load|nvme' | head -n 200
  3. Verify the connect service ordering and critical chain

    systemd-analyze critical-chain vast-nvme-tcp.service

  4. Verify the service logs

    sudo journalctl -u vast-nvme-tcp.service -b --no-pager

  5. Verify NVMe connections and namespaces

    nvme list
    nvme list-subsys

  6. Validate mounts

    If using x-systemd.automount, trigger mount by accessing the path:

    ls -la /mnt/data
    mount | grep '/mnt/data'

Common pitfalls and how this approach avoids them

  • Mount attempted before the network is ready.

    • Avoided by network-online.target ordering + _netdev + (optionally) automount.

  • Module not loaded at connect time

    • Covered by initramfs + modules-load + modprobe.

  • Boot delays or hangs

    • Avoided via nofail, x-systemd.automount, x-systemd.device-timeout.
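The whole chain can be exercised without a reboot by tearing the connections down and letting the service re-establish them. A sketch (nvme disconnect-all requires a reasonably recent nvme-cli; unmount the filesystem first if it is busy):

```shell
sudo systemctl stop vast-nvme-tcp.service
sudo umount /mnt/data 2>/dev/null || true   # release the namespace if mounted
sudo nvme disconnect-all                    # drop all NVMe-oF connections
sudo systemctl start vast-nvme-tcp.service
nvme list-subsys
```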

G. Persisting a udev rule

A udev rule does not establish NVMe/TCP connections or ensure reconnection at boot; it only applies configuration after a device or subsystem appears. Persistent connectivity should therefore be handled separately (e.g., via the systemd boot-time connect service and fstab mount dependencies described above). The rule below persists the subsystem I/O policy setting.

  1. Confirm the NVMe subsystem exists in sysfs

    ls -l /sys/class/nvme-subsystem/

  2. Confirm iopolicy exists for that subsystem

    sudo cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy

    Expected: a policy name such as numa or round-robin.

  3. Show which controller(s) belong to the subsystem

    ls -l /sys/class/nvme-subsystem/nvme-subsys0/

    You should see symlinks like nvme0 under that directory.

  4. Apply your udev rule (exact snippet)

    1. Create the persistent rule

sudo tee /etc/udev/rules.d/71-nvmf-iopolicy.rules >/dev/null <<'EOF'
# Persistently set NVMe subsystem I/O policy to round-robin when the subsystem is added/changed
ACTION=="add|change", SUBSYSTEM=="nvme-subsystem", ATTR{iopolicy}="round-robin"
EOF

    2. Reload rules

      sudo udevadm control --reload-rules

    3. Trigger the rule for nvme-subsystem devices

      sudo udevadm trigger --subsystem-match=nvme-subsystem

    4. Verify it changed

      sudo cat /sys/class/nvme-subsystem/nvme-subsys0/iopolicy

      The value should now read round-robin, confirming the rule matched the subsystem device path and the attribute was applied.
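If the value did not change, udevadm test simulates event processing for the device and prints which rules matched, which helps debug path or attribute problems (using the example subsystem path above):

```shell
sudo udevadm test /sys/class/nvme-subsystem/nvme-subsys0 2>&1 | grep -iE 'iopolicy|71-nvmf'
```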