GPU Direct Storage: Installation and Operations Guide 2025

Overview

If you are unfamiliar with GPU Direct Storage (GDS), this guide may not make sense. Read the background material below before proceeding.

Start with the NVIDIA GTC 2020 Conference slides for MagnumIO and GPU Direct for Storage. For deeper details, please read GPU Direct for Storage Design Guide. For NVIDIA GDS VAST Data specific information, please see Setting Up and Troubleshooting VAST Data.

This is not an introduction to GDS itself. GDS is a rich and complex set of interlinked components for NVIDIA GPU-based systems, and documenting it is beyond the scope of this post. The links above should provide a solid starting point for eager learners.

This document's sole purpose is to help people understand how GDS is installed, configured, and operated on a VAST Cluster. This knowledge is essential for internal testing and validation of new GDS code, new software configurations, or customer installations.

Prerequisites

A working GDS environment has the following dependencies on the base Operating System. This guide is written for Ubuntu 22.04, which is usually what DGX/HGX systems ship with.

Basic NFS client (sudo apt install nfs-common).
Mellanox OFED or DOCA OFED, depending on the NIC used for storage access. Mellanox OFED was the stack for ConnectX series NICs but has since been deprecated; DOCA OFED should work for both ConnectX and Bluefield NICs.
VAST NFS Client. This is well documented at the VAST NFS Client site. It should be built against the installed {M,D}OFED and the exact OS/kernel on the test system, following the download and installation instructions at the site.
NVIDIA driver and CUDA Toolkit.
NVIDIA Fabric Manager package, to manage inter-GPU communications.
The nvidia-gds package, which provides the crucial GDS driver (nvidia-fs) and libraries (libcufile-12-8 and libcufile-dev-12-8). Package names are shown here for CUDA 12-8.
The gds-tools-12-8 package.
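
As a quick way to see which of the packages listed above are still missing, something like the following sketch could be used. The function name missing_pkgs is hypothetical, and the package names in the usage comment are simply the ones named in this list; adjust them to the CUDA version in use.

```shell
#!/bin/sh
# Sketch only: missing_pkgs is a hypothetical helper, not part of any NVIDIA or VAST tool.
# It reads installed package names (one per line) on stdin and prints every
# required package (given as arguments) that does not appear in that list.
missing_pkgs() {
    installed=$(cat)
    for pkg in "$@"; do
        # -x matches the whole line, -F treats the name as a literal string
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
    done
}

# Example usage on Ubuntu (only packages in the "ii" installed state are listed):
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' | awk '$1 == "ii" { print $2 }' | \
#       missing_pkgs nfs-common nvidia-fs libcufile-12-8 libcufile-dev-12-8 gds-tools-12-8
```

An empty result means every package passed as an argument was found.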

Installation

Start with a clean slate (use sudo dpkg --purge to remove all NVIDIA-related packages).

# Download CUDA repo pin file
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600

# Add the CUDA repo (Ubuntu 22.04, x86_64)
wget https://developer.download.nvidia.com/compute/cuda/12.6.68/local_installers/cuda-repo-ubuntu2204-12-6-local_12.6.68-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-6-local_12.6.68-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-6-local/cuda-*-keyring.gpg /usr/share/keyrings/

sudo apt update

# THIS IS CRUCIAL - GDS is only supported with the NVIDIA Open Driver after CUDA 12-2.
sudo apt install -y nvidia-driver-575-open

sudo reboot 

#Ensure with nvidia-smi that the GPUs are visible and the correct Driver version shows up
nvidia-smi

Install CUDA Toolkit

This is for CUDA 12.8.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"

# Install once this repository is available
sudo apt update
sudo apt install cuda-toolkit-12-8

# Add environment variable to ~/.bashrc
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH

# Reboot and verify (example below is for 12-9 - will be similar for 12-8)

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Apr__9_19:24:57_PDT_2025
Cuda compilation tools, release 12.9, V12.9.41
Build cuda_12.9.r12.9/compiler.35813241_0
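
To confirm that the toolkit found on PATH is the release you intended to install, the release number can be pulled out of the nvcc -V output. This is a minimal sketch; nvcc_release is a hypothetical helper name, and it assumes the "release X.Y," wording shown in the output above.

```shell
#!/bin/sh
# Sketch only: nvcc_release is a hypothetical helper.
# It reads `nvcc -V` output on stdin and prints the release number (e.g. 12.9).
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}

# Example usage:
#   want=12.8
#   got=$(nvcc -V | nvcc_release)
#   [ "$got" = "$want" ] || echo "nvcc on PATH is release $got, expected $want" >&2
```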

Install NVIDIA Fabric Manager

sudo apt update
sudo apt install nvidia-fabricmanager-575 libnvidia-nscq-575
sudo systemctl enable nvidia-fabricmanager.service
sudo systemctl start nvidia-fabricmanager.service
sudo systemctl status nvidia-fabricmanager.service

The RDMA stack has to be installed next. Which one depends on the NIC used for VAST NFS access: for Mellanox ConnectX-6 or ConnectX-7 NICs, the Mellanox OFED should be installed; for a Bluefield-3 NIC, the DOCA OFED should be installed.

Note that the Mellanox OFED has been deprecated; the DOCA OFED is a superset of the Mellanox OFED, so it should be the starting point.

Install OFED

The DOCA OFED can be found at: NVIDIA DOCA Downloads.
The chain to follow for an Ubuntu 22.04 client machine is:
Host Server->DOCA Host->Linux->x86_64->doca-ofed->ubuntu->22.04->deb(local)
Following this chain yields the installation instructions:

wget https://www.mellanox.com/downloads/DOCA/DOCA_v3.1.0/host/doca-host_3.1.0-091000-25.07-ubuntu2204_amd64.deb
sudo dpkg -i doca-host_3.1.0-091000-25.07-ubuntu2204_amd64.deb
sudo apt-get update
sudo apt-get -y install doca-ofed

Install VAST NFS Client

Next, the VAST NFS client should be installed (I usually build from source using ./build.sh bin), following the instructions. The DOCA OFED is usually installed with DKMS support. The VAST NFS Client will follow suit.

Make sure to do the following:

# After installing the VAST NFS Client with dpkg -i <VAST NFS Client Debian>
sudo update-initramfs -u -k `uname -r`
sudo reboot

After this, we should have a system with the NVIDIA Driver, NVIDIA Fabric Manager, CUDA Toolkit, DOCA OFED, and the VAST NFS Client. You can and should test that NFS mounts, multipath, and RDMA mounts work. For RDMA, jumbo frames are strongly recommended - MTU 9216 (over 9000 is fine). The network should support RDMA end-to-end.
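
The jumbo-frame recommendation above can be checked per interface by reading the MTU from sysfs. The sketch below assumes a Linux client; the function names are hypothetical, and the 9000 threshold is taken from the "over 9000 is fine" guidance.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical names.

# mtu_is_jumbo succeeds when the given MTU value is numeric and at least 9000.
mtu_is_jumbo() {
    [ "$1" -ge 9000 ] 2>/dev/null
}

# check_iface_mtu reads the MTU of one interface from sysfs.
# Returns 0 for jumbo frames, 1 for a smaller MTU, 2 if the interface is not found.
check_iface_mtu() {
    iface=$1
    mtu=$(cat "/sys/class/net/$iface/mtu" 2>/dev/null) || return 2
    if mtu_is_jumbo "$mtu"; then
        echo "$iface: MTU $mtu (jumbo frames on)"
    else
        echo "$iface: MTU $mtu (below 9000)"
        return 1
    fi
}

# Example usage (interface names are site-specific):
#   check_iface_mtu enp216s0f0
```

This only checks the client side; switches and the storage side need matching MTUs for RDMA to work end-to-end.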

Sanity Checks and Network Configuration

# Check that everything worked
ofed_info -s # Should show DOFED version
dpkg -l|grep vast # Should show package installed
# Mount VAST using the following mount command next
sudo mount -v -o vers=3,proto=rdma,port=20049,spread_reads,spread_writes,nconnect=8,localports=172.200.3.213~172.200.3.250,remoteports=172.200.4.1-172.200.4.16 172.200.4.1:/nosquash /mnt/multipath
# Here, nconnect is the TOTAL number of transports between the client NIC ports and the target VIPs on the c-nodes.
# So if we had 1 localport, 4 remoteports and nconnect=16, we would have 4 transports per remoteport VIP.
# Note that typically the properties for these ports are handled by netplan
# Config files for netplan are in /etc/netplan/* - set IP, DNS, gateways, DHCP, MTU, CIDR etc here
# Sample netplan - in YAML
# This file describes the network interfaces available on your system - example is for 2 NIC ports.
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: no
      addresses: [10.61.10.51/16]
      gateway4: 10.61.254.254
      nameservers:
        addresses: [8.8.4.4]
    enp216s0f0:
      mtu: 9216
      dhcp4: false
      addresses: [172.200.3.242/16]
    enp216s0f1:
      mtu: 9216
      dhcp4: false
      addresses: [172.200.3.241/16]
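
Returning to the mount options: since nconnect is described above as the total number of transports, the number of transports per remote VIP is a simple division. The sketch below works through that arithmetic. Both function names are hypothetical, and range_size assumes a remoteports range whose two ends differ only in the last octet.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for the nconnect arithmetic.

# range_size FIRST-LAST: number of addresses in a range such as
# 172.200.4.1-172.200.4.16 (assumes only the last octet differs).
range_size() {
    first=${1%-*}
    last=${1#*-}
    echo $(( ${last##*.} - ${first##*.} + 1 ))
}

# transports_per_vip NCONNECT VIPS: integer division of the total transports
# across the remote VIPs. A result of 0 means there are fewer transports than VIPs.
transports_per_vip() {
    [ "$2" -gt 0 ] || return 1
    echo $(( $1 / $2 ))
}

# Example from the comment above: 4 remoteports and nconnect=16
#   transports_per_vip 16 4      # prints 4
```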

Install GDS

# To be safe get the linux headers for the nvidia-fs build
sudo apt install linux-headers-$(uname -r)
# Add the Nvidia GDS bundle. This should install libcufile and nvidia-fs appropriate to the CUDA Version
sudo apt install nvidia-gds-12-8 
sudo apt install gds-tools-12-8 # sets up tools and utilities.

Verify GDS

 /usr/local/cuda/gds/tools/gdscheck -p
 # Output should be something similar
 GDS release version: 1.14.0.30
 nvidia_fs version:  2.24 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA        : Unsupported
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Supported
 BeeGFS             : Unsupported
 ScaTeFS            : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_pci_p2pdma : false
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.per_buffer_cache_size_kb : 1024
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 64 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.scatefs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 fs.gpfs.gds_async_support: true
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 GPU index 1 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 GPU index 2 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 GPU index 3 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 GPU index 4 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 GPU index 5 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 GPU index 6 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 GPU index 7 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: Pass-through or enabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12040
 Platform: DGXA100 920-23687-2530-000, Arch: x86_64(Linux 5.15.0-1070-nvidia)
 Platform verification succeeded
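
For scripted validation, the two lines that matter most for a VAST NFS setup can be checked automatically. This is a rough sketch that assumes the output layout shown above; gds_ready is a hypothetical helper name, not a gdscheck feature.

```shell
#!/bin/sh
# Sketch only: gds_ready is a hypothetical helper.
# It reads `gdscheck -p` output on stdin and succeeds only if the NFS driver
# line reports "Supported" and the platform verification line is present.
gds_ready() {
    out=$(cat)
    printf '%s\n' "$out" | \
        grep -Eq '^[[:space:]]*NFS[[:space:]]*:[[:space:]]*Supported' || return 1
    printf '%s\n' "$out" | grep -q 'Platform verification succeeded' || return 1
}

# Example usage:
#   /usr/local/cuda/gds/tools/gdscheck -p | gds_ready && echo "GDS over NFS looks ready"
```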

Tuning

If ACS is enabled in gdscheck, disable it using the script below.

#!/bin/bash
for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do
    sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
    if [ $? -ne 0 ]; then continue; fi
    sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
done
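
One way to confirm the result afterwards is to look at the ACSCtl lines in lspci -vvv output, where a SrcValid+ flag generally indicates that ACS is still enabled on that bridge. The sketch below counts such lines. The helper name is hypothetical, and treating SrcValid+ as the indicator is an assumption based on common practice rather than something this guide specifies.

```shell
#!/bin/sh
# Sketch only: acs_enabled_count is a hypothetical helper.
# It reads `lspci -vvv` output on stdin and prints how many ACSCtl lines
# still show SrcValid+ (assumed here to mean ACS is enabled on that device).
acs_enabled_count() {
    # grep -c exits non-zero when the count is 0, so mask that with || true
    grep 'ACSCtl:' | grep -c 'SrcValid+' || true
}

# Example usage:
#   sudo lspci -vvv | acs_enabled_count     # 0 is the goal after disabling ACS
```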

For optimal performance with GDS, PCI_WR_ORDERING has to be set to force_relax(1) for Storage NICs.

sudo mst start
sudo mlxconfig q|grep PCI_WR
         PCI_WR_ORDERING                     force_relax(1)  
# Default is per_mkey(0) so this needs to change to force_relax(1) if incorrect
# If this is not set right, this can be changed by first identifying the device
sudo mlxconfig q|grep -i device:
Device:         /dev/mst/mt4123_pciconf0
# Then the setting can be changed by
sudo mlxconfig -y -d /dev/mst/mt4123_pciconf0 set PCI_WR_ORDERING=1
sudo reboot
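
After the reboot, the setting can be re-checked on every queried device in one pass. A minimal sketch, assuming the mlxconfig q output format shown above; pci_wr_ordering_ok is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: pci_wr_ordering_ok is a hypothetical helper.
# It reads `mlxconfig q` output on stdin.
# Returns 0 if every PCI_WR_ORDERING line shows force_relax(1),
#         1 if at least one line shows another value,
#         2 if no PCI_WR_ORDERING line is present at all.
pci_wr_ordering_ok() {
    lines=$(grep 'PCI_WR_ORDERING') || return 2
    if printf '%s\n' "$lines" | grep -qv 'force_relax(1)'; then
        return 1
    fi
    return 0
}

# Example usage:
#   sudo mlxconfig q | pci_wr_ordering_ok || echo "PCI_WR_ORDERING needs attention" >&2
```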

Testing with `gdsio`

Ensure you understand the NUMA Affinity for each GPU using nvidia-smi topo -m. The example shown is for a DGX-A100. This will differ from case to case.

[Figure: `nvidia-smi topo -m` output showing the connections between GPUs (GPU0-GPU7) and network interfaces, with the CPU affinity settings listed in the last column. Caption: NUMA Affinity for each GPU]

Quick note: Resist the temptation to modify anything in /etc/cufile.json. Other vendors need to make changes there as their code has twists and turns, but VAST works out of the box.

For GDS Writes use the following (For DGX-A100)

#!/bin/bash -x
WORKERS=64; IO_TYPE=1 ; XFER_TYPE=0 ; IO_SIZE=1M ; DATASET_SIZE=5G
/usr/local/cuda/gds/tools/gdsio -F -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
             -D /mnt/gds/gdsio_test -d 0 -n 3 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 1 -n 3 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 2 -n 1 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 3 -n 1 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 4 -n 7 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 5 -n 7 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 6 -n 5 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 7 -n 5 -w $WORKERS
# Where:
# WORKERS = number of GPU threads
# IO_TYPE =0 is READS and =1 is WRITES
# XFER_TYPE =0 is VAST-->GPU MEM - GPU Direct Mode 
#           =1 is VAST-->SYSMEM
#           =2 is VAST-->SYSMEM-->GPU MEM (standard data path for GPU work)
# IO_SIZE size of IO
# DATASET_SIZE is size of each file created. Need one for each WORKER thread
# -D is directory path where files exist
# -d is GPU device ID 
# -n is the NUMA Affinity

For GDS Reads use the following (For DGX-A100)

#!/bin/bash -x
WORKERS=64; IO_TYPE=0 ; XFER_TYPE=0 ; IO_SIZE=1M ; DATASET_SIZE=5G
/usr/local/cuda/gds/tools/gdsio -F -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
             -D /mnt/gds/gdsio_test -d 0 -n 3 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 1 -n 3 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 2 -n 1 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 3 -n 1 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 4 -n 7 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 5 -n 7 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 6 -n 5 -w $WORKERS \
             -D /mnt/gds/gdsio_test -d 7 -n 5 -w $WORKERS
# WORKERS = number of GPU threads
# IO_TYPE =0 is READS and =1 is WRITES
# XFER_TYPE =0 is VAST-->GPU MEM - GPU Direct Mode 
#           =1 is VAST-->SYSMEM
#           =2 is VAST-->SYSMEM-->GPU MEM (standard data path for GPU work)
# IO_SIZE size of IO
# DATASET_SIZE is size of each file created. Need one for each WORKER thread
# -D is directory path where files exist
# -d is GPU device ID 
# -n is the NUMA Affinity
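
Both scripts repeat the same -D/-d/-n/-w group once per GPU, so on a system with a different topology the list has to be rewritten by hand. One possible way to generate it from GPU:NUMA pairs is sketched below. gdsio_targets is a hypothetical helper, and the pairs in the usage comment are just the DGX-A100 values used above.

```shell
#!/bin/sh
# Sketch only: gdsio_targets is a hypothetical helper, not a gdsio feature.
# Usage: gdsio_targets DIR WORKERS GPU:NUMA [GPU:NUMA ...]
# Prints one "-D DIR -d GPU -n NUMA -w WORKERS" group per pair, on a single line.
gdsio_targets() {
    dir=$1
    workers=$2
    shift 2
    args=""
    for pair in "$@"; do
        gpu=${pair%%:*}     # text before the colon
        numa=${pair##*:}    # text after the colon
        args="$args -D $dir -d $gpu -n $numa -w $workers"
    done
    printf '%s\n' "${args# }"
}

# Example usage (IO_TYPE 0 = read, 1 = write). The command substitution is left
# unquoted on purpose so it splits into separate arguments, which means DIR
# must not contain spaces:
#   IO_TYPE=0
#   gdsio -F -x 0 -I $IO_TYPE -i 1M -s 5G \
#       $(gdsio_targets /mnt/gds/gdsio_test 64 0:3 1:3 2:1 3:1 4:7 5:7 6:5 7:5)
```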

Checks to Ensure GDS is Working

How does one know that GDS is working? First, anywhere GDS is run, a log file ./cufile.log is created. When there are no errors, this file will be empty. If there are errors here, there is usually some issue to fix.

GDS will fall back to POSIX IO without any fanfare if GDS is not working. You will see messages in cufile.log when this happens. The way to ensure that TRUE GDS is working is to study specific structures in /proc that are associated with the nvidia-fs driver, as GDS has to use it.
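
Because the fallback is silent, it can help to check the log automatically after each run. The sketch below treats an absent or empty cufile.log as clean and anything else as worth a look. cufile_log_clean is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: cufile_log_clean is a hypothetical helper.
# It succeeds if the given log file is absent or empty, and fails if it has content.
cufile_log_clean() {
    [ ! -s "$1" ]
}

# Example usage, run from the directory where the GDS job was started:
#   if cufile_log_clean ./cufile.log; then
#       echo "cufile.log is empty"
#   else
#       echo "cufile.log has messages:" >&2
#       tail -n 20 ./cufile.log >&2
#   fi
```

An empty log is only a first signal; the nvidia-fs counters remain the stronger evidence that the GDS path is actually in use.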

To enable stats for nvidia-fs

#!/bin/bash
echo 1 |sudo tee /sys/module/nvidia_fs/parameters/rw_stats_enabled

To observe GDS IOs, watch the counters below. They will change and be non-zero when GDS is being used; if the counters increment, GDS is working.

watch -n 1 cat /proc/driver/nvidia-fs/stats

# Output will update every second
Every 1.0s: cat /proc/driver/nvidia-fs/stats                                                                                                                                                        selab-nvidia-dgx: Tue Sep  9 13:44:38 2025
GDS Version: 1.13.1.3
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.24.3)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Disabled
Logging level: info
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads                           : err=0 io_state_err=0
Sparse Reads                    : n=0 io=0 holes=0 pages=0
Writes                          : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap                            : n=0 ok=0 err=0 munmap=0
Bar1-map                        : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error                           : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops                             : Read=0 Write=0 BatchIO=0
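
For unattended runs, where nobody is watching the screen, two snapshots of the stats file can be compared instead. The sketch below sums every key=<integer> counter in a snapshot. This is a deliberately crude signal that only shows whether something changed, and which counters appear may depend on the driver version and on whether stats are enabled. nvfs_counter_total is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: nvfs_counter_total is a hypothetical helper.
# It reads the text of /proc/driver/nvidia-fs/stats on stdin and prints the
# sum of all values that appear as key=<integer> tokens.
nvfs_counter_total() {
    tr ' \t' '\n\n' | \
        sed -n 's/^[A-Za-z0-9_-]*=\([0-9][0-9]*\)$/\1/p' | \
        awk '{ s += $1 } END { print s + 0 }'
}

# Example usage around a test run:
#   before=$(nvfs_counter_total < /proc/driver/nvidia-fs/stats)
#   ./run_gdsio_test.sh          # hypothetical script name
#   after=$(nvfs_counter_total < /proc/driver/nvidia-fs/stats)
#   [ "$after" -gt "$before" ] && echo "nvidia-fs counters moved"
```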
