Overview
If you are unfamiliar with GPU Direct for Storage (GDS), read the background material below before proceeding; the rest of this post assumes it.
Start with the NVIDIA GTC 2020 Conference slides for MagnumIO and GPU Direct for Storage. For deeper details, please read GPU Direct for Storage Design Guide. For NVIDIA GDS VAST Data specific information, please see Setting Up and Troubleshooting VAST Data.
This is not an introduction to GDS itself. GDS is a rich and complex set of interlinked components for NVIDIA GPU-based systems, and documenting it is beyond the scope of this post. The links above should provide a solid starting point for eager learners.
This document's sole purpose is to explain how GDS is installed, configured, and operated with a VAST Cluster. This knowledge is essential for internal testing and validation of new GDS code, new software configurations, and customer installations.
Prerequisites
A working GDS environment has the following dependencies on the base Operating System. This guide is written for Ubuntu 22.04, which is usually what DGX/HGX systems ship with.
Basic NFS client (sudo apt install nfs-common).
Mellanox OFED or DOCA OFED, depending on the NIC used for storage access. Mellanox OFED supports the ConnectX series NICs but has been deprecated; the DOCA OFED should work for both ConnectX and Bluefield NICs.
VAST NFS Client. This is well documented at the VAST NFS Client site. It should be built against this {M,D}OFED and the exact OS/kernel on the test system, following the download and installation instructions at the site.
NVIDIA driver and CUDA Toolkit.
NVIDIA Fabric Manager package to manage inter-GPU communications.
The nvidia-gds package, which provides the crucial GDS driver (nvidia-fs) and libraries (libcufile-12-8 and libcufile-dev-12-8). Package names are shown here for CUDA 12-8.
The gds-tools-12-8 package.
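The prerequisite list above can be checked mechanically. The following is a minimal sketch, not part of the original procedure: the function name missing_packages is hypothetical, and it assumes the output of dpkg -l is piped in on stdin. It prints every required package that is not in the fully installed ("ii") state.

```shell
# Print each required package (given as arguments) that is not fully
# installed according to `dpkg -l` output read on stdin.
# missing_packages is a hypothetical helper name used for illustration.
missing_packages() {
    local installed pkg
    # Keep only rows in state "ii" and strip any ":arch" suffix from the name.
    installed="$(awk '$1 == "ii" { sub(/:.*$/, "", $2); print $2 }')"
    for pkg in "$@"; do
        # -x matches the whole line, -F treats the package name literally.
        if ! printf '%s\n' "$installed" | grep -qxF -- "$pkg"; then
            echo "$pkg"
        fi
    done
}

# Typical use on the client (not run here):
#   dpkg -l | missing_packages nfs-common nvidia-gds-12-8 gds-tools-12-8
```

Empty output means every named package is installed. The OFED and VAST NFS Client packages are left out of the example call because their package names depend on the version that was built.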
Installation
Start with a clean slate (use sudo dpkg --purge to remove all NVIDIA-related packages).
# Download CUDA repo pin file
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
# Add the CUDA repo (Ubuntu 22.04, x86_64)
wget https://developer.download.nvidia.com/compute/cuda/12.6.68/local_installers/cuda-repo-ubuntu2204-12-6-local_12.6.68-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-6-local_12.6.68-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-6-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt update
# THIS IS CRUCIAL - GDS is only supported with the NVIDIA Open Driver after CUDA 12-2.
sudo apt install -y nvidia-driver-575-open
sudo reboot
# Ensure with nvidia-smi that the GPUs are visible and the correct Driver version shows up
nvidia-smi
Install CUDA Toolkit
This is for CUDA 12.8.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
# Install once this repository is available
sudo apt update
sudo apt install cuda-toolkit-12-8
# Add environment variable to ~/.bashrc
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
# Reboot and verify (example below is for 12-9 - will be similar for 12-8)
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Apr__9_19:24:57_PDT_2025
Cuda compilation tools, release 12.9, V12.9.41
Build cuda_12.9.r12.9/compiler.35813241_0
Install NVIDIA Fabric Manager
sudo apt update
sudo apt install nvidia-fabricmanager-575 libnvidia-nscq-575
sudo systemctl enable nvidia-fabricmanager.service
sudo systemctl start nvidia-fabricmanager.service
sudo systemctl status nvidia-fabricmanager.service
The RDMA stack has to be installed next. If Mellanox ConnectX-6 or ConnectX-7 NICs are used for VAST NFS access, the Mellanox OFED should be installed. If the NIC is a Bluefield-3, the DOCA OFED should be installed.
Note that the Mellanox OFED has been deprecated; the DOCA OFED is a superset of the Mellanox OFED, so it should be the starting point.
Install OFED
The DOCA OFED can be found at: NVIDIA DOCA Downloads.
The chain to follow for an Ubuntu 22.04 client machine is: Host Server->DOCA Host->Linux->x86_64->doca-ofed->ubuntu->22.04->deb(local)
This yields the installation instructions, for example:
wget https://www.mellanox.com/downloads/DOCA/DOCA_v3.1.0/host/doca-host_3.1.0-091000-25.07-ubuntu2204_amd64.deb
sudo dpkg -i doca-host_3.1.0-091000-25.07-ubuntu2204_amd64.deb
sudo apt-get update
sudo apt-get -y install doca-ofed
Install VAST NFS Client
Next, the VAST NFS client should be installed (I usually build from source using ./build.sh bin), following the instructions. The DOCA OFED is usually installed with DKMS support. The VAST NFS Client will follow suit.
Make sure to do the following:
# After installing the VAST NFS Client with dpkg -i <VAST NFS Client Debian>
sudo update-initramfs -u -k `uname -r`
sudo reboot
After this, we should have a system with the NVIDIA Driver, NVIDIA Fabric Manager, CUDA Toolkit, DOCA OFED, and the VAST NFS Client. You can and should test that NFS mounts, multipath, and RDMA mounts work. For RDMA, jumbo frames are strongly recommended - MTU 9216 (over 9000 is fine). The network should support RDMA end-to-end.
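One quick way to confirm the jumbo-frame recommendation on the client side is to look at the MTU of the storage interfaces. The sketch below is an illustration only: ifaces_below_mtu is a hypothetical helper, it assumes the one-line-per-interface output of ip -o link show on stdin, and the interface names in the usage comment are the ones from the sample netplan in this post.

```shell
# Read `ip -o link show` output on stdin and print "<iface> <mtu>" for each
# named interface whose MTU is below the minimum given as the first argument.
# ifaces_below_mtu is a hypothetical helper name used for illustration.
ifaces_below_mtu() {
    local min="$1"
    shift
    local wanted=" $* "
    awk -v min="$min" -v wanted="$wanted" '
        {
            name = $2
            sub(/:$/, "", name)      # "enp216s0f0:" -> "enp216s0f0"
            sub(/@.*$/, "", name)    # drop "@parent" suffix on VLAN devices
            if (index(wanted, " " name " ") == 0) next
            for (i = 3; i < NF; i++)
                if ($i == "mtu" && $(i + 1) + 0 < min + 0) print name, $(i + 1)
        }'
}

# Typical use on the client (not run here):
#   ip -o link show | ifaces_below_mtu 9000 enp216s0f0 enp216s0f1
```

Empty output means all listed interfaces meet the minimum. This only checks the client; the switches and the VAST side need matching MTU settings for the path to carry jumbo frames end-to-end.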
Sanity Checks and Network Configuration
# Check that everything worked
ofed_info -s # Should show DOFED version
dpkg -l|grep vast # Should show package installed
# Mount VAST using the following mount command next
sudo mount -v -o vers=3,proto=rdma,port=20049,spread_reads,spread_writes,nconnect=8,localports=172.200.3.213~172.200.3.250,remoteports=172.200.4.1-172.200.4.16 172.200.4.1:/nosquash /mnt/multipath
# Here, nconnect is the TOTAL number of transports between the client NIC ports and the target VIPs on the c-nodes.
# So if we had 1 localport, 4 remoteports and nconnect=16, we would have 4 transports per remoteport VIP.
# Note that typically the properties for these ports are handled by netplan
# Config files for netplan are in /etc/netplan/* - set IP, DNS, gateways, DHCP, MTU, CIDR etc here
# Sample netplan - in YAML
# This file describes the network interfaces available on your system - example is for 2 NIC ports.
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: no
      addresses: [10.61.10.51/16]
      gateway4: 10.61.254.254
      nameservers:
        addresses: [8.8.4.4]
    enp216s0f0:
      mtu: 9216
      dhcp4: false
      addresses: [172.200.3.242/16]
    enp216s0f1:
      mtu: 9216
      dhcp4: false
      addresses: [172.200.3.241/16]
Install GDS
# To be safe get the linux headers for the nvidia-fs build
sudo apt install linux-headers-$(uname -r)
# Add the Nvidia GDS bundle. This should install libcufile and nvidia-fs appropriate to the CUDA Version
sudo apt install nvidia-gds-12-8
sudo apt install gds-tools-12-8 # sets up tools and utilities.
Verify GDS
/usr/local/cuda/gds/tools/gdscheck -p
# Output should be something similar
GDS release version: 1.14.0.30
nvidia_fs version: 2.24 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe P2PDMA : Unsupported
NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Supported
BeeGFS : Unsupported
ScaTeFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_pci_p2pdma : false
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.per_buffer_cache_size_kb : 1024
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 64
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.scatefs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
fs.gpfs.gds_async_support: true
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
GPU index 1 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
GPU index 2 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
GPU index 3 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
GPU index 4 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
GPU index 5 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
GPU index 6 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
GPU index 7 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
==============
PLATFORM INFO:
==============
IOMMU: Pass-through or enabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 12040
Platform: DGXA100 920-23687-2530-000, Arch: x86_64(Linux 5.15.0-1070-nvidia)
Platform verification succeeded
Tuning
If gdscheck reports that ACS is enabled, disable it using the script below.
#!/bin/bash
for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do
sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
if [ $? -ne 0 ]; then continue; fi
sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
done
For optimal performance with GDS, PCI_WR_ORDERING has to be set to force_relax(1) for the storage NICs.
sudo mst start
sudo mlxconfig q|grep PCI_WR
PCI_WR_ORDERING force_relax(1)
# Default is per_mkey(0) so this needs to change to force_relax(1) if incorrect
# If this is not set right, this can be changed by first identifying the device
sudo mlxconfig q|grep -i device:
Device: /dev/mst/mt4123_pciconf0
# Then the setting can be changed by
sudo mlxconfig -y -d /dev/mst/mt4123_pciconf0 set PCI_WR_ORDERING=1
sudo reboot
Testing with gdsio
Ensure you understand the NUMA Affinity for each GPU using nvidia-smi topo -m. The example shown is for a DGX-A100. This will differ from case to case.
Figure: NUMA affinity for each GPU (nvidia-smi topo -m output on a DGX-A100)
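The GPU-to-NUMA mapping can also be read without parsing the topology matrix. The sketch below is one possible approach, not the method used in this post: it assumes the "index, bus_id" lines printed by nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader and looks up each device's numa_node file in sysfs. The helper names bus_id_to_sysfs and gpu_numa_nodes are hypothetical.

```shell
# Convert a PCI bus ID as printed by nvidia-smi (8-digit domain, upper-case
# hex) into the form used under /sys/bus/pci/devices (4-digit domain,
# lower-case hex). Hypothetical helper name.
bus_id_to_sysfs() {
    printf '%s\n' "$1" | tr 'A-F' 'a-f' |
        sed -E 's/^[0-9a-f]*([0-9a-f]{4}):/\1:/'
}

# Read "index, bus_id" lines on stdin and print "<gpu index> <numa node>".
# The optional argument overrides the sysfs root (useful for testing).
gpu_numa_nodes() {
    local sysroot="${1:-/sys}"
    local idx bus node_file
    while IFS=', ' read -r idx bus; do
        node_file="$sysroot/bus/pci/devices/$(bus_id_to_sysfs "$bus")/numa_node"
        if [ -r "$node_file" ]; then
            echo "$idx $(cat "$node_file")"
        else
            echo "$idx unknown"
        fi
    done
}

# Typical use on the client (not run here):
#   nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader | gpu_numa_nodes
```

It is worth comparing the result against nvidia-smi topo -m before relying on it, since a numa_node value of -1 means the kernel has no NUMA information for that device.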
Quick note: Resist the temptation to modify anything in /etc/cufile.json. Other vendors need to make changes there as their code has twists and turns, but VAST works out of the box.
For GDS writes, use the following (example for DGX-A100):
#!/bin/bash -x
WORKERS=64; IO_TYPE=1 ; XFER_TYPE=0 ; IO_SIZE=1M ; DATASET_SIZE=5G
/usr/local/cuda/gds/tools/gdsio -F -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
-D /mnt/gds/gdsio_test -d 0 -n 3 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 1 -n 3 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 2 -n 1 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 3 -n 1 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 4 -n 7 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 5 -n 7 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 6 -n 5 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 7 -n 5 -w $WORKERS
# Where:
# WORKERS = number of GPU threads
# IO_TYPE =0 is READS and =1 is WRITES
# XFER_TYPE =0 is VAST-->GPU MEM - GPU Direct Mode
# =1 is VAST-->SYSMEM
# =2 is VAST-->SYSMEM-->GPU MEM (standard data path for GPU work)
# IO_SIZE size of IO
# DATASET_SIZE is size of each file created. Need one for each WORKER thread
# -D is directory path where files exist
# -d is GPU device ID
# -n is the NUMA Affinity
For GDS reads, use the following (example for DGX-A100):
#!/bin/bash -x
WORKERS=64; IO_TYPE=0 ; XFER_TYPE=0 ; IO_SIZE=1M ; DATASET_SIZE=5G
/usr/local/cuda/gds/tools/gdsio -F -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
-D /mnt/gds/gdsio_test -d 0 -n 3 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 1 -n 3 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 2 -n 1 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 3 -n 1 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 4 -n 7 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 5 -n 7 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 6 -n 5 -w $WORKERS \
-D /mnt/gds/gdsio_test -d 7 -n 5 -w $WORKERS
# WORKERS = number of GPU threads
# IO_TYPE =0 is READS and =1 is WRITES
# XFER_TYPE =0 is VAST-->GPU MEM - GPU Direct Mode
# =1 is VAST-->SYSMEM
# =2 is VAST-->SYSMEM-->GPU MEM (standard data path for GPU work)
# IO_SIZE size of IO
# DATASET_SIZE is size of each file created. Need one for each WORKER thread
# -D is directory path where files exist
# -d is GPU device ID
# -n is the NUMA Affinity
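The per-GPU argument groups in the scripts above repeat the same pattern, so they can be generated from a list of GPU:NUMA pairs. This is a minimal sketch under that assumption: gdsio_targets is a hypothetical helper, and the pairs in the usage comment are the DGX-A100 values used in this post.

```shell
# Print one "-D <dir> -d <gpu> -n <numa> -w <workers>" group per line for
# each "gpu:numa" pair given after the first two arguments.
# gdsio_targets is a hypothetical helper name used for illustration.
gdsio_targets() {
    local dir="$1" workers="$2"
    shift 2
    local pair
    for pair in "$@"; do
        echo "-D $dir -d ${pair%%:*} -n ${pair##*:} -w $workers"
    done
}

# Typical use for a read test (not run here); the unquoted substitution
# relies on word splitting, so the directory must not contain spaces:
#   gdsio -F -x 0 -I 0 -i 1M -s 5G \
#     $(gdsio_targets /mnt/gds/gdsio_test 64 0:3 1:3 2:1 3:1 4:7 5:7 6:5 7:5)
```

On a different platform only the pair list changes, which keeps the NUMA mapping in one place instead of eight hand-edited lines.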
Checks to Ensure GDS is Working
How does one know that GDS is working? First, wherever GDS is run, a log file ./cufile.log is created. This file should be empty, showing no errors. If it contains errors, there is usually an issue to fix.
GDS will fall back to POSIX IO without any fanfare if GDS is not working. You will see messages in cufile.log when this happens. The way to ensure that TRUE GDS is working is to study specific structures in /proc that are associated with the nvidia-fs driver, as GDS has to use it.
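The cufile.log check can be scripted after each run. The sketch below only tests whether the file exists and is non-empty; it does not interpret the messages. The helper name cufile_log_state is hypothetical.

```shell
# Report the state of cufile.log in the given directory (default: current):
# "missing", "empty", or "has-messages".
# cufile_log_state is a hypothetical helper name used for illustration.
cufile_log_state() {
    local log="${1:-.}/cufile.log"
    if [ ! -e "$log" ]; then
        echo "missing"
    elif [ -s "$log" ]; then
        echo "has-messages"
    else
        echo "empty"
    fi
}

# Typical use after a gdsio run, from the directory it was started in:
#   cufile_log_state .
```

A result of "has-messages" is a prompt to read the log, not proof of a fallback on its own.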
To enable stats for nvidia-fs
#!/bin/bash
echo 1 |sudo tee /sys/module/nvidia_fs/parameters/rw_stats_enabled
To observe GDS IOs, watch the counters below. They will change and be non-zero when GDS is being used; if the counters increment, GDS is working.
watch -n 1 cat /proc/driver/nvidia-fs/stats
# Output will update every second
Every 1.0s: cat /proc/driver/nvidia-fs/stats selab-nvidia-dgx: Tue Sep 9 13:44:38 2025
GDS Version: 1.13.1.3
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.24.3)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Disabled
Logging level: info
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads : err=0 io_state_err=0
Sparse Reads : n=0 io=0 holes=0 pages=0
Writes : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap : n=0 ok=0 err=0 munmap=0
Bar1-map : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops : Read=0 Write=0 BatchIO=0
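Rather than watching the counters by eye, two snapshots of the stats file can be compared around a test run. The following is a sketch under the assumption that the Ops line keeps the Read=, Write= and BatchIO= layout shown above; nvfs_ops and gds_ops_advanced are hypothetical helper names.

```shell
# Print "<read> <write> <batch>" counts from the "Ops" line of
# /proc/driver/nvidia-fs/stats text read on stdin.
# nvfs_ops is a hypothetical helper name used for illustration.
nvfs_ops() {
    awk '$1 == "Ops" {
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "Read")    r = kv[2]
            if (kv[1] == "Write")   w = kv[2]
            if (kv[1] == "BatchIO") b = kv[2]
        }
        print r + 0, w + 0, b + 0
    }'
}

# Succeed (exit 0) if the Ops counters differ between two snapshots,
# passed as the first and second argument.
gds_ops_advanced() {
    local before after
    before="$(printf '%s\n' "$1" | nvfs_ops)"
    after="$(printf '%s\n' "$2" | nvfs_ops)"
    [ -n "$after" ] && [ "$before" != "$after" ]
}

# Typical use around a test run (not run here):
#   before="$(cat /proc/driver/nvidia-fs/stats)"
#   ... run gdsio ...
#   after="$(cat /proc/driver/nvidia-fs/stats)"
#   gds_ops_advanced "$before" "$after" && echo "GDS IO observed"
```

If the counters stay the same across a run that completed successfully, the IO most likely went through the POSIX fallback path described above.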