VAST NFS Performance Troubleshooting Guide


SLIs and SLOs

Service level indicators (SLIs) and service level objectives (SLOs) are used in two main ways:

  1. Measure the health of a system.  

  2. Understand usage and trends for forecasting and procurement.

SLI                                            SLO
Write Throughput                               Refer to the sizer for your environment
Write Latency                                  10ms at up to 50% of max throughput
                                               15ms at up to 80% of max throughput
Read Throughput                                Refer to the sizer for your environment
Read Latency                                   4ms at up to 50% of max throughput
                                               8ms at up to 80% of max throughput
Create Latency                                 3ms at up to 50% of max throughput
(leading indicator for metadata write ops)     6ms at up to 80% of max throughput
Get Attribute Latency                          300µs (0.3ms) at up to 50% of max throughput
                                               600µs (0.6ms) at up to 80% of max throughput

Assumptions:

  1. Multipath is in use - ensuring uniform distribution of load on VAST compute nodes.

  2. The sizer measures 100% reads OR 100% writes.
    Most systems serve a combination of reads and writes, so assume reads and writes share the same resources.
    Therefore, if a system is serving 70% of its maximum read throughput, it can only serve 30% of its maximum write throughput.
    For example, if the sizer reports 100GBps reads and 50GBps writes, a cluster serving 70GBps of reads can support only 15GBps of writes (see the sketch below).
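
The headroom arithmetic above can be expressed as a short sketch (illustrative only; the maximum values are the example sizer numbers from the assumption above, not real measurements):

# headroom.py - illustrative only; plug in the sizer numbers for your cluster.
MAX_READ_GBPS = 100.0   # example value from the sizer
MAX_WRITE_GBPS = 50.0   # example value from the sizer

def write_headroom_gbps(current_read_gbps: float) -> float:
    """Remaining write throughput, assuming reads and writes share the same resources."""
    read_fraction = current_read_gbps / MAX_READ_GBPS
    return max(0.0, (1.0 - read_fraction) * MAX_WRITE_GBPS)

print(write_headroom_gbps(70.0))  # 15.0 GBps, matching the example above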

Create and getattr are good indicators for metadata write and read operations, respectively.
Additional metrics may be added over time.

Measurement Methodology

Latency should be measured both on the client and on the server for several reasons:

  1. VAST owns the server. The customer owns the client and network. The end-user owns the workload. We need visibility across the board to isolate where the problem is.

  2. There are subtle cases where VAST is reporting great latency, but the problem is still on VAST (the network stack queues data before the VAST application sees it).

Server Side Measurement

Write throughput:

rate(vast_cluster_metrics_ProtoMetrics_write_size_sum{proto_name="ProtoAll"}[5m])

Write latency:

rate(vast_cluster_metrics_ProtoMetrics_write_latency_sum{proto_name="ProtoAll"}[5m]) / rate(vast_cluster_metrics_ProtoMetrics_write_latency_count{proto_name="ProtoAll"}[5m])

Read throughput:

rate(vast_cluster_metrics_ProtoMetrics_read_size_sum{proto_name="ProtoAll"}[5m])

Read latency:

rate(vast_cluster_metrics_ProtoMetrics_read_latency_sum{proto_name="ProtoAll"}[5m]) / rate(vast_cluster_metrics_ProtoMetrics_read_latency_count{proto_name="ProtoAll"}[5m])

Create latency:

rate(vast_cluster_metrics_NfsMetrics_nfs_create_latency_sum[5m]) / rate(vast_cluster_metrics_NfsMetrics_nfs_create_latency_count[5m])

Get attribute latency:

rate(vast_cluster_metrics_NfsMetrics_nfs_getattr_latency_sum[5m]) / rate(vast_cluster_metrics_NfsMetrics_nfs_getattr_latency_count[5m])
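
These queries can also be evaluated programmatically against a Prometheus-compatible endpoint. Below is a minimal sketch, assuming a Prometheus server at http://prometheus:9090 (the URL is a placeholder, and the returned value is in whatever unit the metric uses, so compare it to the SLO in matching units):

# check_slo.py - minimal sketch; assumes a Prometheus-compatible API at PROM_URL.
import requests

PROM_URL = "http://prometheus:9090"  # placeholder; point at your Prometheus server

READ_LATENCY_QUERY = (
    'rate(vast_cluster_metrics_ProtoMetrics_read_latency_sum{proto_name="ProtoAll"}[5m])'
    ' / rate(vast_cluster_metrics_ProtoMetrics_read_latency_count{proto_name="ProtoAll"}[5m])'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": READ_LATENCY_QUERY})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    _, value = series["value"]      # instant query result: [timestamp, value]
    print(series["metric"], value)  # compare against the read latency SLO in matching units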

Client Side Measurement

Vnfs-Collector

VAST’s vNFS Collector is an open-source tool that provides deep visibility into NFS workloads by capturing detailed I/O metrics for every NFS mount. It tracks per-operation counters for all key NFSv3 and NFSv4 commands, including READ, WRITE, LOOKUP, and DELETE, along with contextual metadata such as mount points, process names, user IDs, and environment variables like SLURM_JOB_ID. This rich dataset enables accurate workload profiling and performance tuning. 

The collector supports flexible data forwarding, with local JSON logging and seamless integration into Prometheus (for Grafana dashboards), Kafka (for event-driven pipelines), and the VAST DataBase (for historical analytics via Trino, Spark, and Grafana).

For more information, visit: https://www.vastdata.com/blog/tracking-job-ids-enhancing-observability-and-efficiency-in-large-scale

  1. Ensure the collector is running on the client fleet.

$ sudo systemctl status vnfs-collector
● vnfs-collector.service - NFS Collector tracking per application/mount stats using ebpf
     Loaded: loaded (/usr/lib/systemd/system/vnfs-collector.service; enabled; preset: enabled)
     Active: active (running) since Fri 2025-06-20 06:02:58 UTC; 1 week 3 days ago
   Main PID: 265771 (vnfs-collector)
      Tasks: 4 (limit: 9440)
     Memory: 142.8M (peak: 144.6M)
        CPU: 8min 8.195s
     CGroup: /system.slice/vnfs-collector.service
             └─265771 /opt/vnfs-collector/src/venv/bin/python3 /usr/local/bin/vnfs-collector -C /opt/vnfs-collector/nfsops.yaml

Jun 20 06:02:58 blake-1 systemd[1]: Started vnfs-collector.service - NFS Collector tracking per application/mount stats using ebpf.
Jun 20 06:03:04 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:04 | nfsops | BPF version: 0.29.1
Jun 20 06:03:04 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:04 | nfsops | VNFS Collector<1.4-0> initialization
Jun 20 06:03:04 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:04 | nfsops | Configuration options: drivers=['file', 'prometheus'], interval=10, vaccum=600, envs=None, ebpf=False, squash-pid=True, tag-filter=None, anon-fields=None, c>
Jun 20 06:03:07 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:07 |  file  | Setting up driver.
Jun 20 06:03:07 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:07 |  file  | FileDriver(path=/opt/vnfs-collector/vnfs-collector.log, max_size_mb=200, max_backups=5) has been initialized.
Jun 20 06:03:07 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:07 | prometheus | Setting up driver.
Jun 20 06:03:07 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:07 | prometheus | PrometheusDriver(prom_exporter_host=0.0.0.0, prom_exporter_port=9100, buffer_size=1000) has been initialized.
Jun 20 06:03:13 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:13 | nfsops | All good! StatsCollector has been attached.
$
  2. Ensure the config is laid out as you’d like. This example is very basic, covering Prometheus scraping. The metrics are updated every 10 seconds via the `interval` key, and the `vaccum` key sets how often the application cleans up its internal process and mount information. (A verification sketch follows the config below.)

    1. More details are in the README: https://github.com/vast-data/vnfs-collector/blob/main/README.md

$ cat /opt/vnfs-collector/nfsops.yaml
# envs:
# - JOBID
# - SCHEDID
# - POD_NAME
interval: 10
vaccum: 600
# screen: {}
file:
 samples_path: /opt/vnfs-collector/vnfs-collector.log
 max_backups: 5
 max_size_mb: 200
# vdb:
#  db_endpoint: <endpoint>
#  db_access_key: <access_key>
#  db_secret_key: <secret_key>
#  db_bucket: <bucket>
#  db_schema: <schema>
#  db_table: <table>
prometheus:
 prom_exporter_host: 0.0.0.0
 prom_exporter_port: 9100
# kafka:
#  bootstrap_servers: <broker1:9093,broker2:9093>
#  topic: vnfs-collector
#  sasl_username: <username>
#  sasl_password: <password>
#  security_protocol: SASL_PLAINTEXT
$
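
To confirm the exporter is actually serving data, you can fetch its metrics endpoint. A minimal sketch, assuming the collector exports on port 9100 as configured above (the host is a placeholder):

# verify_exporter.py - minimal sketch; host/port should match the prometheus section of nfsops.yaml.
import urllib.request

EXPORTER_URL = "http://localhost:9100/metrics"  # placeholder host; port taken from the config above

with urllib.request.urlopen(EXPORTER_URL, timeout=5) as resp:
    body = resp.read().decode()

# Print only the vnfs_* series emitted by the collector.
for line in body.splitlines():
    if line.startswith("vnfs_"):
        print(line)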
  3. Ensure Prometheus (or an equivalent) is scraping the metrics. This is just one example; metrics could instead be pushed to Kafka.

    1. https://github.com/vast-data/vnfs-collector/blob/main/prometheus.yml

global:
  scrape_interval: 60s

scrape_configs:
  - job_name: 'vnfs-collector'
    static_configs:
      - targets: ['vnfs-collector:9000']

You’ll also need to add this Prometheus instance as a data source in Grafana (an important step).

  4. To graph create and getattr procedure latency and errors:

vnfs_create_duration{COMM=~"elbencho|fio"}
vnfs_getattr_duration{COMM=~"elbencho|fio"}
rate(vnfs_create_errors{COMM=~"elbencho|fio"}[1m])
rate(vnfs_getattr_errors{COMM=~"elbencho|fio"}[1m])
  5. To graph client-side bandwidth:

vnfs_read_bytes{COMM=~"elbencho|fio"}
vnfs_write_bytes{COMM=~"elbencho|fio"}
  6. To graph create and getattr latency:

quantile(.9, rate(node_mountstats_nfs_operations_response_time_seconds_total{operation=~"GETATTR|CREATE"}[5m])) by (cluster, cluster_org, operation, zone, node, mountaddr)

Troubleshooting Flow

  1. User is reporting “slowness”

  2. If the client metrics are NOT showing a breach of SLOs:

    1. Debug ‘application slowness’. Use VAST for assistance.

  3. If the client metrics ARE showing a breach of SLOs:

    1. If the metrics on the server are also breaching SLOs, it’s a VAST issue.

    2. Otherwise, if latency measured with tshark on the CNode is breaching SLOs, it’s still a VAST issue.

    3. Otherwise, it could be specific to a client or segment of the network:

      1. Is the client CPU saturated?

      2. Are other clients unaffected? If so, compare network paths.

Debugging Application Slowness

Using strace

For more strace-related troubleshooting, see Troubleshooting NFS Performance Using strace.

[vastdata@MetaSpaceWorker1 pandas]$ strace -f -c git clone https://github.com/pandas-dev/pandas.git
Cloning into 'pandas'...
strace: Process 116322 attached
strace: Process 116323 attached
strace: Process 116324 attached
strace: Process 116325 attached
strace: Process 116326 attached
remote: Enumerating objects: 407818, done.
remote: Counting objects: 100% (283/283), done.
remote: Compressing objects: 100% (225/225), done.
remote: Total 407818 (delta 187), reused 58 (delta 58), pack-reused 407535 (from 4)
Receiving objects: 100% (407818/407818), 360.63 MiB | 3.75 MiB/s, done.
strace: Process 116329 attached
strace: Process 116330 attached
strace: Process 116331 attached5)
Resolving deltas: 100% (342915/342915), done.
strace: Process 116334 attached
Updating files: 100% (2629/2629), done.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 52.20   12.154686     3038671         4           wait4
 23.69    5.516001          11    462246     75050 futex
 10.33    2.404826           3    750215        27 read
  9.12    2.124073           4    507410           pread64
  3.63    0.844071           2    419166           write
  0.49    0.113141          28      3996      1138 openat
  0.25    0.057483          12      4605      3868 newfstatat
  0.09    0.020865           7      2810           fstat
  0.05    0.012017           1      8768         8 rt_sigaction
  0.05    0.010589           3      2914           close
  0.03    0.006984          24       283         2 mkdir
  .... snip ....
  0.00    0.000012           2         6           getrandom
  0.00    0.000009           4         2           dup
  0.00    0.000009           1         5           set_tid_address
  0.00    0.000002           2         1           getpeername
  0.00    0.000000           0         6           dup2
  0.00    0.000000           0         3           link
------ ----------- ----------- --------- --------- ----------------
100.00   23.284532          10   2165567     80211 total
[vastdata@MetaSpaceWorker1 pandas]$

To trace an existing, running application, attach to its PID:

strace -f -c -p <pid>

Using tshark or Packet Captures

For more information on using tshark to troubleshoot NFS performance, see Troubleshooting NFS Performance Using tshark / pcap.

Installing tshark on a VAST CNode: 

$ sudo yum install wireshark-cli -y

$ sudo tcpdump -i any "port 2049" -w nfs-capture.pcap
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
^C418 packets captured
434 packets received by filter
0 packets dropped by kernel
[vastdata@MetaSpaceWorker1 pandas]$ tshark -r nfs-capture.pcap -z rpc,srt,100003,3 -q

===================================================================
RPC SRT Statistics:
Filter: nfs.procedure_v3
Index  Procedure              Calls    Min SRT    Max SRT    Avg SRT    Sum SRT
    1  GETATTR                     4   0.000375   0.000556   0.000478   0.001913
    2  SETATTR                     2   0.000227   0.001355   0.000791   0.001582
    3  LOOKUP                      6   0.000308   0.000620   0.000438   0.002625
    4  ACCESS                      2   0.000196   0.000388   0.000292   0.000584
    7  WRITE                      13   0.001457   0.005013   0.002347   0.030505
    8  CREATE                      6   0.001265   0.002410   0.001742   0.010449
==================================================================
$

You can also run tshark live, without using a capture file.

Replace ens33 with your active network interface

[vastdata@MetaSpaceWorker1 pandas]$ sudo tshark -i ens33 -z rpc,srt,100003,3 -q
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ens33'
^C^C109 packets dropped from ens33
191031 packets captured

===================================================================
RPC SRT Statistics:
Filter: nfs.procedure_v3
Index  Procedure              Calls    Min SRT    Max SRT    Avg SRT    Sum SRT
    1  GETATTR                    64   0.000110   0.000830   0.000491   0.031410
    3  LOOKUP                     74   0.000129   0.001219   0.000416   0.030801
    4  ACCESS                      1   0.000111   0.000111   0.000111   0.000111
    7  WRITE                     333   0.000510   0.010482   0.001789   0.595599
    8  CREATE                     69   0.001303   0.005677   0.002054   0.141704
==================================================================


$
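
The SRT output can also be checked programmatically against the SLOs from the table at the top of this guide. Below is a minimal sketch, assuming the capture file name from the earlier example and the 80%-of-max-throughput SLO values (both are assumptions; adjust them to your environment):

# srt_check.py - minimal sketch that parses `tshark -z rpc,srt,100003,3 -q` output
# and flags procedures whose average SRT exceeds the 80%-of-max-throughput SLOs above.
import subprocess

SLO_SECONDS = {          # per-procedure SLOs at up to 80% of max throughput (from the table above)
    "READ": 0.008,
    "WRITE": 0.015,
    "CREATE": 0.006,
    "GETATTR": 0.0006,
}

out = subprocess.run(
    ["tshark", "-r", "nfs-capture.pcap", "-z", "rpc,srt,100003,3", "-q"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    fields = line.split()
    # Data rows look like: Index Procedure Calls Min Max Avg Sum
    if len(fields) == 7 and fields[0].isdigit():
        proc, avg_srt = fields[1], float(fields[5])
        slo = SLO_SECONDS.get(proc)
        if slo is not None and avg_srt > slo:
            print(f"{proc}: avg SRT {avg_srt*1000:.2f}ms exceeds SLO {slo*1000:.2f}ms")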

Using eBPF tools

https://github.com/iovisor/bcc

sudo dnf install bcc-tools

Here’s an example using nfsslower, tracing NFS operations slower than 1ms:

[vastdata@MetaSpaceWorker1 pandas]$ sudo /usr/share/bcc/tools/nfsslower 1
Tracing NFS operations that are slower than 1 ms... Ctrl-C to quit
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
07:52:44 python3        77161  W 2402731 0           1.09 f000000384765.dat
07:52:45 python3        77162  W 2763137 0           1.22 f000000404635.dat
07:52:47 python3        77160  W 2825855 0           1.25 f000000502147.dat
07:52:47 python3        77153  W 3069437 0           1.28 f000000466010.dat
07:52:49 python3        77170  W 2794281 0           1.20 f000000429451.dat
07:52:52 python3        77169  W 2546840 0           1.07 f000000432319.dat
^C[vastdata@MetaSpaceWorker1 pandas]$
[vastdata@MetaSpaceWorker1 pandas]$

Identifying Virtual IPs Used By a Client

$ sudo vastnfs-ctl rpc-clients /mnt/nfs | grep state
		172.27.216.2, state: CONNECTED BOUND
		172.27.216.3, state: CONNECTED BOUND
		172.27.216.4, state: CONNECTED BOUND
		172.27.216.5, state: CONNECTED BOUND
		172.27.216.6, state: CONNECTED BOUND
		172.27.216.7, state: CONNECTED BOUND
		172.27.216.8, state: CONNECTED BOUND
		172.27.216.9, state: CONNECTED BOUND
		172.27.216.10, state: CONNECTED BOUND
		172.27.216.11, state: CONNECTED BOUND
		172.27.216.12, state: CONNECTED BOUND
		172.27.216.13, state: CONNECTED BOUND
		172.27.216.14, state: CONNECTED BOUND
		172.27.216.15, state: CONNECTED BOUND
		172.27.216.16, state: CONNECTED BOUND
		172.27.216.17, state: CONNECTED BOUND

The outputs of the vastnfs-ctl and tshark tools can be combined to see the latency of operations on a specific VIP:

$ tshark -r capture.pcap -z rpc,srt,100003,3,ip.addr==172.27.216.2 -q
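
The two commands above can be scripted into a per-VIP latency report. A minimal sketch, assuming the mount point and capture file names from the examples above (adjust to your environment):

# per_vip_srt.py - minimal sketch; combines the two commands above to print SRT stats per VIP.
import re
import subprocess

MOUNT = "/mnt/nfs"
CAPTURE = "capture.pcap"

# Extract the VIPs the client is currently bound to.
clients = subprocess.run(
    ["sudo", "vastnfs-ctl", "rpc-clients", MOUNT],
    capture_output=True, text=True, check=True,
).stdout
vips = re.findall(r"(\d+\.\d+\.\d+\.\d+), state: CONNECTED", clients)

# Run the per-VIP RPC SRT report for each VIP.
for vip in vips:
    print(f"--- {vip} ---")
    subprocess.run(
        ["tshark", "-r", CAPTURE, "-z", f"rpc,srt,100003,3,ip.addr=={vip}", "-q"],
        check=True,
    )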

Additional Data To Collect

sudo netstat -nap | grep ESTA

Escalating To VAST

  1. Collect the results generated by the commands above.

  2. Open a case with VAST Support and collect a support bundle while the issue is occurring.