SLIs and SLOs
Service level indicators (SLIs) and objectives (SLOs) are used to:
Measure the health of a system.
Understand usage and trends for forecasting and procurement.
| SLI | SLO |
|---|---|
| Write Throughput | Refer to the sizer for your environment |
| Write Latency | 10ms at up to 50% of max throughput; 15ms at up to 80% of max throughput |
| Read Throughput | Refer to the sizer for your environment |
| Read Latency | 4ms at up to 50% of max throughput; 8ms at up to 80% of max throughput |
| Create Latency (leading indicator for metadata write ops) | 3ms at up to 50% of max throughput; 6ms at up to 80% of max throughput |
| Get Attribute Latency | 300µs (0.3ms) at up to 50% of max throughput; 600µs (0.6ms) at up to 80% of max throughput |
Assumptions:
Multipath is in use, ensuring uniform distribution of load across VAST compute nodes.
The sizer measures 100% reads OR 100% writes, but most systems serve a combination of the two. Assume reads and writes draw on the same resources: if a system is serving 70% of its maximum reads, it can only serve 30% of its maximum writes. For example, if the sizer reports 100GBps reads and 50GBps writes, the cluster can support 70GBps reads and 15GBps writes (see the sketch after this list).
Create and getattr are good leading indicators for metadata write and read operations, respectively.
Additional metrics may be added over time.
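One way to track this shared budget is to express current read and write throughput as fractions of the sizer maxima and add them together. The PromQL below is a minimal sketch, using the metric names from the server-side queries in the next section and assuming the example sizer figures above (100GBps reads, 50GBps writes) with the *_size_sum counters reported in bytes; substitute the maxima for your environment. A result approaching 1 means the cluster is near its combined read/write ceiling.
# Combined read+write utilization as a fraction of the sizer maxima.
# 100e9 and 50e9 are placeholders for 100GBps reads and 50GBps writes.
  rate(vast_cluster_metrics_ProtoMetrics_read_size_sum{proto_name="ProtoAll"}[5m])  / 100e9
+ rate(vast_cluster_metrics_ProtoMetrics_write_size_sum{proto_name="ProtoAll"}[5m]) / 50e9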
Measurement Methodology
Latency should be measured both on the client and on the server for several reasons:
VAST owns the server. The customer owns the client and network. The end-user owns the workload. We need visibility across the board to isolate where the problem is.
There are subtle cases where VAST is reporting great latency, but the problem is still on VAST (the network stack queues data before the VAST application sees it).
Server Side Measurement
Write throughput:
rate(vast_cluster_metrics_ProtoMetrics_write_size_sum{proto_name="ProtoAll"}[5m])
Write latency:
rate(vast_cluster_metrics_ProtoMetrics_write_latency_sum{proto_name="ProtoAll"}[5m]) / rate(vast_cluster_metrics_ProtoMetrics_write_latency_count{proto_name="ProtoAll"}[5m])
Read throughput:
rate(vast_cluster_metrics_ProtoMetrics_read_size_sum{proto_name="ProtoAll"}[5m])
Read latency:
rate(vast_cluster_metrics_ProtoMetrics_read_latency_sum{proto_name="ProtoAll"}[5m]) / rate(vast_cluster_metrics_ProtoMetrics_read_latency_count{proto_name="ProtoAll"}[5m])
Create latency:
rate(vast_cluster_metrics_NfsMetrics_nfs_create_latency_sum[5m]) / rate(vast_cluster_metrics_NfsMetrics_nfs_create_latency_count[5m])
Get attribute latency:
rate(vast_cluster_metrics_NfsMetrics_nfs_getattr_latency_sum[5m]) / rate(vast_cluster_metrics_NfsMetrics_nfs_getattr_latency_count[5m])
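These queries can be checked directly against the SLO table. The expression below is a minimal sketch for the read-latency SLO, assuming the latency counters are exported in microseconds (verify the unit on your cluster), that the latency and size series share the same labels (they come from the same exporter), and reusing the 100GBps sizer maximum from the earlier example; it returns a series only when average read latency exceeds 4ms while read throughput is at or below 50% of the maximum.
# Average read latency over 5m compared against the 4ms SLO.
# 4000 assumes microsecond latency counters; 100e9 is the example sizer read maximum.
(
    rate(vast_cluster_metrics_ProtoMetrics_read_latency_sum{proto_name="ProtoAll"}[5m])
  / rate(vast_cluster_metrics_ProtoMetrics_read_latency_count{proto_name="ProtoAll"}[5m])
) > 4000
and
rate(vast_cluster_metrics_ProtoMetrics_read_size_sum{proto_name="ProtoAll"}[5m]) <= 0.5 * 100e9
The write, create, and getattr SLOs can be expressed the same way with their own thresholds.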
Client Side Measurement
Vnfs-Collector
VAST’s vNFS Collector is an open-source tool that provides deep visibility into NFS workloads by capturing detailed I/O metrics for every NFS mount. It tracks per-operation counters for all key NFSv3 and NFSv4 commands, including READ, WRITE, LOOKUP, and DELETE, along with contextual metadata such as mount points, process names, user IDs, and environment variables like SLURM_JOB_ID. This rich dataset enables accurate workload profiling and performance tuning.
The collector supports flexible data forwarding, with local JSON logging and seamless integration into Prometheus (for Grafana dashboards), Kafka (for event-driven pipelines), and the VAST DataBase (for historical analytics via Trino, Spark, and Grafana).
For more information, visit: https://www.vastdata.com/blog/tracking-job-ids-enhancing-observability-and-efficiency-in-large-scale
Ensure the collector is running on the client fleet.
$ sudo systemctl status vnfs-collector
● vnfs-collector.service - NFS Collector tracking per application/mount stats using ebpf
Loaded: loaded (/usr/lib/systemd/system/vnfs-collector.service; enabled; preset: enabled)
Active: active (running) since Fri 2025-06-20 06:02:58 UTC; 1 week 3 days ago
Main PID: 265771 (vnfs-collector)
Tasks: 4 (limit: 9440)
Memory: 142.8M (peak: 144.6M)
CPU: 8min 8.195s
CGroup: /system.slice/vnfs-collector.service
└─265771 /opt/vnfs-collector/src/venv/bin/python3 /usr/local/bin/vnfs-collector -C /opt/vnfs-collector/nfsops.yaml
Jun 20 06:02:58 blake-1 systemd[1]: Started vnfs-collector.service - NFS Collector tracking per application/mount stats using ebpf.
Jun 20 06:03:04 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:04 | nfsops | BPF version: 0.29.1
Jun 20 06:03:04 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:04 | nfsops | VNFS Collector<1.4-0> initialization
Jun 20 06:03:04 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:04 | nfsops | Configuration options: drivers=['file', 'prometheus'], interval=10, vaccum=600, envs=None, ebpf=False, squash-pid=True, tag-filter=None, anon-fields=None, c>
Jun 20 06:03:07 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:07 | file | Setting up driver.
Jun 20 06:03:07 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:07 | file | FileDriver(path=/opt/vnfs-collector/vnfs-collector.log, max_size_mb=200, max_backups=5) has been initialized.
Jun 20 06:03:07 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:07 | prometheus | Setting up driver.
Jun 20 06:03:07 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:07 | prometheus | PrometheusDriver(prom_exporter_host=0.0.0.0, prom_exporter_port=9100, buffer_size=1000) has been initialized.
Jun 20 06:03:13 blake-1 vnfs-collector[265771]: INFO: 2025-06-20 06:03:13 | nfsops | All good! StatsCollector has been attached.
$
Ensure the config is laid out as you'd like. This example is very basic, for Prometheus scraping. The `interval` key controls how often the metrics are updated (every 10 seconds here), and the `vaccum` key sets how often the application cleans up its internal process and mount information.
More details are in the README: https://github.com/vast-data/vnfs-collector/blob/main/README.md
$ cat /opt/vnfs-collector/nfsops.yaml
# envs:
#   - JOBID
#   - SCHEDID
#   - POD_NAME
interval: 10
vaccum: 600
# screen: {}
file:
  samples_path: /opt/vnfs-collector/vnfs-collector.log
  max_backups: 5
  max_size_mb: 200
# vdb:
#   db_endpoint: <endpoint>
#   db_access_key: <access_key>
#   db_secret_key: <secret_key>
#   db_bucket: <bucket>
#   db_schema: <schema>
#   db_table: <table>
prometheus:
  prom_exporter_host: 0.0.0.0
  prom_exporter_port: 9100
# kafka:
#   bootstrap_servers: <broker1:9093,broker2:9093>
#   topic: vnfs-collector
#   sasl_username: <username>
#   sasl_password: <password>
#   security_protocol: SASL_PLAINTEXT
$
Ensure Prometheus (or an equivalent) is scraping the metrics. The config below is just an example; metrics could instead be pushed to Kafka, for instance.
global:
  scrape_interval: 60s
scrape_configs:
  - job_name: 'vnfs-collector'
    static_configs:
      - targets: ['vnfs-collector:9100']
You'll also have to add this Prometheus instance as a data source in Grafana (an important step).
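Once the scrape job is in place, a quick sanity check (a minimal sketch; the job name comes from the scrape config above) is to confirm Prometheus can reach the collector endpoints:
# 1 = Prometheus is successfully scraping the vnfs-collector exporter, 0 = the scrape is failing.
up{job="vnfs-collector"}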
To graph just the create and getattr procedure latency and errors:
vnfs_create_duration{COMM=~"elbencho|fio"}
vnfs_getattr_duration{COMM=~"elbencho|fio"}
rate(vnfs_create_errors{COMM=~"elbencho|fio"}[1m])
rate(vnfs_getattr_errors{COMM=~"elbencho|fio"}[1m])
To graph client-side bandwidth, for example:
vnfs_read_bytes{COMM=~"elbencho|fio"}
vnfs_write_bytes{COMM=~"elbencho|fio"}
To graph create and getattr latency:
quantile(.9, rate(node_mountstats_nfs_operations_response_time_seconds_total{operation=~"GETATTR|CREATE"}[5m])) by (cluster, cluster_org, operation, zone, node, mountaddr)
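The quantile above trends how much NFS response time accrues per second across mounts. For a per-operation average that can be compared directly against the millisecond SLOs in the table, the sketch below assumes node_exporter's mountstats collector is enabled on the clients and that the matching request counter (node_mountstats_nfs_operations_requests_total) is exported alongside the response-time counter used above:
# Average seconds per GETATTR/CREATE call over 5m, per client mount.
  rate(node_mountstats_nfs_operations_response_time_seconds_total{operation=~"GETATTR|CREATE"}[5m])
/ rate(node_mountstats_nfs_operations_requests_total{operation=~"GETATTR|CREATE"}[5m])
Comparing this client-side figure with the server-side create and getattr latency queries above helps isolate whether extra latency is being added by the network or the client stack.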
Troubleshooting Flow
User is reporting “slowness”:
If the client metrics are NOT showing a breach of SLOs:
Debug ‘application slowness’ (see Debugging Application Slowness below); engage VAST for assistance if needed.
If the client metrics ARE showing a breach of SLOs:
If the metrics on the server are also breaching SLOs, it’s a VAST issue.
Otherwise, if tshark on the CNode shows the SLOs being breached, it’s still a VAST issue (for example, the network stack queuing data before the VAST application sees it).
Otherwise, it could be specific to a client or segment of the network:
Is the client CPU saturated?
Are other clients unaffected? If so, compare network paths.
Debugging Application Slowness
Using strace
For more strace-related troubleshooting, see Troubleshooting NFS Performance Using strace.
[vastdata@MetaSpaceWorker1 pandas]$ strace -f -c git clone https://github.com/pandas-dev/pandas.git
Cloning into 'pandas'...
strace: Process 116322 attached
strace: Process 116323 attached
strace: Process 116324 attached
strace: Process 116325 attached
strace: Process 116326 attached
remote: Enumerating objects: 407818, done.
remote: Counting objects: 100% (283/283), done.
remote: Compressing objects: 100% (225/225), done.
remote: Total 407818 (delta 187), reused 58 (delta 58), pack-reused 407535 (from 4)
Receiving objects: 100% (407818/407818), 360.63 MiB | 3.75 MiB/s, done.
strace: Process 116329 attached
strace: Process 116330 attached
strace: Process 116331 attached
Resolving deltas: 100% (342915/342915), done.
strace: Process 116334 attached
Updating files: 100% (2629/2629), done.
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
52.20 12.154686 3038671 4 wait4
23.69 5.516001 11 462246 75050 futex
10.33 2.404826 3 750215 27 read
9.12 2.124073 4 507410 pread64
3.63 0.844071 2 419166 write
0.49 0.113141 28 3996 1138 openat
0.25 0.057483 12 4605 3868 newfstatat
0.09 0.020865 7 2810 fstat
0.05 0.012017 1 8768 8 rt_sigaction
0.05 0.010589 3 2914 close
0.03 0.006984 24 283 2 mkdir
.... snip ....
0.00 0.000012 2 6 getrandom
0.00 0.000009 4 2 dup
0.00 0.000009 1 5 set_tid_address
0.00 0.000002 2 1 getpeername
0.00 0.000000 0 6 dup2
0.00 0.000000 0 3 link
------ ----------- ----------- --------- --------- ----------------
100.00 23.284532 10 2165567 80211 total
[vastdata@MetaSpaceWorker1 pandas]$
Looking into an existing application:
strace -f -c -p <pid>
Using tshark or Packet Captures
For more information on using tshark to troubleshoot NFS performance, see Troubleshooting NFS Performance Using tshark / pcap.
Installing tshark on a VAST CNode:
$ sudo yum install wireshark-cli -y
Capturing NFS traffic (port 2049):
$ sudo tcpdump -i any "port 2049" -w nfs-capture.pcap
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
^C418 packets captured
434 packets received by filter
0 packets dropped by kernel
[vastdata@MetaSpaceWorker1 pandas]$ tshark -r nfs-capture.pcap -z rpc,srt,100003,3 -q
===================================================================
RPC SRT Statistics:
Filter: nfs.procedure_v3
Index Procedure Calls Min SRT Max SRT Avg SRT Sum SRT
1 GETATTR 4 0.000375 0.000556 0.000478 0.001913
2 SETATTR 2 0.000227 0.001355 0.000791 0.001582
3 LOOKUP 6 0.000308 0.000620 0.000438 0.002625
4 ACCESS 2 0.000196 0.000388 0.000292 0.000584
7 WRITE 13 0.001457 0.005013 0.002347 0.030505
8 CREATE 6 0.001265 0.002410 0.001742 0.010449
==================================================================
$
You can also run tshark live, without using a capture file. Replace ens33 with your active network interface:
[vastdata@MetaSpaceWorker1 pandas]$ sudo tshark -i ens33 -z rpc,srt,100003,3 -q
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ens33'
^C^C109 packets dropped from ens33
191031 packets captured
===================================================================
RPC SRT Statistics:
Filter: nfs.procedure_v3
Index Procedure Calls Min SRT Max SRT Avg SRT Sum SRT
1 GETATTR 64 0.000110 0.000830 0.000491 0.031410
3 LOOKUP 74 0.000129 0.001219 0.000416 0.030801
4 ACCESS 1 0.000111 0.000111 0.000111 0.000111
7 WRITE 333 0.000510 0.010482 0.001789 0.595599
8 CREATE 69 0.001303 0.005677 0.002054 0.141704
==================================================================
$
Using eBPF tools
https://github.com/iovisor/bcc
sudo dnf install bcc-tools
Here’s an example using nfsslower
[vastdata@MetaSpaceWorker1 pandas]$ sudo /usr/share/bcc/tools/nfsslower 1
Tracing NFS operations that are slower than 1 ms... Ctrl-C to quit
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
07:52:44 python3 77161 W 2402731 0 1.09 f000000384765.dat
07:52:45 python3 77162 W 2763137 0 1.22 f000000404635.dat
07:52:47 python3 77160 W 2825855 0 1.25 f000000502147.dat
07:52:47 python3 77153 W 3069437 0 1.28 f000000466010.dat
07:52:49 python3 77170 W 2794281 0 1.20 f000000429451.dat
07:52:52 python3 77169 W 2546840 0 1.07 f000000432319.dat
^C[vastdata@MetaSpaceWorker1 pandas]$
[vastdata@MetaSpaceWorker1 pandas]$
Identifying Virtual IPs Used By a Client
$ sudo vastnfs-ctl rpc-clients /mnt/nfs | grep state
172.27.216.2, state: CONNECTED BOUND
172.27.216.3, state: CONNECTED BOUND
172.27.216.4, state: CONNECTED BOUND
172.27.216.5, state: CONNECTED BOUND
172.27.216.6, state: CONNECTED BOUND
172.27.216.7, state: CONNECTED BOUND
172.27.216.8, state: CONNECTED BOUND
172.27.216.9, state: CONNECTED BOUND
172.27.216.10, state: CONNECTED BOUND
172.27.216.11, state: CONNECTED BOUND
172.27.216.12, state: CONNECTED BOUND
172.27.216.13, state: CONNECTED BOUND
172.27.216.14, state: CONNECTED BOUND
172.27.216.15, state: CONNECTED BOUND
172.27.216.16, state: CONNECTED BOUND
172.27.216.17, state: CONNECTED BOUND
The output of the tshark and vastnfs-ctl tools can be combined to see the latency of operations on a specific VIP:
$ tshark -r capture.pcap -z rpc,srt,100003,3,ip.addr==172.27.216.2 -q
Additional Data To Collect
sudo netstat -nap | grep ESTA
Escalating To VAST
Collect the results generated by the commands above.
Open a case with VAST Support and collect a support bundle while the issue is occurring.