Overview
This guide provides instructions for diagnosing NFS client performance issues using eBPF tools from the BCC (BPF Compiler Collection) toolkit. These tools provide real-time visibility into NFS operations without requiring application changes or system restarts.
Prerequisites
Root access on the client system.
BCC tools installed (
apt-get install bpfcc-toolsoryum install bcc-tools).Linux kernel 4.1+ (newer kernels have better eBPF support).
Quick Diagnosis Workflow
Step 1: Check Overall NFS Operation Latency
Tool: nfsslower
# Show NFS operations slower than 10ms (default)
sudo nfsslower
# Show NFS operations slower than 1ms
sudo nfsslower 1
# Show NFS operations slower than 1ms with CSV outputExample:
$ sudo /usr/share/bcc/tools/nfsslower -j 1
ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE
805100595998,b'python3',77159,W,2665025,0,1079,b'f000000189903.dat'
805103189703,b'python3',77162,W,2178908,0,4069,b'f000000160747.dat'
805104571360,b'python3',77155,W,2837177,0,1213,b'f000000293438.dat'
805106258905,b'python3',77172,W,3114077,0,1431,b'f000000193137.dat'
805107286939,b'python3',77158,W,2617788,0,1125,b'f000000264222.dat'
805107693680,b'python3',77142,W,2768168,0,1193,b'f000000259502.dat'
805108374832,b'python3',77145,W,2840249,0,1194,b'f000000246017.dat'
805108716805,b'python3',77152,W,3040844,0,7169,b'f000000184216.dat'
805108748857,b'python3',77160,W,2950676,0,1220,b'f000000258253.dat'
805109104171,b'python3',77171,W,2796053,0,9722,b'f000000250545.dat'
^C $
$
$ sudo /usr/share/bcc/tools/nfsslower 1
Tracing NFS operations that are slower than 1 ms... Ctrl-C to quit
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
23:30:13 python3 77157 W 2921397 0 1.08 f000000267036.dat
23:30:14 python3 77143 W 3040384 0 1.18 f000000275327.dat
23:30:16 python3 77159 W 2862232 0 1.26 f000000189908.dat
23:30:18 python3 77149 W 2980817 0 1.24 f000000285363.dat
23:30:19 python3 77145 W 2682881 0 1.10 f000000246021.dat
23:30:21 python3 77152 W 2938159 0 1.14 f000000184219.dat
23:30:23 python3 77147 W 2763934 0 1.16 f000000224322.dat
23:30:25 python3 77168 W 2601147 0 1.05 f000000279048.dat
23:30:25 python3 77142 W 3073096 0 1.25 f000000259506.dat
^C $What to look for:
Normal: Most operations complete in <1ms for cached data, <10ms for network operations.
Warning: Regular operations taking 50-100ms.
Bad: Operations consistently over 100ms or occasional multi-second delays.
Step 2: Analyze File System Activity Patterns
Tool: vfsstat
# Show VFS statistics with 1-second intervals
sudo vfsstat 1
# Show 5 samples then exit
sudo vfsstat 1 5Example output:
$ sudo /usr/share/bcc/tools/vfsstat 1
TIME READ/s WRITE/s FSYNC/s OPEN/s CREATE/s UNLINK/s MKDIR/s RMDIR/s
23:34:12: 3651 77 1 7 0 0 0 0
23:34:13: 30 56 0 1 0 0 0 0
23:34:14: 55 62 0 18 0 0 0 0
23:34:15: 24 47 0 1 0 0 0 0
23:34:16: 46 72 0 1 0 0 0 0
23:34:17: 28 53 0 1 0 0 0 0
23:34:18: 45 72 0 1 0 0 0 0
23:34:19: 25 52 0 4 0 0 0 0
23:34:20: 40 63 0 3 0 0 0 0
....
$ sudo /usr/share/bcc/tools/vfsstat 1 5
TIME READ/s WRITE/s FSYNC/s OPEN/s CREATE/s UNLINK/s MKDIR/s RMDIR/s
23:35:01: 3655 57 1 21 0 0 0 0
23:35:02: 45 70 0 1 0 0 0 0
23:35:03: 26 52 0 3 0 0 0 0
23:35:04: 56 64 0 17 0 0 0 0
23:35:05: 27 54 0 2 0 0 0 0Step 3: Identify Slow File Operations
Tool: fileslower
# Trace file operations slower than 10ms
sudo fileslower 10
# Include file type information
sudo fileslower -t 10
# Filter for specific process
sudo fileslower -p <PID> 10What to look for:
Normal: Occasional slow operations during cache misses.
Warning: Frequent slow operations on the same files.
Bad: All operations are slow, indicating systemic issues.
Example:
$ sudo /usr/share/bcc/tools/fileslower 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM TID D BYTES LAT(ms) FILENAME
0.492 python3 77141 W 2906967 1.14 f000000219886.dat
0.694 python3 77171 W 2959081 1.30 f000000250571.dat
1.230 python3 77147 W 2543445 1.25 f000000224344.dat
2.534 python3 77154 W 2507605 1.41 f000000283097.dat
4.481 python3 77156 W 2380353 1.21 f000000298343.dat
5.626 python3 77162 W 2872629 7.31 f000000160773.dat
5.803 python3 77145 W 2697660 1.73 f000000246044.datStep 4: Check Network Health (Critical for NFS)
Tool: tcpretrans
# Monitor TCP retransmissions
sudo tcpretrans
# Include TCP state
sudo tcpretrans -s
# Filter for NFS server IP
sudo tcpretrans -i <NFS_SERVER_IP>What to look for:
Normal: Occasional retransmissions (<0.1% of packets).
Warning: Regular retransmissions (0.1-1%).
Bad: Frequent retransmissions (>1%), indicating network problems.
Example:
$ sudo /usr/share/bcc/tools/tcpretrans
Tracing retransmits ... Hit Ctrl-C to end
TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
23:39:17 77143 4 172.200.13.56:937 R> 172.200.204.4:2049 ESTABLISHED
23:39:17 77161 4 172.200.13.56:880 R> 172.200.204.4:2049 ESTABLISHED
23:39:20 77172 4 172.200.13.56:937 R> 172.200.204.4:2049 ESTABLISHED
23:40:11 77154 4 172.200.13.56:937 R> 172.200.204.4:2049 ESTABLISHED
23:40:38 77149 4 172.200.13.56:957 R> 172.200.204.4:2049 ESTABLISHED
...
$ sudo /usr/share/bcc/tools/tcpretrans -s
Tracing retransmits ... Hit Ctrl-C to end
TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE SEQ
23:42:16 77156 4 172.200.13.56:880 R> 172.200.204.4:2049 ESTABLISHED 4014085456
23:44:09 77161 4 172.200.13.56:937 R> 172.200.204.4:2049 ESTABLISHED 3479636909
...Tool: tcptop
# Show TCP throughput by connection
sudo tcptop
# Update every 1 seconds, stop after 5 samples
sudo tcptop 1 5
# Filter for specific port (2049 for NFS, and -C to not clear the screen)
sudo tcptop -p 2049 -CStep 5: Analyze Metadata Operations
Tool: statsnoop
# Trace all stat() calls
sudo statsnoop
# Include timestamps
sudo statsnoop -t
# Filter for failed stat calls
sudo statsnoop -xWhat to look for:
Normal: Stat calls complete quickly (<1ms).
Warning: High volume of stat calls or calls taking >10ms.
Bad: Applications doing repeated stats on the same files (indicates poor caching).
Quick Reference Card
Tool | Purpose | Key Flag | Normal Range |
|---|---|---|---|
nfsslower | NFS operation latency | threshold ms | <5ms |
vfsstat | VFS layer statistics | interval | Displays the current operation mix |
filesslower | Slow file operations | threshold ms | <5ms |
tcpretrans | Network retransmissions | -s (state) -i (ip of vast ip) | <0.1% |
tcptop | TCP Throughput | -p 2049 | Consistent flow |
statssnoop | stat() calls | -t (timestamp) | <1ms latency |
Remember: Always establish baseline performance during normal operations to better identify anomalies during issues.