Troubleshooting NFS Performance Using ebpf Tools

Prev Next

Overview

This guide provides instructions for diagnosing NFS client performance issues using eBPF tools from the BCC (BPF Compiler Collection) toolkit. These tools provide real-time visibility into NFS operations without requiring application changes or system restarts.

Prerequisites

  • Root access on the client system.

  • BCC tools installed (apt-get install bpfcc-tools or yum install bcc-tools).

  • Linux kernel 4.1+ (newer kernels have better eBPF support).

Quick Diagnosis Workflow

Step 1: Check Overall NFS Operation Latency

Tool: nfsslower

# Show NFS operations slower than 10ms (default)
sudo nfsslower

# Show NFS operations slower than 1ms
sudo nfsslower 1

# Show NFS operations slower than 1ms with CSV output

Example:

$ sudo /usr/share/bcc/tools/nfsslower -j 1
ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE
805100595998,b'python3',77159,W,2665025,0,1079,b'f000000189903.dat'
805103189703,b'python3',77162,W,2178908,0,4069,b'f000000160747.dat'
805104571360,b'python3',77155,W,2837177,0,1213,b'f000000293438.dat'
805106258905,b'python3',77172,W,3114077,0,1431,b'f000000193137.dat'
805107286939,b'python3',77158,W,2617788,0,1125,b'f000000264222.dat'
805107693680,b'python3',77142,W,2768168,0,1193,b'f000000259502.dat'
805108374832,b'python3',77145,W,2840249,0,1194,b'f000000246017.dat'
805108716805,b'python3',77152,W,3040844,0,7169,b'f000000184216.dat'
805108748857,b'python3',77160,W,2950676,0,1220,b'f000000258253.dat'
805109104171,b'python3',77171,W,2796053,0,9722,b'f000000250545.dat'
^C $
$
$ sudo /usr/share/bcc/tools/nfsslower 1
Tracing NFS operations that are slower than 1 ms... Ctrl-C to quit
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
23:30:13 python3        77157  W 2921397 0           1.08 f000000267036.dat
23:30:14 python3        77143  W 3040384 0           1.18 f000000275327.dat
23:30:16 python3        77159  W 2862232 0           1.26 f000000189908.dat
23:30:18 python3        77149  W 2980817 0           1.24 f000000285363.dat
23:30:19 python3        77145  W 2682881 0           1.10 f000000246021.dat
23:30:21 python3        77152  W 2938159 0           1.14 f000000184219.dat
23:30:23 python3        77147  W 2763934 0           1.16 f000000224322.dat
23:30:25 python3        77168  W 2601147 0           1.05 f000000279048.dat
23:30:25 python3        77142  W 3073096 0           1.25 f000000259506.dat
^C $

What to look for:

  • Normal: Most operations complete in <1ms for cached data, <10ms for network operations.

  • Warning: Regular operations taking 50-100ms.

  • Bad: Operations consistently over 100ms or occasional multi-second delays.

Step 2: Analyze File System Activity Patterns

Tool: vfsstat

# Show VFS statistics with 1-second intervals
sudo vfsstat 1

# Show 5 samples then exit
sudo vfsstat 1 5

Example output:

$ sudo /usr/share/bcc/tools/vfsstat 1
TIME         READ/s  WRITE/s  FSYNC/s   OPEN/s CREATE/s UNLINK/s  MKDIR/s  RMDIR/s
23:34:12:      3651       77        1        7        0        0        0        0
23:34:13:        30       56        0        1        0        0        0        0
23:34:14:        55       62        0       18        0        0        0        0
23:34:15:        24       47        0        1        0        0        0        0
23:34:16:        46       72        0        1        0        0        0        0
23:34:17:        28       53        0        1        0        0        0        0
23:34:18:        45       72        0        1        0        0        0        0
23:34:19:        25       52        0        4        0        0        0        0
23:34:20:        40       63        0        3        0        0        0        0
....

$ sudo /usr/share/bcc/tools/vfsstat 1 5
TIME         READ/s  WRITE/s  FSYNC/s   OPEN/s CREATE/s UNLINK/s  MKDIR/s  RMDIR/s
23:35:01:      3655       57        1       21        0        0        0        0
23:35:02:        45       70        0        1        0        0        0        0
23:35:03:        26       52        0        3        0        0        0        0
23:35:04:        56       64        0       17        0        0        0        0
23:35:05:        27       54        0        2        0        0        0        0

Step 3: Identify Slow File Operations

Tool: fileslower

# Trace file operations slower than 10ms
sudo fileslower 10

# Include file type information
sudo fileslower -t 10

# Filter for specific process
sudo fileslower -p <PID> 10

What to look for:

  • Normal: Occasional slow operations during cache misses.

  • Warning: Frequent slow operations on the same files.

  • Bad: All operations are slow, indicating systemic issues.

Example:

$ sudo /usr/share/bcc/tools/fileslower 1
Tracing sync read/writes slower than 1 ms
TIME(s)  COMM           TID    D BYTES   LAT(ms) FILENAME
0.492    python3        77141  W 2906967    1.14 f000000219886.dat
0.694    python3        77171  W 2959081    1.30 f000000250571.dat
1.230    python3        77147  W 2543445    1.25 f000000224344.dat
2.534    python3        77154  W 2507605    1.41 f000000283097.dat
4.481    python3        77156  W 2380353    1.21 f000000298343.dat
5.626    python3        77162  W 2872629    7.31 f000000160773.dat
5.803    python3        77145  W 2697660    1.73 f000000246044.dat

Step 4: Check Network Health (Critical for NFS)

Tool: tcpretrans

# Monitor TCP retransmissions
sudo tcpretrans

# Include TCP state
sudo tcpretrans -s

# Filter for NFS server IP
sudo tcpretrans -i <NFS_SERVER_IP>

What to look for:

  • Normal: Occasional retransmissions (<0.1% of packets).

  • Warning: Regular retransmissions (0.1-1%).

  • Bad: Frequent retransmissions (>1%), indicating network problems.

Example:

$ sudo /usr/share/bcc/tools/tcpretrans
Tracing retransmits ... Hit Ctrl-C to end
TIME     PID     IP LADDR:LPORT          T> RADDR:RPORT          STATE
23:39:17 77143   4  172.200.13.56:937    R> 172.200.204.4:2049   ESTABLISHED
23:39:17 77161   4  172.200.13.56:880    R> 172.200.204.4:2049   ESTABLISHED
23:39:20 77172   4  172.200.13.56:937    R> 172.200.204.4:2049   ESTABLISHED
23:40:11 77154   4  172.200.13.56:937    R> 172.200.204.4:2049   ESTABLISHED
23:40:38 77149   4  172.200.13.56:957    R> 172.200.204.4:2049   ESTABLISHED
...
$ sudo /usr/share/bcc/tools/tcpretrans -s
Tracing retransmits ... Hit Ctrl-C to end
TIME     PID     IP LADDR:LPORT          T> RADDR:RPORT          STATE        SEQ
23:42:16 77156   4  172.200.13.56:880    R> 172.200.204.4:2049   ESTABLISHED  4014085456
23:44:09 77161   4  172.200.13.56:937    R> 172.200.204.4:2049   ESTABLISHED  3479636909
...

Tool: tcptop

# Show TCP throughput by connection
sudo tcptop

# Update every 1 seconds, stop after 5 samples
sudo tcptop 1 5 

# Filter for specific port (2049 for NFS, and -C to not clear the screen)
sudo tcptop -p 2049 -C

Step 5: Analyze Metadata Operations

Tool: statsnoop

# Trace all stat() calls
sudo statsnoop

# Include timestamps
sudo statsnoop -t

# Filter for failed stat calls
sudo statsnoop -x

What to look for:

  • Normal: Stat calls complete quickly (<1ms).

  • Warning: High volume of stat calls or calls taking >10ms.

  • Bad: Applications doing repeated stats on the same files (indicates poor caching).

Quick Reference Card

Tool

Purpose

Key Flag

Normal Range

nfsslower

NFS operation latency

threshold ms
(10ms default)

<5ms

vfsstat

VFS layer statistics

interval

Displays the current operation mix

filesslower

Slow file operations

threshold ms
(10ms default)

<5ms

tcpretrans

Network retransmissions

-s (state)

-i (ip of vast ip)

<0.1%

tcptop

TCP Throughput

-p 2049

Consistent flow

statssnoop

stat() calls

-t (timestamp)

<1ms latency

Remember: Always establish baseline performance during normal operations to better identify anomalies during issues.