Troubleshooting NFS Performance Using tshark / pcap

Prev Next

Overview

This guide provides comprehensive techniques for analyzing NFS performance issues using tshark and tcpdump. These tools help identify bottlenecks, latency issues, and problematic operations in NFS environments.

Basic Capture and Analysis

Initial Capture

# Capture all NFS traffic (default port 2049)
sudo tcpdump -i any "port 2049" -w capture.pcap

# Capture with size limit and rotation
sudo tcpdump -i any "port 2049" -w capture.pcap -C 100 -W 10
# -C 100: Max file size 100MB
# -W 10: Keep 10 files in rotation

Basic RPC Service Response Time (SRT) Analysis

# NFSv3 analysis
tshark -r capture.pcap -z rpc,srt,100003,3 -q

# NFSv4 analysis
tshark -r capture.pcap -z rpc,srt,100003,4 -q

Example:

$ sudo tshark -r capture.pcap -z rpc,srt,100003,3 -q
Running as user "root" and group "root". This could be dangerous.

===================================================================
RPC SRT Statistics:
Filter: nfs.procedure_v3
Index  Procedure              Calls    Min SRT    Max SRT    Avg SRT    Sum SRT
    1  GETATTR                     6   0.000479   0.000857   0.000607   0.003641
    2  SETATTR                     1   0.000210   0.000210   0.000210   0.000210
    3  LOOKUP                     20   0.000235   0.000689   0.000496   0.009927
    4  ACCESS                      1   0.000326   0.000326   0.000326   0.000326
    7  WRITE                      43   0.001164   0.003525   0.002044   0.087903
    8  CREATE                     20   0.001284   0.003368   0.001761   0.035220
==================================================================
[vastdata@MetaSpaceWorker1 pandas]$

Advanced Analysis Techniques

1. Per-Client Analysis

# Extract unique client IPs
tshark -r capture.pcap -T fields -e ip.src | sort | uniq > clients.txt

# Analyze performance per client
for client in $(cat clients.txt); do
    echo "=== Client: $client ==="
    tshark -r capture.pcap -Y "ip.src == $client" -z rpc,srt,100003,3 -q
done

2. Time-Based Analysis

# Split capture into time windows (e.g., 60-second intervals)
editcap -i 60 capture.pcap split.pcap

# Analyze each time window
for file in split_*.pcap; do
    echo "=== Analyzing $file ==="
    tshark -r $file -z rpc,srt,100003,3 -q
done

Example:

$ editcap -i 2 capture.pcap split.pcap
$ ls -lah
total 74M
drwxrwxrwx. 2    78910 vastdata 4.0K Jun 28 22:40 .
drwxr-xr-x. 2    78910     2000 4.0K Jun 28 08:17 ..
-rwx------. 1 root     root      37M Jun 28 22:35 capture.pcap
-rwx------. 1 vastdata vastdata   28 Jun 28 22:39 clients.txt
-rwx------. 1 vastdata vastdata 3.8M Jun 28 22:40 split_00000_20250628223541.pcap
-rwx------. 1 vastdata vastdata 6.6M Jun 28 22:40 split_00001_20250628223544.pcap
-rwx------. 1 vastdata vastdata  13M Jun 28 22:40 split_00002_20250628223545.pcap
-rwx------. 1 vastdata vastdata  14M Jun 28 22:40 split_00003_20250628223547.pcap
$ for file in split_*.pcap; do echo "=== Analyzing $file ===" ; tshark -r $file -z rpc,srt,100003,3 -q ; done
=== Analyzing split_00000_20250628223541.pcap ===

===================================================================
RPC SRT Statistics:
Filter: nfs.procedure_v3
Index  Procedure              Calls    Min SRT    Max SRT    Avg SRT    Sum SRT
    1  GETATTR                     2   0.000506   0.000857   0.000682   0.001363
    2  SETATTR                     1   0.000210   0.000210   0.000210   0.000210
    3  LOOKUP                      4   0.000235   0.000689   0.000521   0.002083
    4  ACCESS                      1   0.000326   0.000326   0.000326   0.000326
    7  WRITE                       6   0.001164   0.002074   0.001588   0.009530
    8  CREATE                      4   0.001500   0.002014   0.001794   0.007175
==================================================================
=== Analyzing split_00001_20250628223544.pcap ===

===================================================================
RPC SRT Statistics:
Filter: nfs.procedure_v3
Index  Procedure              Calls    Min SRT    Max SRT    Avg SRT    Sum SRT
    1  GETATTR                     2   0.000479   0.000715   0.000597   0.001194
    3  LOOKUP                      3   0.000331   0.000545   0.000456   0.001368
    7  WRITE                       8   0.001349   0.003525   0.002489   0.019909
    8  CREATE                      3   0.001487   0.001676   0.001561   0.004683
==================================================================
=== Analyzing split_00002_20250628223545.pcap ===

===================================================================
RPC SRT Statistics:
Filter: nfs.procedure_v3
Index  Procedure              Calls    Min SRT    Max SRT    Avg SRT    Sum SRT
    1  GETATTR                     2   0.000524   0.000560   0.000542   0.001084
    3  LOOKUP                      7   0.000408   0.000589   0.000540   0.003779
    7  WRITE                      16   0.001447   0.002854   0.002111   0.033779
    8  CREATE                      7   0.001411   0.003368   0.001907   0.013350
==================================================================
=== Analyzing split_00003_20250628223547.pcap ===

===================================================================
RPC SRT Statistics:
Filter: nfs.procedure_v3
Index  Procedure              Calls    Min SRT    Max SRT    Avg SRT    Sum SRT
    3  LOOKUP                      6   0.000375   0.000524   0.000450   0.002697
    7  WRITE                      13   0.001433   0.002790   0.001899   0.024685
    8  CREATE                      6   0.001284   0.002185   0.001669   0.010012
==================================================================
$

3. Operation-Specific Deep Dive

# Focus on slow operations (e.g., operations taking >5ms)
tshark -r capture.pcap -Y "rpc.time > 0.005" -T fields \
    -e frame.time -e ip.src -e ip.dst -e rpc.procedure -e rpc.time

# Analyze specific operations (e.g., WRITE operations)
tshark -r capture.pcap -Y "nfs.procedure_v3 == 7" \
    -T fields -e rpc.time \
    | awk '{if ($1 != "") {sum+=$1; count++}} END {print "Avg WRITE latency:", sum/count*1000, "ms"}'

# If you need write sizes too, extract them separately
tshark -r capture.pcap -Y "nfs.procedure_v3 == 7 and nfs.count3" \
    -T fields -e nfs.count3 \
    | awk '{sum+=$1; count++} END {print "Avg WRITE size:", sum/count, "bytes"}'

Example:

[vastdata@MetaSpaceWorker1 pandas]$ tshark -r capture.pcap -Y "rpc.time > 0.002" -T fields -e frame.time -e ip.src -e ip.dst -e rpc.procedure -e rpc.time
Jun 28, 2025 22:35:43.152428000 MST     172.200.204.4   172.200.13.56   8       0.002014000
Jun 28, 2025 22:35:43.156568000 MST     172.200.204.4   172.200.13.56   7       0.002074000
Jun 28, 2025 22:35:44.437504000 MST     172.200.204.4   172.200.13.56   7       0.002314000
Jun 28, 2025 22:35:44.438338000 MST     172.200.204.4   172.200.13.56   7       0.002865000
Jun 28, 2025 22:35:44.688833000 MST     172.200.204.4   172.200.13.56   7       0.003491000
Jun 28, 2025 22:35:44.689515000 MST     172.200.204.4   172.200.13.56   7       0.003525000
Jun 28, 2025 22:35:45.014094000 MST     172.200.204.4   172.200.13.56   7       0.002762000
Jun 28, 2025 22:35:45.261662000 MST     172.200.204.4   172.200.13.56   7       0.002367000
Jun 28, 2025 22:35:45.265180000 MST     172.200.204.4   172.200.13.56   7       0.002714000
Jun 28, 2025 22:35:45.265602000 MST     172.200.204.4   172.200.13.56   7       0.002242000
Jun 28, 2025 22:35:45.443772000 MST     172.200.204.4   172.200.13.56   7       0.002455000
Jun 28, 2025 22:35:45.448950000 MST     172.200.204.4   172.200.13.56   7       0.002020000
Jun 28, 2025 22:35:46.063459000 MST     172.200.204.4   172.200.13.56   7       0.002854000
Jun 28, 2025 22:35:46.205786000 MST     172.200.204.4   172.200.13.56   8       0.003368000
Jun 28, 2025 22:35:46.806685000 MST     172.200.204.4   172.200.13.56   8       0.002269000
Jun 28, 2025 22:35:46.811204000 MST     172.200.204.4   172.200.13.56   7       0.002390000
Jun 28, 2025 22:35:47.158509000 MST     172.200.204.4   172.200.13.56   7       0.002472000
Jun 28, 2025 22:35:47.159744000 MST     172.200.204.4   172.200.13.56   7       0.002548000
Jun 28, 2025 22:35:48.093065000 MST     172.200.204.4   172.200.13.56   7       0.002093000
Jun 28, 2025 22:35:48.482774000 MST     172.200.204.4   172.200.13.56   7       0.002099000
Jun 28, 2025 22:35:48.903600000 MST     172.200.204.4   172.200.13.56   8       0.002185000
Jun 28, 2025 22:35:48.905586000 MST     172.200.204.4   172.200.13.56   8       0.002174000
Jun 28, 2025 22:35:48.910427000 MST     172.200.204.4   172.200.13.56   7       0.002790000
Jun 28, 2025 22:35:48.911325000 MST     172.200.204.4   172.200.13.56   7       0.002135000
Jun 28, 2025 22:35:48.913962000 MST     172.200.204.4   172.200.13.56   7       0.002353000

$ tshark -r capture.pcap -Y "nfs.procedure_v3 == 7" -T fields -e frame.time -e rpc.time -e nfs.count3     | awk '{sum+=$2; count++} END {print "Avg WRITE latency:", sum/count*1000, "ms"}'
Avg WRITE latency: 28000 ms
$

4. File Handle Analysis

# Extract file handles and their operation counts
tshark -r capture.pcap -T fields -e nfs.fh.hash -e nfs.procedure_v3 \
    | sort | uniq -c | sort -rn | head -20

# Track specific file handle performance
tshark -r capture.pcap -Y "nfs.fh.hash == 0x12345678" \
    -T fields -e frame.time -e rpc.procedure -e rpc.time

Example:

$ tshark -r capture.pcap -T fields -e nfs.fh.hash -e nfs.procedure_v3 \
    | sort | uniq -c | sort -rn | head -20
   1036
     43         7
     20         3
      6         1
      3 0xa801c973      7
      3 0x88ca7993      7
      3 0x5548932f      7
      3 0x53468be7      7
      3 0x4e8f3d35      7
      3 0x2a64906b      7
      2 0xfe88312d      7
      2 0xe4e452e0      7
      2 0xe05e698e      7
      2 0xcbe67073      7
      2 0x8defaa86      7
      2 0x88120cc1      7
      2 0x66137b8e      7
      2 0x55b0388d      7
      2 0x3b9a9f47      7
      2 0x2edd1be7      7
$ tshark -r capture.pcap -Y "nfs.fh.hash == 0xa801c973" \
    -T fields -e frame.time -e rpc.procedure -e rpc.time
Jun 28, 2025 22:35:45.256039000 MST     8       0.001610000
Jun 28, 2025 22:35:45.259295000 MST     7
Jun 28, 2025 22:35:45.262466000 MST     7
Jun 28, 2025 22:35:45.263360000 MST     7
$

5. Error Analysis

# Find all NFS errors
tshark -r capture.pcap -Y "nfs.status3 != 0" \
    -T fields -e frame.time -e ip.src -e nfs.procedure_v3 -e nfs.status3

# Common error codes:
# 2 = NOENT (No such file)
# 13 = ACCES (Permission denied)
# 28 = NOSPC (No space left)
# 70 = STALE (Stale file handle)
# See RFS1813 for full list
# https://www.ietf.org/rfc/rfc1813.txt

6. Latency Distribution Analysis

# Create latency histogram
tshark -r capture.pcap -T fields -e rpc.time | \
awk '{
    if ($1 < 0.001) bucket["<1ms"]++
    else if ($1 < 0.005) bucket["1-5ms"]++
    else if ($1 < 0.010) bucket["5-10ms"]++
    else if ($1 < 0.050) bucket["10-50ms"]++
    else bucket[">50ms"]++
} END {
    for (b in bucket) print b, bucket[b]
}' | sort

Example:

$ tshark -r capture.pcap -T fields -e rpc.time | \
awk '{
    if ($1 < 0.001) bucket["<1ms"]++
    else if ($1 < 0.005) bucket["1-5ms"]++
    else if ($1 < 0.010) bucket["5-10ms"]++
    else if ($1 < 0.050) bucket["10-50ms"]++
    else bucket[">50ms"]++
} END {
    for (b in bucket) print b, bucket[b]
}' | sort
1-5ms 63
<1ms 1176
$

7. Compound Operations Analysis (NFSv4)

# Analyze NFSv4 compound operations
tshark -r capture.pcap -Y "nfs" -T fields \
    -e nfs.procedure_v4 -e nfs.opcode -e rpc.time \
    | awk '{ops[$2]++; time[$2]+=$3} END {
        for (op in ops) print op, ops[op], time[op]/ops[op]*1000 "ms"
    }'

Performance Metrics Script

Create a comprehensive analysis script:

#!/bin/bash
# nfs_perf_analyze.sh

PCAP_FILE=$1
OUTPUT_DIR="nfs_analysis_$(date +%Y%m%d_%H%M%S)"

mkdir -p $OUTPUT_DIR

echo "Analyzing NFS performance from $PCAP_FILE..."

# Overall statistics
echo "=== Overall RPC Statistics ===" > $OUTPUT_DIR/overall_stats.txt
tshark -r $PCAP_FILE -z rpc,srt,100003,3 -q >> $OUTPUT_DIR/overall_stats.txt

# Top talkers
echo "=== Top Clients by Request Count ===" > $OUTPUT_DIR/top_clients.txt
tshark -r $PCAP_FILE -T fields -e ip.src | sort | uniq -c | \
    sort -rn | head -20 >> $OUTPUT_DIR/top_clients.txt

# Slow operations
echo "=== Operations Slower than 10ms ===" > $OUTPUT_DIR/slow_ops.txt
tshark -r $PCAP_FILE -Y "rpc.time > 0.010" -T fields \
    -e frame.time -e ip.src -e ip.dst -e nfs.procedure_v3 -e rpc.time \
    >> $OUTPUT_DIR/slow_ops.txt

# Error summary
echo "=== NFS Errors Summary ===" > $OUTPUT_DIR/errors.txt
tshark -r $PCAP_FILE -Y "nfs.status3 != 0" -T fields -e nfs.status3 | \
    sort | uniq -c | sort -rn >> $OUTPUT_DIR/errors.txt

# Latency percentiles
echo "=== Latency Percentiles ===" > $OUTPUT_DIR/latency_percentiles.txt
tshark -r $PCAP_FILE -T fields -e rpc.time | sort -n | \
awk '{
    arr[NR] = $1
} END {
    print "P50:", arr[int(NR*0.5)]*1000, "ms"
    print "P90:", arr[int(NR*0.9)]*1000, "ms"
    print "P95:", arr[int(NR*0.95)]*1000, "ms"
    print "P99:", arr[int(NR*0.99)]*1000, "ms"
    print "P99.9:", arr[int(NR*0.999)]*1000, "ms"
}' >> $OUTPUT_DIR/latency_percentiles.txt

echo "Analysis complete. Results saved in $OUTPUT_DIR/"

Example:

$ sudo bash ./nfs_perf_analyze.sh capture.pcap
Analyzing NFS performance from capture.pcap...
Analysis complete. Results saved in nfs_analysis_20250628_230245/
$ tree -ah nfs_analysis_20250628_230245/
nfs_analysis_20250628_230245/
├── [   37]  errors.txt
├── [   92]  latency_percentiles.txt
├── [  778]  overall_stats.txt
├── [   36]  slow_ops.txt
└── [   81]  top_clients.txt

0 directories, 5 files
$ cat nfs_analysis_20250628_230245/errors.txt
=== NFS Errors Summary ===
     20 2
$ cat nfs_analysis_20250628_230245/latency_percentiles.txt
=== Latency Percentiles ===
P50: 0 ms
P90: 0 ms
P95: 1.164 ms
P99: 2.353 ms
P99.9: 3.368 ms
$ cat nfs_analysis_20250628_230245/overall_stats.txt
=== Overall RPC Statistics ===

===================================================================
RPC SRT Statistics:
Filter: nfs.procedure_v3
Index  Procedure              Calls    Min SRT    Max SRT    Avg SRT    Sum SRT
    1  GETATTR                     6   0.000479   0.000857   0.000607   0.003641
    2  SETATTR                     1   0.000210   0.000210   0.000210   0.000210
    3  LOOKUP                     20   0.000235   0.000689   0.000496   0.009927
    4  ACCESS                      1   0.000326   0.000326   0.000326   0.000326
    7  WRITE                      43   0.001164   0.003525   0.002044   0.087903
    8  CREATE                     20   0.001284   0.003368   0.001761   0.035220
==================================================================
$ cat nfs_analysis_20250628_230245/slow_ops.txt
=== Operations Slower than 10ms ===
$ cat nfs_analysis_20250628_230245/top_clients.txt
=== Top Clients by Request Count ===
    896 172.200.13.56
    343 172.200.204.4
$

Real-time Monitoring

# Monitor NFS operations in real-time
$ sudo tshark -i any -f "port 2049" -Y "rpc.time" -T fields \
    -e frame.time_epoch -e ip.src -e nfs.procedure_v3 -e rpc.time | \
awk '{
    proc_name[1]="GETATTR"; proc_name[3]="LOOKUP"; proc_name[7]="WRITE";
    proc_name[6]="READ"; proc_name[8]="CREATE"; proc_name[12]="REMOVE";

    if ($3 && $4) {  # Only process if we have both procedure and time
        op = proc_name[$3] ? proc_name[$3] : "PROC_"$3;
        latency = $4 * 1000;  # Convert to ms

        printf "[%s] %s -> %s: %.2fms\n",
            strftime("%H:%M:%S", $1), $2, op, latency

        if (latency > 10) print "  *** HIGH LATENCY ***"
    }
}'
Capturing on 'any'
83 [23:07:41] 172.200.204.4 -> LOOKUP: 0.61ms
[23:07:41] 172.200.204.4 -> CREATE: 2.54ms
[23:07:41] 172.200.204.4 -> WRITE: 1.13ms
[23:07:41] 172.200.204.4 -> WRITE: 2.52ms
[23:07:41] 172.200.204.4 -> WRITE: 2.49ms
[23:07:41] 172.200.204.4 -> GETATTR: 0.56ms
[23:07:41] 172.200.204.4 -> LOOKUP: 0.47ms
[23:07:41] 172.200.204.4 -> LOOKUP: 0.39ms
[23:07:41] 172.200.204.4 -> CREATE: 1.67ms
[23:07:41] 172.200.204.4 -> WRITE: 1.46ms
[23:07:41] 172.200.204.4 -> WRITE: 2.41ms
[23:07:41] 172.200.204.4 -> GETATTR: 0.47ms
[23:07:41] 172.200.204.4 -> CREATE: 1.84ms
[23:07:41] 172.200.204.4 -> WRITE: 2.21ms
[23:07:41] 172.200.204.4 -> WRITE: 1.79ms
[23:07:41] 172.200.204.4 -> LOOKUP: 0.67ms
[23:07:41] 172.200.204.4 -> CREATE: 2.46ms
[23:07:41] 172.200.204.4 -> WRITE: 2.43ms
[23:07:41] 172.200.204.4 -> WRITE: 2.00ms
[23:07:41] 172.200.204.4 -> WRITE: 2.34ms
[23:07:42] 172.200.204.4 -> GETATTR: 0.88ms
[23:07:42] 172.200.204.4 -> GETATTR: 0.70ms
[23:07:42] 172.200.204.4 -> LOOKUP: 0.32ms
...

Visualization Preparation

# Export data for visualization (e.g., in Grafana or Excel)
tshark -r capture.pcap -T fields \
    -e frame.time_epoch -e nfs.procedure_v3 -e rpc.time \
    -E header=y -E separator=, > nfs_timeseries.csv

# Create per-minute aggregates
tshark -r capture.pcap -T fields -e frame.time_epoch -e rpc.time | \
awk '{
    minute = int($1/60)*60
    sum[minute] += $2
    count[minute]++
} END {
    for (m in sum) {
        avg_ms = (sum[m]/count[m])*1000
        print strftime("%Y-%m-%d %H:%M:00", m), avg_ms, count[m]
    }
}' | sort > nfs_per_minute.csv

Advanced Filtering Examples

# Complex filter: Find CREATE operations from specific subnet taking >5ms
tshark -r capture.pcap -Y "ip.src == 192.168.1.0/24 && nfs.procedure_v3 == 8 && rpc.time > 0.005"

# Find operations on large files (>1MB READs/WRITEs)
tshark -r capture.pcap -Y "(nfs.procedure_v3 == 6 || nfs.procedure_v3 == 7) && nfs.count3 > 1048576"

# Track specific user operations (requires uid mapping)
tshark -r capture.pcap -Y "nfs.uid3 == 1000"

This comprehensive guide provides the tools and techniques needed to effectively troubleshoot NFS performance issues using packet analysis.