Troubleshooting NFS Performance Using strace

Prev Next

Using strace to Troubleshoot NFS Performance

Summary:

This guide shows how to use strace to diagnose NFS-related performance issues, interpret syscall patterns, and identify slow or inefficient I/O. Real-world examples and actionable tips are included.


How strace Works - A Technical Overview

What is strace?

strace is a diagnostic tool that intercepts and records system calls made by a process. It sits between your application and the Linux kernel, capturing every request your program makes to the operating system.

How Does It Work?

  1. Process Attachment via ptrace()

    Application → [strace intercepts] → Kernel → NFS/Storage
    • Uses the ptrace() system call to attach to a process.

    • Can attach to running processes (-p PID) or start new ones.

    • Acts like a debugger, pausing the process at each system call.

  2. System Call Interception

    • When your app makes a system call (like read(), write(), open()):

      • Process is paused

      • strace records the call details

      • Kernel executes the system call

      • strace records the result and timing

      • Process continues

        • This creates up to a 2x increase in context switches.

  3. Timing Measurement

    read(5, "data...", 4096) = 1024 <0.002145>
                                     ↑
                             Time spent in kernel
    • The <0.002145> shows time spent inside the kernel.

    • Includes time waiting for the NFS server response.

    • Critical for identifying slow operations.

What Gets Captured?

  1. System Call Information:

    • Call name: openat, read, write, etc.

    • Arguments: file descriptors, buffers, flags

    • Return value: success/failure codes

    • Timing: duration of the call

  2. Process Information:

    • PID (Process ID)

    • Timestamp when call was made

    • Thread information with -f flag

Why It's Perfect for NFS Debugging

NFS operations appear as regular file system calls to applications:

Application thinks:     strace sees:           What actually happens:
fd = open("file")  →   openat(...) = 3    →   NFS RPC: LOOKUP + OPEN
read(fd, buf, 4096) →  read(3,...) = 4096 →   NFS RPC: READ
close(fd)          →   close(3) = 0       →   NFS RPC: CLOSE

Performance Impact

  • Overhead: strace adds 10-100x slowdown to system calls

  • Why: Each syscall requires two context switches

  • Best practice: Use in test environments or briefly in production

Key strace Options for NFS

bash
strace -f -T -tt -o trace.out command
       │  │  │   └─ Write to file
       │  │  └─ Absolute timestamps
       │  └─ Show time in syscalls
       └─ Follow child processes

# Filter specific calls
strace -e trace=file     # All file operations
strace -e trace=open,read,write,close  # Specific calls

What the Timing Tells You

read(4, "PACK\0\0\0\2\0\69\n", 12) = 12 <0.234538>

This 234ms read could indicate:

  • Network latency to NFS server

  • Server disk I/O slowness

  • NFS protocol overhead (authentication, etc.)

  • Network congestion

Behind the Scenes - NFS + strace

When you read a file over NFS:

  1. App calls read() → strace starts timer

  2. Kernel VFS layer receives call

  3. NFS client creates RPC request

  4. Network round-trip to NFS server

  5. Server reads from disk

  6. Server sends response

  7. Kernel returns data to app → strace stops timer

The beauty of strace is it captures the total time, including all network and server delays!

📋 Quick Example

bash
# Fast local disk read
read(3, "data", 4096) = 4096 <0.000052>  # 52 microseconds

# Slow NFS read
read(3, "data", 4096) = 4096 <0.045123>  # 45 milliseconds

# Network timeout/retry
read(3, "data", 4096) = 4096 <5.234567>  # 5+ seconds!

This is why strace is so powerful for NFS troubleshooting - it immediately shows you which operations are slow, without needing to understand the complex NFS protocol underneath

When to Use strace

  • Application is reported as “slow,” but NFS/server metrics look healthy.

  • You need to pinpoint if slowness is due to system calls, file I/O, or NFS/network issues.


Basic strace Usage

Profile a command and get a syscall summary:

strace -f -c <your-command>
  • -f: Follows child processes.

  • -c: Outputs a summary of system call statistics.

Example Output:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 49.69   13.521276     3380319         4           wait4
 27.59    7.507152          16    466692     82293 futex
 10.31    2.805373           5    507523           pread64
  8.58    2.333674           3    735952        31 read
  3.03    0.825550           1    414052           write
  0.39    0.105212          26      3996      1138 openat
  ...
100.00   27.213860          12   2150774     87458 total

What to look for:

  • High time in read, write, or pread64 may indicate I/O bottlenecks.

  • High error counts in openat, read, or write could signal permission or NFS issues.

  • Compare with local filesystem runs to isolate NFS as the bottleneck.


Advanced: Trace Live NFS I/O with Timing

Attach to a running process and see syscall durations:

strace -T -f -e trace=read,write,open,close,stat,fstat,lstat,getdents -p <pid>
-T: Shows syscall duration (in seconds).
-e trace=...: Limits output to relevant file and I/O syscalls.

Example Output:

[pid 78795] read(5, ..., 7512) = 1370 <0.001357>
[pid 78795] write(4, ..., 8192) = 8192 <0.000015>
[pid 78795] read(5, ..., 14389) = 14389 <0.022307>

How to Interpret the Output

What is a "Large" read/write?

  • The third argument to read/write is the buffer size (e.g., 8192, 16384).

  • Typical NFS I/O sizes are 8K or 16K. Much larger sizes (>1MB) are rare unless tuned.

  • Very small sizes (e.g., 1-100 bytes) in a tight loop may indicate inefficient application I/O.

What is a "Slow" read/write?

  • The number in <...> at the end of each line is the syscall duration in seconds.

  • Most healthy NFS reads/writes complete in a few milliseconds (<0.001>).

  • Consistent durations above 50-100ms (<0.050> or higher) are worth investigating.

  • Example of a slightly slow call:

      [pid 78795] read(5, ..., 14389) = 14389 <0.022307>
  • Example of a problematic call (not in your logs, but for reference):

      read(5, ..., 16384) = 16384 <0.500000>

What to Look For

Symptom

What to Look For in Output

What it Means

Fast read/write

<0.000015>

Normal, healthy I/O

Occasional 10-20ms call

<0.022307>

Slightly slow, but not critical

Consistent >50ms calls

<0.050000> or higher

NFS/network/server may be slow

Very small I/O

read(..., 1)

Inefficient app I/O pattern


Practical Workflow

  1. Start your workload and capture its PID:

       sh -c 'echo $$; exec <your-command>' &
  2. Attach strace to the PID:

       strace -T -f -e trace=read,write,open,close,stat,fstat,lstat,getdents -p <pid>
  3. Real Example

    [vastdata@MetaSpaceWorker1 pandas]$ sh -c 'echo $$; exec git clone --quiet https://github.com/pandas-dev/pandas.git > /dev/null 2>&1' &
    [1] 113652
    [vastdata@MetaSpaceWorker1 pandas]$ 113652
    [vastdata@MetaSpaceWorker1 pandas]$ strace -T -tt -f -e trace=read,write,open,close,stat,fstat,lstat,getdents -p 113652
    strace: Process 113652 attached with 2 threads
    [pid 113656] 22:25:36.052817 read(5,  <unfinished ...>
    [pid 113652] 22:25:36.052937 read(6,  <unfinished ...>
    [pid 113656] 22:25:36.058008 <... read resumed>"\0243j\363|>\220i\217\272/\223\235M\340m\241\230\313gA\267\367J\361\205'd\361\372\336\235"..., 39598) = 1370 <0.005051>
    [pid 113656] 22:25:36.058091 read(5, "\277E\223_\260\353M<\31\6\0\262?4c\346\22\210\267\26x\234{\314\363\230gB\240\210\327\322"..., 38228) = 5433 <0.000007>
    [pid 113656] 22:25:36.058175 read(5, "\357\337@\373&H\377\274\vn?\331\23e1\243\30F\370\327\35X\212q\245*\214\205\231\276\"\237"..., 32795) = 1361 <0.000421>
    [pid 113656] 22:25:36.058679 read(5, "\231\311\3319\251z\5\225\f\356\f\217\331?9\324k\335x\277\354\n[\304\372\323\357\327M\275>\331"..., 31434) = 1370 <0.000004>
    [pid 113656] 22:25:36.058756 read(5, "\370o\10\232\353\370/\235I\354\334\370\343\325}j\21\t\341\10( \0w\261\261\354\350\335\334'\342"..., 30064) = 9840 <0.000008>
    [pid 113656] 22:25:36.058822 read(5, "!u\265\260\316\256\1\276mHa\17W\327V \366\202@EX\205?\323Z\341\2\312\27?\6\253"..., 20224) = 12274 <0.000007>
    [pid 113656] 22:25:36.058886 read(5, "@F\24\202\372.\10R\277P\37H\267K\277u\300\216sf\234\201`\372\315\4\353\t\266`\37\237"..., 7950) = 5480 <0.000005>
    [pid 113656] 22:25:36.058943 read(5, "\32H\356\213\210\315.\227\276\310\230\20Z\274Z~v\200[\262\341\377\31\223\376\2251i\223Y\370\307"..., 2470) = 2470 <0.000004>
    [pid 113656] 22:25:36.059005 write(4, "\322y\376!/\361\4a\257\4\224m\344k\341'\257\220zxy\3\227\7\177\375+\261\313\354e5"..., 65515) = 65515 <0.000016>
    [pid 113656] 22:25:36.059074 read(5, "401a", 4) = 4 <0.000004>
    [pid 113656] 22:25:36.059124 read(5, "\1iH\317'w\300\255\10\224\30\325\22JC\362Q\252X\351}\2611\315\227\300\262\0\t\262\211\335"..., 16406) = 11170 <0.000007>
    [pid 113656] 22:25:36.059181 read(5, "\23\266\234,\35\261\264\31\222\336B\357\245*\222{\315\16g\274\24\7}\377\324\377[\353^\327\20\376"..., 5236) = 1370 <0.003961>
    [pid 113656] 22:25:36.063249 read(5, "\326B\344\266V\370\347\321Up\317mg\312\3\344]\262\205\372\336m\307\341\227\235\365\315\346\201z\353"..., 3866) = 3866 <0.000006>
    [pid 113656] 22:25:36.063339 write(4, "iH\317'w\300\255\10\224\30\325\22JC\362Q\252X\351}\2611\315\227\300\262\0\t\262\211\335\37"..., 16405) = 16405 <0.000009>
    [pid 113656] 22:25:36.063433 read(5, "4005", 4) = 4 <0.000005>
    [pid 113656] 22:25:36.063512 read(5, "\1F+bwUm\4\321AR\0\346\0230H\341\316\371\301\221\301<\24\205\33\331A\245)|-"..., 16385) = 2933 <0.000005>
    [pid 113656] 22:25:36.063589 read(5, "\2029\th\361\372\317\22\216K\227\326\356:X;O\334P\312\301\017100644 com"..., 13452) = 1361 <0.005005>
    [pid 113656] 22:25:36.068703 read(5, "\4\347\4\220U\24\2*\24\301\177\203\216rx\26\363\207\336)+\276\344%\217\342\261i\2\1\24\205"..., 12091) = 12033 <0.000008>
    [pid 113656] 22:25:36.068793 read(5, "\210\232\204-\363\344\332\322X22\222\322R:\355\\M\341\344\277K\365F|e\241\261v\345\nc"..., 58) = 58 <0.000004>
    [pid 113656] 22:25:36.068870 write(4, "F+bwUm\4\321AR\0\346\0230H\341\316\371\301\221\301<\24\205\33\331A\245)|-P"..., 16384) = 16384 <0.000007>
    [pid 113656] 22:25:36.068949 read(5, "4005", 4) = 4 <0.000004>
    [pid 113656] 22:25:36.069024 read(5, "\1\343\356M\225\260\377eRd\24\237{\206\361\277\326\226\32\v7\227\324\205\24=\351\211\305\221t>"..., 16385) = 5409 <0.000006>
    [pid 113656] 22:25:36.069115 read(5, "\346~\261\216\24\332\215\314GW\212c\0050\331,\27y\225tr\254e\322D\302,4\233)\364d"..., 10976) = 1370 <0.000234>
    [pid 113656] 22:25:36.069439 read(5, "(o\0\264\17f\313\rp\337\307\244\212\361\245\254%_\21\330\22\223Z\203\270\200\336\210f\342\332\335"..., 9606) = 5480 <0.000006>
    [pid 113656] 22:25:36.069503 read(5, "\206\277\234k\202?\376\366\33x\260\376\277\375\237K\34\347e\310\327\213I\362\r!?L\20eU\271"..., 4126) = 4126 <0.000011>
    [pid 113656] 22:25:36.069563 write(4, "\343\356M\225\260\377eRd\24\237{\206\361\277\326\226\32\v7\227\324\205\24=\351\211\305\221t>\247"..., 16384) = 16384 <0.000007>
    [pid 113656] 22:25:36.069617 read(5, "4005", 4) = 4 <0.000004>
    [pid 113656] 22:25:36.069666 read(5, "\1\301\34\0'O\5\351\215\10\203R*\355+\332\2\364\2161\331\256\241\344H0\214n\234\21\260U"..., 16385) = 2664 <0.000005>
    [pid 113656] 22:25:36.069717 read(5, "s>4j\205X\377N\217Z\364\27A\344Q\24\221Y\210\5\337\257O\355\3530-\203\347\213\216\243"..., 13721) = 1370 <0.006360>
    [pid 113656] 22:25:36.076177 read(5, "[\27\322\301\245\327l\313\322\332R\343\261\200\243\f4\24\202\220p2\34\356\3\360\360\7F\10\16\347"..., 12351) = 6850 <0.000007>
    [pid 113656] 22:25:36.076266 read(5, "\226\343\257\241pT\203\303a\0\26\232)V\340+\301\241Te\306a\31Ns\17\242\257\235X0\r"..., 5501) = 1370 <0.000090>
    [pid 113656] 22:25:36.076444 read(5, "Rh\372\37aqB\270\373\31\311\353\6\211y\17\221\207@\24\16\257_\306\326\341\253m#\243\314\255"..., 4131) = 4131 <0.000005>
    [pid 113656] 22:25:36.076545 write(4, "\301\34\0'O\5\351\215\10\203R*\355+\332\2\364\2161\331\256\241\344H0\214n\234\21\260U\302"..., 16384) = 16384 <0.000009>
    [pid 113656] 22:25:36.076628 read(5, "2005", 4) = 4 <0.000004>
    [pid 113656] 22:25:36.076709 read(5, "\1;h\201\7\16\210p\352\300\252\306\31\272\256\235>\243J<\21U\21\262\333\201\17\217n*~l"..., 8193) = 8193 <0.000007>
    [pid 113656] 22:25:36.076787 write(4, ";h\201\7\16\210p\352\300\252\306\31\272\256\235>\243J<\21U\21\262\333\201\17\217n*~lI"..., 8192) = 8192 <0.000006>
    [pid 113656] 22:25:36.076872 read(5, "4005", 4) = 4 <0.000004>
    [pid 113656] 22:25:36.076938 read(5, "\1\325\311\316h6\322\35\16\231\f\370\257\267\17>\31\350\351Uh\277b-\247\27,\277\210m\360\307"..., 16385) = 16385 <0.000013>
    [pid 113656] 22:25:36.076999 write(4, "\325\311\316h6\322\35\16\231\f\370\257\267\17>\31\350\351Uh\277b-\247\27,\277\210m\360\307\260"..., 16384) = 16384 <0.000007>
    [pid 113656] 22:25:36.077058 read(5, "4005", 4) = 4 <0.000004>
    [pid 113656] 22:25:36.077112 read(5, "\1U\202\tj\201bZ\212\346\272\266\221\257\271\272\242\202\336Qq@.\206\255`\333\345\311\247\374\26"..., 16385) = 16385 <0.000007>
    [pid 113656] 22:25:36.077166 write(4, "U\202\tj\201bZ\212\346\272\266\221\257\271\272\242\202\336Qq@.\206\255`\333\345\311\247\374\26 "..., 16384) = 16384 <0.000006>
    [pid 113656] 22:25:36.077221 read(5, "4005", 4) = 4 <0.000004>
    [pid 113656] 22:25:36.077270 read(5, "\1\344\31]D<;\306\235\342\213^\16\344<9\"\6e\374\275 \35\0\0,\1m\20\252\273\22"..., 16385) = 2922 <0.000005>
    [pid 113656] 22:25:36.077321 read(5, "\377\28\24H\213u!%*\5]\240\326v\n(lg\260\35@\26A\263K\3\177\1\22\362\27"..., 13463) = 1370 <0.000651>
    [pid 113656] 22:25:36.078044 read(5, "\v\3546A\36\301\2b\0\234\7\207x \245N2Q\355?\216B\f\320b\374p\244\341l\204\302"..., 12093) = 4110 <0.000006>
    [pid 113656] 22:25:36.078130 read(5, "\261\34\266\25\234\251\22U{\356\4\361&\214#'\306\255\253\324e\261\36U\301&3\1\177\30\361\340"..., 7983) = 1370 <0.002329>
    [pid 113656] 22:25:36.080553 read(5, "Y\344\21\340\32\253F\1\246\215\2343\361\262c\325\24\233\231%\346\252E\253f3\257\245bZ\240\331"..., 6613) = 6613 <0.000006>
  4. Watch for:

  • Slow read/write calls (duration in <...>).

  • Unusually large or small I/O sizes.

  • Repeated or failed open/stat calls.


Automating Slow Call Detection

You can grep for slow calls in your strace log:

grep -E 'read|write' strace.log | awk '{ if ($NF > 0.05) print $0 }'

This flags any read/write taking longer than 50ms.


Hints for Field Validation

  • Simulate NFS slowness (e.g., network delay) and verify that read and write durations increase.

  • Use the output to identify which files or operations are slow, and correlate with NFS server logs or metrics.

  • Share strace logs with support if deeper analysis is needed.


Questions?

Share your strace output, and we can help interpret it with this tool.

#!/usr/bin/env python3
"""
An strace analyzer focused on NFS performance insights.
"""
import re
import sys
import time
from collections import defaultdict
from datetime import datetime
# Known syscalls to validate parsing
KNOWN_SYSCALLS = {
    'read', 'write', 'open', 'openat', 'close', 'stat', 'fstat', 'lstat',
    'pread64', 'pwrite64', 'fsync', 'fdatasync', 'mkdir', 'rmdir', 'unlink',
    'rename', 'getdents', 'getdents64', 'lseek', 'mmap', 'munmap', 'brk',
    'rt_sigaction', 'rt_sigprocmask', 'rt_sigreturn', 'ioctl', 'access',
    'pipe', 'select', 'poll', 'epoll_wait', 'clone', 'fork', 'vfork',
    'execve', 'exit', 'wait4', 'kill', 'fcntl', 'flock', 'futex',
    'nanosleep', 'getpid', 'socket', 'connect', 'accept', 'sendto',
    'recvfrom', 'shutdown', 'bind', 'listen', 'getsockname', 'getpeername',
    'socketpair', 'setsockopt', 'getsockopt', 'pselect6', 'ppoll',
    'epoll_create', 'dup', 'dup2', 'pause', 'getitimer', 'alarm',
    'setitimer', 'getpgid', 'getppid', 'getpgrp', 'setsid', 'getcwd',
    'chdir', 'fchdir', 'chmod', 'fchmod', 'chown', 'fchown', 'umask',
    'gettimeofday', 'settimeofday', 'mprotect', 'msync', 'mlock',
    'munlock', 'mlockall', 'munlockall', 'mremap', 'madvise', 'shmget',
    'shmat', 'shmctl', 'semget', 'semop', 'semctl', 'shmdt', 'msgget',
    'msgsnd', 'msgrcv', 'msgctl', 'newfstatat', 'getdents64', 'getcwd',
    'readlink', 'sysinfo', 'times', 'getuid', 'getgid', 'setuid',
    'setgid', 'geteuid', 'getegid', 'setpgid', 'umask', 'uname',
    'statfs', 'fstatfs', 'getpriority', 'setpriority', 'sched_getparam',
    'sched_setparam', 'sched_getscheduler', 'sched_setscheduler',
    'sched_get_priority_max', 'sched_get_priority_min', 'sched_rr_get_interval',
    'recvmsg', 'sendmsg', 'getrlimit', 'setrlimit', 'getrusage',
    'getsockopt', 'setsockopt', 'getsockname', 'getpeername', 'sendfile',
    'symlink', 'link', 'readlinkat', 'fstatat', 'getxattr', 'lgetxattr',
    'fgetxattr', 'listxattr', 'llistxattr', 'flistxattr', 'removexattr',
    'lremovexattr', 'fremovexattr', 'setxattr', 'lsetxattr', 'fsetxattr'
}
def parse_strace_line(line):
    """Parse strace line with better validation."""
    # Pattern for format: PID TIMESTAMP syscall(args) = result <duration>
    # Example: 79033 09:20:48.857449 brk(NULL) = 0x562ec57bb000 <0.000004>
    timing_pattern = r'^\d+\s+\d+:\d+:\d+\.\d+\s+(\w+)\((.*?)\)\s*=\s*([^<]+?)\s*<(\d+\.\d+)>'
    match = re.match(timing_pattern, line.strip())
    if match:
        syscall = match.group(1)
        # Validate it's a real syscall
        if syscall not in KNOWN_SYSCALLS:
            return None
        args = match.group(2)
        result = match.group(3).strip()
        duration = float(match.group(4))
        # Check if it's an error result
        is_error = '-1' in result or 'EINVAL' in result or 'ENOENT' in result or 'error' in result.lower()
        return {
            'syscall': syscall,
            'args': args,
            'result': result,
            'duration': duration,
            'is_error': is_error,
            'line': line.strip()
        }
    # Also try pattern for lines that might have [pid XXXX] format
    alt_pattern = r'^\[pid\s+\d+\]\s+(\w+)\((.*?)\)\s*=\s*([^<]+?)\s*<(\d+\.\d+)>'
    match = re.match(alt_pattern, line.strip())
    if match:
        syscall = match.group(1)
        if syscall not in KNOWN_SYSCALLS:
            return None
        args = match.group(2)
        result = match.group(3).strip()
        duration = float(match.group(4))
        is_error = '-1' in result or 'EINVAL' in result or 'ENOENT' in result or 'error' in result.lower()
        return {
            'syscall': syscall,
            'args': args,
            'result': result,
            'duration': duration,
            'is_error': is_error,
            'line': line.strip()
        }
    return None
def categorize_operations():
    """Define operation categories for analysis."""
    return {
        'File Operations': ['open', 'openat', 'close', 'read', 'write', 
                           'pread64', 'pwrite64', 'lseek'],
        'Directory Operations': ['mkdir', 'rmdir', 'getdents', 'getdents64'],
        'Metadata Operations': ['stat', 'fstat', 'lstat', 'newfstatat', 
                               'chmod', 'chown', 'access'],
        'Sync Operations': ['fsync', 'fdatasync', 'sync', 'syncfs'],
        'Path Operations': ['unlink', 'rename', 'link', 'symlink'],
        'Memory Mapping': ['mmap', 'munmap', 'mremap', 'msync'],
        'Process/Thread': ['futex', 'clone', 'wait4', 'exit'],
    }
def analyze_strace_file(filename):
    """Analyze strace file with progress and validation."""
    import os
    syscall_stats = defaultdict(lambda: {
        'count': 0,
        'total_time': 0,
        'max_time': 0,
        'min_time': float('inf'),
        'errors': 0
    })
    slow_calls = []  # Calls over 10ms
    error_calls = []  # Failed calls
    file_size = os.path.getsize(filename)
    bytes_read = 0
    lines_processed = 0
    calls_found = 0
    parse_errors = 0
    print(f"Analyzing: {filename}")
    print(f"File size: {file_size:,} bytes ({file_size/1024/1024:.2f} MB)")
    print("-" * 60)
    start_time = time.time()
    with open(filename, 'r', buffering=8192*1024) as f:
        for line in f:
            lines_processed += 1
            bytes_read += len(line.encode('utf-8'))
            if lines_processed % 10000 == 0:
                progress = (bytes_read / file_size) * 100
                elapsed = time.time() - start_time
                rate = bytes_read / elapsed / 1024 / 1024 if elapsed > 0 else 0
                print(f"\rProgress: {progress:5.1f}% | Lines: {lines_processed:,} | "
                      f"Valid syscalls: {calls_found:,} | Rate: {rate:.1f} MB/s", 
                      end='', flush=True)
            parsed = parse_strace_line(line)
            if parsed:
                calls_found += 1
                syscall = parsed['syscall']
                duration = parsed['duration']
                result = parsed['result']
                stats = syscall_stats[syscall]
                stats['count'] += 1
                stats['total_time'] += duration
                if duration > stats['max_time']:
                    stats['max_time'] = duration
                if duration < stats['min_time']:
                    stats['min_time'] = duration
                # Track errors
                if parsed['is_error']:
                    stats['errors'] += 1
                    if len(error_calls) < 100:  # Keep first 100 errors
                        error_calls.append(parsed)
                # Track slow calls (>10ms)
                if duration > 0.01:
                    if len(slow_calls) < 1000:
                        slow_calls.append(parsed)
                    elif duration > slow_calls[-1]['duration']:
                        slow_calls[-1] = parsed
                        slow_calls.sort(key=lambda x: x['duration'], reverse=True)
            else:
                # Line didn't parse - might be continuation or data
                if '<' in line and '>' in line and '=' in line:
                    parse_errors += 1
    analysis_time = time.time() - start_time
    print(f"\n\nAnalysis complete in {analysis_time:.2f} seconds")
    print(f"Valid syscalls: {calls_found:,} | Parse errors: {parse_errors:,}")
    return syscall_stats, slow_calls, error_calls
def format_duration(seconds):
    """Format duration in human-readable form."""
    if seconds >= 1:
        return f"{seconds:.3f}s"
    elif seconds >= 0.001:
        return f"{seconds*1000:.3f}ms"
    else:
        return f"{seconds*1000000:.1f}µs"
def print_performance_summary(syscall_stats):
    """Print clear performance summary focused on insights."""
    categories = categorize_operations()
    print("\n" + "="*80)
    print("PERFORMANCE SUMMARY BY OPERATION TYPE")
    print("="*80)
    total_time = sum(s['total_time'] for s in syscall_stats.values())
    total_calls = sum(s['count'] for s in syscall_stats.values())
    # Analyze by category
    category_stats = {}
    for category, ops in categories.items():
        cat_time = 0
        cat_count = 0
        cat_errors = 0
        slowest = None
        for op in ops:
            if op in syscall_stats:
                stats = syscall_stats[op]
                cat_time += stats['total_time']
                cat_count += stats['count']
                cat_errors += stats['errors']
                if not slowest or stats['max_time'] > slowest[1]:
                    slowest = (op, stats['max_time'])
        if cat_count > 0:
            category_stats[category] = {
                'time': cat_time,
                'count': cat_count,
                'errors': cat_errors,
                'slowest': slowest,
                'percent': (cat_time / total_time * 100) if total_time > 0 else 0
            }
    # Sort by time spent
    sorted_categories = sorted(category_stats.items(), 
                              key=lambda x: x[1]['time'], 
                              reverse=True)
    for category, stats in sorted_categories:
        print(f"\n{category}:")
        print(f"  Total time: {format_duration(stats['time'])} ({stats['percent']:.1f}% of total)")
        print(f"  Operations: {stats['count']:,}")
        print(f"  Average: {format_duration(stats['time']/stats['count'])}")
        if stats['errors'] > 0:
            print(f"  Errors: {stats['errors']} ({stats['errors']/stats['count']*100:.1f}%)")
        if stats['slowest']:
            print(f"  Slowest: {stats['slowest'][0]} took {format_duration(stats['slowest'][1])}")
def print_nfs_insights(syscall_stats, slow_calls):
    """Print NFS-specific performance insights."""
    print("\n" + "="*80)
    print("NFS PERFORMANCE INSIGHTS")
    print("="*80)
    # Key NFS operations
    nfs_ops = {
        'open/close': ['open', 'openat', 'close'],
        'read/write': ['read', 'write', 'pread64', 'pwrite64'],
        'metadata': ['stat', 'fstat', 'lstat', 'newfstatat', 'getdents', 'getdents64'],
        'sync': ['fsync', 'fdatasync'],
        'directory': ['mkdir', 'rmdir', 'unlink', 'rename']
    }
    print("\nOperation Performance:")
    print("-" * 70)
    print(f"{'Category':<15} {'Count':>10} {'Total Time':>12} {'Avg Latency':>12} {'Max Latency':>12}")
    print("-" * 70)
    total_nfs_time = 0
    total_nfs_ops = 0
    for category, ops in nfs_ops.items():
        cat_count = 0
        cat_time = 0
        cat_max = 0
        for op in ops:
            if op in syscall_stats:
                stats = syscall_stats[op]
                cat_count += stats['count']
                cat_time += stats['total_time']
                cat_max = max(cat_max, stats['max_time'])
        if cat_count > 0:
            total_nfs_time += cat_time
            total_nfs_ops += cat_count
            avg_time = cat_time / cat_count
            print(f"{category:<15} {cat_count:>10,} {format_duration(cat_time):>12} "
                  f"{format_duration(avg_time):>12} {format_duration(cat_max):>12}")
    # Performance indicators
    print("\n" + "="*80)
    print("PERFORMANCE INDICATORS")
    print("="*80)
    # Calculate metrics
    if total_nfs_ops > 0:
        avg_latency = total_nfs_time / total_nfs_ops * 1000  # in ms
        # Count slow operations
        slow_reads = sum(1 for c in slow_calls if c['syscall'] in ['read', 'pread64'])
        slow_writes = sum(1 for c in slow_calls if c['syscall'] in ['write', 'pwrite64'])
        slow_metadata = sum(1 for c in slow_calls if c['syscall'] in ['stat', 'fstat', 'lstat', 'newfstatat'])
        print(f"\n{'Metric':<40} {'Value':>20}")
        print("-" * 60)
        print(f"{'Average NFS operation latency':<40} {avg_latency:>19.3f}ms")
        print(f"{'Total NFS operations':<40} {total_nfs_ops:>20,}")
        print(f"{'Total time in NFS operations':<40} {format_duration(total_nfs_time):>20}")
        print(f"{'Operations slower than 10ms':<40} {len(slow_calls):>20}")
        print(f"{'  - Slow reads':<40} {slow_reads:>20}")
        print(f"{'  - Slow writes':<40} {slow_writes:>20}")
        print(f"{'  - Slow metadata ops':<40} {slow_metadata:>20}")
        # Performance assessment
        print("\n" + "="*80)
        print("PERFORMANCE ASSESSMENT")
        print("="*80)
        if avg_latency < 1:
            print("✓ EXCELLENT: Average latency under 1ms - NFS is performing very well")
        elif avg_latency < 5:
            print("✓ GOOD: Average latency under 5ms - NFS performance is acceptable")
        elif avg_latency < 10:
            print("⚠ WARNING: Average latency 5-10ms - Some performance degradation")
        else:
            print("✗ POOR: Average latency over 10ms - Significant NFS performance issues")
        if slow_reads > 100 or slow_writes > 100:
            print("\n⚠ HIGH SLOW OPERATION COUNT:")
            if slow_reads > 100:
                print(f"  - {slow_reads} slow read operations detected")
            if slow_writes > 100:
                print(f"  - {slow_writes} slow write operations detected")
            print("  Consider investigating network latency or storage performance")
def print_slow_operations(slow_calls, limit=10):
    """Print slowest operations with context."""
    print("\n" + "="*80)
    print(f"TOP {limit} SLOWEST OPERATIONS")
    print("="*80)
    # Filter out wait4 and similar non-NFS operations
    nfs_related = ['open', 'openat', 'close', 'read', 'write', 'pread64', 
                   'pwrite64', 'stat', 'fstat', 'lstat', 'newfstatat', 
                   'fsync', 'fdatasync', 'mkdir', 'unlink', 'rename', 
                   'getdents', 'getdents64']
    nfs_slow_calls = [c for c in slow_calls if c['syscall'] in nfs_related]
    if nfs_slow_calls:
        print("\nNFS-Related Slow Operations:")
        print("-" * 80)
        print(f"{'Duration':>10} {'Syscall':<12} {'Details'}")
        print("-" * 80)
        for call in nfs_slow_calls[:limit]:
            args_preview = call['args'][:50] + '...' if len(call['args']) > 50 else call['args']
            print(f"{format_duration(call['duration']):>10} {call['syscall']:<12} {args_preview}")
    # Also show non-NFS slow calls for context
    other_slow = [c for c in slow_calls if c['syscall'] not in nfs_related][:5]
    if other_slow:
        print("\nOther Slow Operations (for context):")
        print("-" * 80)
        for call in other_slow:
            print(f"{format_duration(call['duration']):>10} {call['syscall']:<12}")
def main():
    if len(sys.argv) != 2:
        print("Usage: python strace_analyzer_improved.py <strace_output_file>")
        print("\nGenerate strace output with:")
        print("  strace -f -T -tt -o output.txt command")
        sys.exit(1)
    filename = sys.argv[1]
    try:
        syscall_stats, slow_calls, error_calls = analyze_strace_file(filename)
        # Print comprehensive report
        print_performance_summary(syscall_stats)
        print_nfs_insights(syscall_stats, slow_calls)
        print_slow_operations(slow_calls)
        # Print errors if any
        if error_calls:
            print("\n" + "="*80)
            print("FAILED OPERATIONS (First 10)")
            print("="*80)
            for call in error_calls[:10]:
                print(f"{call['syscall']}: {call['args'][:60]}...")
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        sys.exit(1)
    except KeyboardInterrupt:
        print("\n\nAnalysis interrupted.")
        sys.exit(1)
if __name__ == "__main__":
    main()

Example Output:

$ python3 ./strace_analyzer.py git_clone_trace.txt
Analyzing: git_clone_trace.txt
File size: 420,631,345 bytes (401.15 MB)
------------------------------------------------------------
Progress: 100.0% | Lines: 4,290,000 | Valid syscalls: 577,251 | Rate: 3.0 MB/s
Analysis complete in 131.55 seconds
Valid syscalls: 577,708 | Parse errors: 1,904,059
================================================================================
NFS OPERATION PERFORMANCE
================================================================================
Operation Performance:
----------------------------------------------------------------------
Category             Count   Total Time  Avg Latency  Max Latency
----------------------------------------------------------------------
open/close           6,638       5.207s      784.5µs     24.030ms
read/write         492,432      31.526s       64.0µs    234.538ms
metadata             7,436       3.277s      440.6µs      9.441ms
sync                     3     21.175ms      7.058ms     13.304ms
directory              311    536.980ms      1.727ms      6.502ms
================================================================================
PERFORMANCE INDICATORS
================================================================================
Metric                                                  Value
------------------------------------------------------------
Average NFS operation latency                          0.080ms
Total NFS operations                                  506,820
Total time in NFS operations                          40.568s
Operations slower than 10ms                               149
  - Slow reads                                             46
  - Slow writes                                            28
  - Slow metadata ops                                       0
================================================================================
PERFORMANCE ASSESSMENT
================================================================================
✓ EXCELLENT: Average latency under 1ms - NFS is performing very well
================================================================================
BREAKDOWN BY OPERATION TYPE
================================================================================
File Operations:
  Total time: 36.735s (68.2% of total)
  Operations: 499,238
  Average: 73.6µs
  Errors: 885 (0.2%)
  Slowest: read took 234.538ms
Process/Thread:
  Total time: 11.877s (22.1% of total)
  Operations: 61,980
  Average: 191.6µs
  Errors: 553 (0.9%)
  Slowest: futex took 28.602ms
Metadata Operations:
  Total time: 3.286s (6.1% of total)
  Operations: 7,525
  Average: 436.6µs
  Errors: 3951 (52.5%)
  Slowest: fstat took 9.441ms
Directory Operations:
  Total time: 479.873ms (0.9% of total)
  Operations: 309
  Average: 1.553ms
  Errors: 2 (0.6%)
  Slowest: mkdir took 6.502ms
Path Operations:
  Total time: 68.379ms (0.1% of total)
  Operations: 32
  Average: 2.137ms
  Slowest: rename took 3.716ms
Memory Mapping:
  Total time: 35.357ms (0.1% of total)
  Operations: 291
  Average: 121.5µs
  Slowest: munmap took 14.876ms
Sync Operations:
  Total time: 21.175ms (0.0% of total)
  Operations: 3
  Average: 7.058ms
  Slowest: fsync took 13.304ms
================================================================================
TOP 10 SLOWEST OPERATIONS
================================================================================
NFS-Related Slow Operations:
--------------------------------------------------------------------------------
  Duration Syscall      Details
--------------------------------------------------------------------------------
  11.317ms read         6, "rust Root Certification Authorit"..., 4096
  26.012ms read         5, "\343/\345\313\301\267\374<\300\202\34\fw\257D\...
  23.879ms read         5, "\313ID/\355<\243w\36\27N)\204\361\373Z\300\301...
  20.824ms read         5, "\27\3\3\5k", 5
  23.476ms read         5, "\27\3\3\5k", 5
  21.087ms read         5, "\333\345\333\203M\247?\326\371 +[\326\341\16Er...
  23.844ms read         5, "\27\3\3\5k", 5
  20.906ms read         5, "\27\3\3\5k", 5
  23.917ms read         5, "\27\3\3\5k", 5
  24.854ms read         5, "\348\300\16\363\330F\27\10\262\177\5;\374\213\...
Other Slow Operations (for context):
--------------------------------------------------------------------------------
  23.201ms pselect6
  11.997ms setsockopt
  36.037ms pselect6
  35.509ms pselect6
  50.067ms pselect6
================================================================================
FAILED OPERATIONS (First 10)
================================================================================
access: "/etc/ld.so.preload", R_OK...
access: "/etc/gitconfig", R_OK...
access: "/home/foobarbaz/.config/git/config", R_OK...
access: "/home/foobarbaz/.gitconfig", R_OK...
access: "/etc/gitconfig", R_OK...
access: "/home/foobarbaz/.config/git/config", R_OK...
access: "/home/foobarbaz/.gitconfig", R_OK...
openat: AT_FDCWD, "/usr/share/locale/en_US.UTF-8/LC_MESSAGES/git.mo"...
openat: AT_FDCWD, "/usr/share/locale/en_US.utf8/LC_MESSAGES/git.mo",...
openat: AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/git.mo", O_RD...
$