Using strace to Troubleshoot NFS Performance
Summary:
This guide shows how to use strace to diagnose NFS-related performance issues, interpret syscall patterns, and identify slow or inefficient I/O. Real-world examples and actionable tips are included.
How strace Works - A Technical Overview
What is strace?
strace is a diagnostic tool that intercepts and records system calls made by a process. It sits between your application and the Linux kernel, capturing every request your program makes to the operating system.
How Does It Work?
Process Attachment via ptrace()
Application → [strace intercepts] → Kernel → NFS/StorageUses the
ptrace()system call to attach to a process.Can attach to running processes (
-p PID) or start new ones.Acts like a debugger, pausing the process at each system call.
System Call Interception
When your app makes a system call (like
read(),write(),open()):Process is paused
strace records the call details
Kernel executes the system call
strace records the result and timing
Process continues
This creates up to a 2x increase in context switches.
Timing Measurement
read(5, "data...", 4096) = 1024 <0.002145> ↑ Time spent in kernelThe
<0.002145>shows time spent inside the kernel.Includes time waiting for the NFS server response.
Critical for identifying slow operations.
What Gets Captured?
System Call Information:
Call name:
openat,read,write, etc.Arguments: file descriptors, buffers, flags
Return value: success/failure codes
Timing: duration of the call
Process Information:
PID (Process ID)
Timestamp when call was made
Thread information with
-fflag
Why It's Perfect for NFS Debugging
NFS operations appear as regular file system calls to applications:
Application thinks: strace sees: What actually happens:
fd = open("file") → openat(...) = 3 → NFS RPC: LOOKUP + OPEN
read(fd, buf, 4096) → read(3,...) = 4096 → NFS RPC: READ
close(fd) → close(3) = 0 → NFS RPC: CLOSEPerformance Impact
Overhead: strace adds 10-100x slowdown to system calls
Why: Each syscall requires two context switches
Best practice: Use in test environments or briefly in production
Key strace Options for NFS
bashstrace -f -T -tt -o trace.out command
│ │ │ └─ Write to file
│ │ └─ Absolute timestamps
│ └─ Show time in syscalls
└─ Follow child processes
# Filter specific calls
strace -e trace=file # All file operations
strace -e trace=open,read,write,close # Specific callsWhat the Timing Tells You
read(4, "PACK\0\0\0\2\0\69\n", 12) = 12 <0.234538>This 234ms read could indicate:
Network latency to NFS server
Server disk I/O slowness
NFS protocol overhead (authentication, etc.)
Network congestion
Behind the Scenes - NFS + strace
When you read a file over NFS:
App calls
read()→ strace starts timerKernel VFS layer receives call
NFS client creates RPC request
Network round-trip to NFS server
Server reads from disk
Server sends response
Kernel returns data to app → strace stops timer
The beauty of strace is it captures the total time, including all network and server delays!
📋 Quick Example
bash# Fast local disk read
read(3, "data", 4096) = 4096 <0.000052> # 52 microseconds
# Slow NFS read
read(3, "data", 4096) = 4096 <0.045123> # 45 milliseconds
# Network timeout/retry
read(3, "data", 4096) = 4096 <5.234567> # 5+ seconds!This is why strace is so powerful for NFS troubleshooting - it immediately shows you which operations are slow, without needing to understand the complex NFS protocol underneath
When to Use strace
Application is reported as “slow,” but NFS/server metrics look healthy.
You need to pinpoint if slowness is due to system calls, file I/O, or NFS/network issues.
Basic strace Usage
Profile a command and get a syscall summary:
strace -f -c <your-command>-f: Follows child processes.-c: Outputs a summary of system call statistics.
Example Output:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
49.69 13.521276 3380319 4 wait4
27.59 7.507152 16 466692 82293 futex
10.31 2.805373 5 507523 pread64
8.58 2.333674 3 735952 31 read
3.03 0.825550 1 414052 write
0.39 0.105212 26 3996 1138 openat
...
100.00 27.213860 12 2150774 87458 totalWhat to look for:
High time in
read,write, orpread64may indicate I/O bottlenecks.High error counts in
openat,read, orwritecould signal permission or NFS issues.Compare with local filesystem runs to isolate NFS as the bottleneck.
Advanced: Trace Live NFS I/O with Timing
Attach to a running process and see syscall durations:
strace -T -f -e trace=read,write,open,close,stat,fstat,lstat,getdents -p <pid>
-T: Shows syscall duration (in seconds).
-e trace=...: Limits output to relevant file and I/O syscalls.Example Output:
[pid 78795] read(5, ..., 7512) = 1370 <0.001357>
[pid 78795] write(4, ..., 8192) = 8192 <0.000015>
[pid 78795] read(5, ..., 14389) = 14389 <0.022307>How to Interpret the Output
What is a "Large" read/write?
The third argument to
read/writeis the buffer size (e.g., 8192, 16384).Typical NFS I/O sizes are 8K or 16K. Much larger sizes (>1MB) are rare unless tuned.
Very small sizes (e.g., 1-100 bytes) in a tight loop may indicate inefficient application I/O.
What is a "Slow" read/write?
The number in
<...>at the end of each line is the syscall duration in seconds.Most healthy NFS reads/writes complete in a few milliseconds (
<0.001>).Consistent durations above 50-100ms (
<0.050>or higher) are worth investigating.Example of a slightly slow call:
[pid 78795] read(5, ..., 14389) = 14389 <0.022307>Example of a problematic call (not in your logs, but for reference):
read(5, ..., 16384) = 16384 <0.500000>
What to Look For
Symptom | What to Look For in Output | What it Means |
|---|---|---|
Fast read/write |
| Normal, healthy I/O |
Occasional 10-20ms call |
| Slightly slow, but not critical |
Consistent >50ms calls |
| NFS/network/server may be slow |
Very small I/O |
| Inefficient app I/O pattern |
Practical Workflow
Start your workload and capture its PID:
sh -c 'echo $$; exec <your-command>' &Attach strace to the PID:
strace -T -f -e trace=read,write,open,close,stat,fstat,lstat,getdents -p <pid>Real Example
[vastdata@MetaSpaceWorker1 pandas]$ sh -c 'echo $$; exec git clone --quiet https://github.com/pandas-dev/pandas.git > /dev/null 2>&1' & [1] 113652 [vastdata@MetaSpaceWorker1 pandas]$ 113652 [vastdata@MetaSpaceWorker1 pandas]$ strace -T -tt -f -e trace=read,write,open,close,stat,fstat,lstat,getdents -p 113652 strace: Process 113652 attached with 2 threads [pid 113656] 22:25:36.052817 read(5, <unfinished ...> [pid 113652] 22:25:36.052937 read(6, <unfinished ...> [pid 113656] 22:25:36.058008 <... read resumed>"\0243j\363|>\220i\217\272/\223\235M\340m\241\230\313gA\267\367J\361\205'd\361\372\336\235"..., 39598) = 1370 <0.005051> [pid 113656] 22:25:36.058091 read(5, "\277E\223_\260\353M<\31\6\0\262?4c\346\22\210\267\26x\234{\314\363\230gB\240\210\327\322"..., 38228) = 5433 <0.000007> [pid 113656] 22:25:36.058175 read(5, "\357\337@\373&H\377\274\vn?\331\23e1\243\30F\370\327\35X\212q\245*\214\205\231\276\"\237"..., 32795) = 1361 <0.000421> [pid 113656] 22:25:36.058679 read(5, "\231\311\3319\251z\5\225\f\356\f\217\331?9\324k\335x\277\354\n[\304\372\323\357\327M\275>\331"..., 31434) = 1370 <0.000004> [pid 113656] 22:25:36.058756 read(5, "\370o\10\232\353\370/\235I\354\334\370\343\325}j\21\t\341\10( \0w\261\261\354\350\335\334'\342"..., 30064) = 9840 <0.000008> [pid 113656] 22:25:36.058822 read(5, "!u\265\260\316\256\1\276mHa\17W\327V \366\202@EX\205?\323Z\341\2\312\27?\6\253"..., 20224) = 12274 <0.000007> [pid 113656] 22:25:36.058886 read(5, "@F\24\202\372.\10R\277P\37H\267K\277u\300\216sf\234\201`\372\315\4\353\t\266`\37\237"..., 7950) = 5480 <0.000005> [pid 113656] 22:25:36.058943 read(5, "\32H\356\213\210\315.\227\276\310\230\20Z\274Z~v\200[\262\341\377\31\223\376\2251i\223Y\370\307"..., 2470) = 2470 <0.000004> [pid 113656] 22:25:36.059005 write(4, "\322y\376!/\361\4a\257\4\224m\344k\341'\257\220zxy\3\227\7\177\375+\261\313\354e5"..., 65515) = 65515 <0.000016> [pid 113656] 22:25:36.059074 read(5, "401a", 4) = 4 <0.000004> [pid 113656] 22:25:36.059124 read(5, "\1iH\317'w\300\255\10\224\30\325\22JC\362Q\252X\351}\2611\315\227\300\262\0\t\262\211\335"..., 16406) = 11170 <0.000007> [pid 113656] 22:25:36.059181 read(5, "\23\266\234,\35\261\264\31\222\336B\357\245*\222{\315\16g\274\24\7}\377\324\377[\353^\327\20\376"..., 5236) = 1370 <0.003961> [pid 113656] 22:25:36.063249 read(5, "\326B\344\266V\370\347\321Up\317mg\312\3\344]\262\205\372\336m\307\341\227\235\365\315\346\201z\353"..., 3866) = 3866 <0.000006> [pid 113656] 22:25:36.063339 write(4, "iH\317'w\300\255\10\224\30\325\22JC\362Q\252X\351}\2611\315\227\300\262\0\t\262\211\335\37"..., 16405) = 16405 <0.000009> [pid 113656] 22:25:36.063433 read(5, "4005", 4) = 4 <0.000005> [pid 113656] 22:25:36.063512 read(5, "\1F+bwUm\4\321AR\0\346\0230H\341\316\371\301\221\301<\24\205\33\331A\245)|-"..., 16385) = 2933 <0.000005> [pid 113656] 22:25:36.063589 read(5, "\2029\th\361\372\317\22\216K\227\326\356:X;O\334P\312\301\017100644 com"..., 13452) = 1361 <0.005005> [pid 113656] 22:25:36.068703 read(5, "\4\347\4\220U\24\2*\24\301\177\203\216rx\26\363\207\336)+\276\344%\217\342\261i\2\1\24\205"..., 12091) = 12033 <0.000008> [pid 113656] 22:25:36.068793 read(5, "\210\232\204-\363\344\332\322X22\222\322R:\355\\M\341\344\277K\365F|e\241\261v\345\nc"..., 58) = 58 <0.000004> [pid 113656] 22:25:36.068870 write(4, "F+bwUm\4\321AR\0\346\0230H\341\316\371\301\221\301<\24\205\33\331A\245)|-P"..., 16384) = 16384 <0.000007> [pid 113656] 22:25:36.068949 read(5, "4005", 4) = 4 <0.000004> [pid 113656] 22:25:36.069024 read(5, "\1\343\356M\225\260\377eRd\24\237{\206\361\277\326\226\32\v7\227\324\205\24=\351\211\305\221t>"..., 16385) = 5409 <0.000006> [pid 113656] 22:25:36.069115 read(5, "\346~\261\216\24\332\215\314GW\212c\0050\331,\27y\225tr\254e\322D\302,4\233)\364d"..., 10976) = 1370 <0.000234> [pid 113656] 22:25:36.069439 read(5, "(o\0\264\17f\313\rp\337\307\244\212\361\245\254%_\21\330\22\223Z\203\270\200\336\210f\342\332\335"..., 9606) = 5480 <0.000006> [pid 113656] 22:25:36.069503 read(5, "\206\277\234k\202?\376\366\33x\260\376\277\375\237K\34\347e\310\327\213I\362\r!?L\20eU\271"..., 4126) = 4126 <0.000011> [pid 113656] 22:25:36.069563 write(4, "\343\356M\225\260\377eRd\24\237{\206\361\277\326\226\32\v7\227\324\205\24=\351\211\305\221t>\247"..., 16384) = 16384 <0.000007> [pid 113656] 22:25:36.069617 read(5, "4005", 4) = 4 <0.000004> [pid 113656] 22:25:36.069666 read(5, "\1\301\34\0'O\5\351\215\10\203R*\355+\332\2\364\2161\331\256\241\344H0\214n\234\21\260U"..., 16385) = 2664 <0.000005> [pid 113656] 22:25:36.069717 read(5, "s>4j\205X\377N\217Z\364\27A\344Q\24\221Y\210\5\337\257O\355\3530-\203\347\213\216\243"..., 13721) = 1370 <0.006360> [pid 113656] 22:25:36.076177 read(5, "[\27\322\301\245\327l\313\322\332R\343\261\200\243\f4\24\202\220p2\34\356\3\360\360\7F\10\16\347"..., 12351) = 6850 <0.000007> [pid 113656] 22:25:36.076266 read(5, "\226\343\257\241pT\203\303a\0\26\232)V\340+\301\241Te\306a\31Ns\17\242\257\235X0\r"..., 5501) = 1370 <0.000090> [pid 113656] 22:25:36.076444 read(5, "Rh\372\37aqB\270\373\31\311\353\6\211y\17\221\207@\24\16\257_\306\326\341\253m#\243\314\255"..., 4131) = 4131 <0.000005> [pid 113656] 22:25:36.076545 write(4, "\301\34\0'O\5\351\215\10\203R*\355+\332\2\364\2161\331\256\241\344H0\214n\234\21\260U\302"..., 16384) = 16384 <0.000009> [pid 113656] 22:25:36.076628 read(5, "2005", 4) = 4 <0.000004> [pid 113656] 22:25:36.076709 read(5, "\1;h\201\7\16\210p\352\300\252\306\31\272\256\235>\243J<\21U\21\262\333\201\17\217n*~l"..., 8193) = 8193 <0.000007> [pid 113656] 22:25:36.076787 write(4, ";h\201\7\16\210p\352\300\252\306\31\272\256\235>\243J<\21U\21\262\333\201\17\217n*~lI"..., 8192) = 8192 <0.000006> [pid 113656] 22:25:36.076872 read(5, "4005", 4) = 4 <0.000004> [pid 113656] 22:25:36.076938 read(5, "\1\325\311\316h6\322\35\16\231\f\370\257\267\17>\31\350\351Uh\277b-\247\27,\277\210m\360\307"..., 16385) = 16385 <0.000013> [pid 113656] 22:25:36.076999 write(4, "\325\311\316h6\322\35\16\231\f\370\257\267\17>\31\350\351Uh\277b-\247\27,\277\210m\360\307\260"..., 16384) = 16384 <0.000007> [pid 113656] 22:25:36.077058 read(5, "4005", 4) = 4 <0.000004> [pid 113656] 22:25:36.077112 read(5, "\1U\202\tj\201bZ\212\346\272\266\221\257\271\272\242\202\336Qq@.\206\255`\333\345\311\247\374\26"..., 16385) = 16385 <0.000007> [pid 113656] 22:25:36.077166 write(4, "U\202\tj\201bZ\212\346\272\266\221\257\271\272\242\202\336Qq@.\206\255`\333\345\311\247\374\26 "..., 16384) = 16384 <0.000006> [pid 113656] 22:25:36.077221 read(5, "4005", 4) = 4 <0.000004> [pid 113656] 22:25:36.077270 read(5, "\1\344\31]D<;\306\235\342\213^\16\344<9\"\6e\374\275 \35\0\0,\1m\20\252\273\22"..., 16385) = 2922 <0.000005> [pid 113656] 22:25:36.077321 read(5, "\377\28\24H\213u!%*\5]\240\326v\n(lg\260\35@\26A\263K\3\177\1\22\362\27"..., 13463) = 1370 <0.000651> [pid 113656] 22:25:36.078044 read(5, "\v\3546A\36\301\2b\0\234\7\207x \245N2Q\355?\216B\f\320b\374p\244\341l\204\302"..., 12093) = 4110 <0.000006> [pid 113656] 22:25:36.078130 read(5, "\261\34\266\25\234\251\22U{\356\4\361&\214#'\306\255\253\324e\261\36U\301&3\1\177\30\361\340"..., 7983) = 1370 <0.002329> [pid 113656] 22:25:36.080553 read(5, "Y\344\21\340\32\253F\1\246\215\2343\361\262c\325\24\233\231%\346\252E\253f3\257\245bZ\240\331"..., 6613) = 6613 <0.000006>Watch for:
Slow read/write calls (duration in
<...>).Unusually large or small I/O sizes.
Repeated or failed open/stat calls.
Automating Slow Call Detection
You can grep for slow calls in your strace log:
grep -E 'read|write' strace.log | awk '{ if ($NF > 0.05) print $0 }'This flags any read/write taking longer than 50ms.
Hints for Field Validation
Simulate NFS slowness (e.g., network delay) and verify that read and write durations increase.
Use the output to identify which files or operations are slow, and correlate with NFS server logs or metrics.
Share strace logs with support if deeper analysis is needed.
Questions?
Share your strace output, and we can help interpret it with this tool.
#!/usr/bin/env python3
"""
An strace analyzer focused on NFS performance insights.
"""
import re
import sys
import time
from collections import defaultdict
from datetime import datetime
# Known syscalls to validate parsing
KNOWN_SYSCALLS = {
'read', 'write', 'open', 'openat', 'close', 'stat', 'fstat', 'lstat',
'pread64', 'pwrite64', 'fsync', 'fdatasync', 'mkdir', 'rmdir', 'unlink',
'rename', 'getdents', 'getdents64', 'lseek', 'mmap', 'munmap', 'brk',
'rt_sigaction', 'rt_sigprocmask', 'rt_sigreturn', 'ioctl', 'access',
'pipe', 'select', 'poll', 'epoll_wait', 'clone', 'fork', 'vfork',
'execve', 'exit', 'wait4', 'kill', 'fcntl', 'flock', 'futex',
'nanosleep', 'getpid', 'socket', 'connect', 'accept', 'sendto',
'recvfrom', 'shutdown', 'bind', 'listen', 'getsockname', 'getpeername',
'socketpair', 'setsockopt', 'getsockopt', 'pselect6', 'ppoll',
'epoll_create', 'dup', 'dup2', 'pause', 'getitimer', 'alarm',
'setitimer', 'getpgid', 'getppid', 'getpgrp', 'setsid', 'getcwd',
'chdir', 'fchdir', 'chmod', 'fchmod', 'chown', 'fchown', 'umask',
'gettimeofday', 'settimeofday', 'mprotect', 'msync', 'mlock',
'munlock', 'mlockall', 'munlockall', 'mremap', 'madvise', 'shmget',
'shmat', 'shmctl', 'semget', 'semop', 'semctl', 'shmdt', 'msgget',
'msgsnd', 'msgrcv', 'msgctl', 'newfstatat', 'getdents64', 'getcwd',
'readlink', 'sysinfo', 'times', 'getuid', 'getgid', 'setuid',
'setgid', 'geteuid', 'getegid', 'setpgid', 'umask', 'uname',
'statfs', 'fstatfs', 'getpriority', 'setpriority', 'sched_getparam',
'sched_setparam', 'sched_getscheduler', 'sched_setscheduler',
'sched_get_priority_max', 'sched_get_priority_min', 'sched_rr_get_interval',
'recvmsg', 'sendmsg', 'getrlimit', 'setrlimit', 'getrusage',
'getsockopt', 'setsockopt', 'getsockname', 'getpeername', 'sendfile',
'symlink', 'link', 'readlinkat', 'fstatat', 'getxattr', 'lgetxattr',
'fgetxattr', 'listxattr', 'llistxattr', 'flistxattr', 'removexattr',
'lremovexattr', 'fremovexattr', 'setxattr', 'lsetxattr', 'fsetxattr'
}
def parse_strace_line(line):
"""Parse strace line with better validation."""
# Pattern for format: PID TIMESTAMP syscall(args) = result <duration>
# Example: 79033 09:20:48.857449 brk(NULL) = 0x562ec57bb000 <0.000004>
timing_pattern = r'^\d+\s+\d+:\d+:\d+\.\d+\s+(\w+)\((.*?)\)\s*=\s*([^<]+?)\s*<(\d+\.\d+)>'
match = re.match(timing_pattern, line.strip())
if match:
syscall = match.group(1)
# Validate it's a real syscall
if syscall not in KNOWN_SYSCALLS:
return None
args = match.group(2)
result = match.group(3).strip()
duration = float(match.group(4))
# Check if it's an error result
is_error = '-1' in result or 'EINVAL' in result or 'ENOENT' in result or 'error' in result.lower()
return {
'syscall': syscall,
'args': args,
'result': result,
'duration': duration,
'is_error': is_error,
'line': line.strip()
}
# Also try pattern for lines that might have [pid XXXX] format
alt_pattern = r'^\[pid\s+\d+\]\s+(\w+)\((.*?)\)\s*=\s*([^<]+?)\s*<(\d+\.\d+)>'
match = re.match(alt_pattern, line.strip())
if match:
syscall = match.group(1)
if syscall not in KNOWN_SYSCALLS:
return None
args = match.group(2)
result = match.group(3).strip()
duration = float(match.group(4))
is_error = '-1' in result or 'EINVAL' in result or 'ENOENT' in result or 'error' in result.lower()
return {
'syscall': syscall,
'args': args,
'result': result,
'duration': duration,
'is_error': is_error,
'line': line.strip()
}
return None
def categorize_operations():
"""Define operation categories for analysis."""
return {
'File Operations': ['open', 'openat', 'close', 'read', 'write',
'pread64', 'pwrite64', 'lseek'],
'Directory Operations': ['mkdir', 'rmdir', 'getdents', 'getdents64'],
'Metadata Operations': ['stat', 'fstat', 'lstat', 'newfstatat',
'chmod', 'chown', 'access'],
'Sync Operations': ['fsync', 'fdatasync', 'sync', 'syncfs'],
'Path Operations': ['unlink', 'rename', 'link', 'symlink'],
'Memory Mapping': ['mmap', 'munmap', 'mremap', 'msync'],
'Process/Thread': ['futex', 'clone', 'wait4', 'exit'],
}
def analyze_strace_file(filename):
"""Analyze strace file with progress and validation."""
import os
syscall_stats = defaultdict(lambda: {
'count': 0,
'total_time': 0,
'max_time': 0,
'min_time': float('inf'),
'errors': 0
})
slow_calls = [] # Calls over 10ms
error_calls = [] # Failed calls
file_size = os.path.getsize(filename)
bytes_read = 0
lines_processed = 0
calls_found = 0
parse_errors = 0
print(f"Analyzing: {filename}")
print(f"File size: {file_size:,} bytes ({file_size/1024/1024:.2f} MB)")
print("-" * 60)
start_time = time.time()
with open(filename, 'r', buffering=8192*1024) as f:
for line in f:
lines_processed += 1
bytes_read += len(line.encode('utf-8'))
if lines_processed % 10000 == 0:
progress = (bytes_read / file_size) * 100
elapsed = time.time() - start_time
rate = bytes_read / elapsed / 1024 / 1024 if elapsed > 0 else 0
print(f"\rProgress: {progress:5.1f}% | Lines: {lines_processed:,} | "
f"Valid syscalls: {calls_found:,} | Rate: {rate:.1f} MB/s",
end='', flush=True)
parsed = parse_strace_line(line)
if parsed:
calls_found += 1
syscall = parsed['syscall']
duration = parsed['duration']
result = parsed['result']
stats = syscall_stats[syscall]
stats['count'] += 1
stats['total_time'] += duration
if duration > stats['max_time']:
stats['max_time'] = duration
if duration < stats['min_time']:
stats['min_time'] = duration
# Track errors
if parsed['is_error']:
stats['errors'] += 1
if len(error_calls) < 100: # Keep first 100 errors
error_calls.append(parsed)
# Track slow calls (>10ms)
if duration > 0.01:
if len(slow_calls) < 1000:
slow_calls.append(parsed)
elif duration > slow_calls[-1]['duration']:
slow_calls[-1] = parsed
slow_calls.sort(key=lambda x: x['duration'], reverse=True)
else:
# Line didn't parse - might be continuation or data
if '<' in line and '>' in line and '=' in line:
parse_errors += 1
analysis_time = time.time() - start_time
print(f"\n\nAnalysis complete in {analysis_time:.2f} seconds")
print(f"Valid syscalls: {calls_found:,} | Parse errors: {parse_errors:,}")
return syscall_stats, slow_calls, error_calls
def format_duration(seconds):
"""Format duration in human-readable form."""
if seconds >= 1:
return f"{seconds:.3f}s"
elif seconds >= 0.001:
return f"{seconds*1000:.3f}ms"
else:
return f"{seconds*1000000:.1f}µs"
def print_performance_summary(syscall_stats):
"""Print clear performance summary focused on insights."""
categories = categorize_operations()
print("\n" + "="*80)
print("PERFORMANCE SUMMARY BY OPERATION TYPE")
print("="*80)
total_time = sum(s['total_time'] for s in syscall_stats.values())
total_calls = sum(s['count'] for s in syscall_stats.values())
# Analyze by category
category_stats = {}
for category, ops in categories.items():
cat_time = 0
cat_count = 0
cat_errors = 0
slowest = None
for op in ops:
if op in syscall_stats:
stats = syscall_stats[op]
cat_time += stats['total_time']
cat_count += stats['count']
cat_errors += stats['errors']
if not slowest or stats['max_time'] > slowest[1]:
slowest = (op, stats['max_time'])
if cat_count > 0:
category_stats[category] = {
'time': cat_time,
'count': cat_count,
'errors': cat_errors,
'slowest': slowest,
'percent': (cat_time / total_time * 100) if total_time > 0 else 0
}
# Sort by time spent
sorted_categories = sorted(category_stats.items(),
key=lambda x: x[1]['time'],
reverse=True)
for category, stats in sorted_categories:
print(f"\n{category}:")
print(f" Total time: {format_duration(stats['time'])} ({stats['percent']:.1f}% of total)")
print(f" Operations: {stats['count']:,}")
print(f" Average: {format_duration(stats['time']/stats['count'])}")
if stats['errors'] > 0:
print(f" Errors: {stats['errors']} ({stats['errors']/stats['count']*100:.1f}%)")
if stats['slowest']:
print(f" Slowest: {stats['slowest'][0]} took {format_duration(stats['slowest'][1])}")
def print_nfs_insights(syscall_stats, slow_calls):
"""Print NFS-specific performance insights."""
print("\n" + "="*80)
print("NFS PERFORMANCE INSIGHTS")
print("="*80)
# Key NFS operations
nfs_ops = {
'open/close': ['open', 'openat', 'close'],
'read/write': ['read', 'write', 'pread64', 'pwrite64'],
'metadata': ['stat', 'fstat', 'lstat', 'newfstatat', 'getdents', 'getdents64'],
'sync': ['fsync', 'fdatasync'],
'directory': ['mkdir', 'rmdir', 'unlink', 'rename']
}
print("\nOperation Performance:")
print("-" * 70)
print(f"{'Category':<15} {'Count':>10} {'Total Time':>12} {'Avg Latency':>12} {'Max Latency':>12}")
print("-" * 70)
total_nfs_time = 0
total_nfs_ops = 0
for category, ops in nfs_ops.items():
cat_count = 0
cat_time = 0
cat_max = 0
for op in ops:
if op in syscall_stats:
stats = syscall_stats[op]
cat_count += stats['count']
cat_time += stats['total_time']
cat_max = max(cat_max, stats['max_time'])
if cat_count > 0:
total_nfs_time += cat_time
total_nfs_ops += cat_count
avg_time = cat_time / cat_count
print(f"{category:<15} {cat_count:>10,} {format_duration(cat_time):>12} "
f"{format_duration(avg_time):>12} {format_duration(cat_max):>12}")
# Performance indicators
print("\n" + "="*80)
print("PERFORMANCE INDICATORS")
print("="*80)
# Calculate metrics
if total_nfs_ops > 0:
avg_latency = total_nfs_time / total_nfs_ops * 1000 # in ms
# Count slow operations
slow_reads = sum(1 for c in slow_calls if c['syscall'] in ['read', 'pread64'])
slow_writes = sum(1 for c in slow_calls if c['syscall'] in ['write', 'pwrite64'])
slow_metadata = sum(1 for c in slow_calls if c['syscall'] in ['stat', 'fstat', 'lstat', 'newfstatat'])
print(f"\n{'Metric':<40} {'Value':>20}")
print("-" * 60)
print(f"{'Average NFS operation latency':<40} {avg_latency:>19.3f}ms")
print(f"{'Total NFS operations':<40} {total_nfs_ops:>20,}")
print(f"{'Total time in NFS operations':<40} {format_duration(total_nfs_time):>20}")
print(f"{'Operations slower than 10ms':<40} {len(slow_calls):>20}")
print(f"{' - Slow reads':<40} {slow_reads:>20}")
print(f"{' - Slow writes':<40} {slow_writes:>20}")
print(f"{' - Slow metadata ops':<40} {slow_metadata:>20}")
# Performance assessment
print("\n" + "="*80)
print("PERFORMANCE ASSESSMENT")
print("="*80)
if avg_latency < 1:
print("✓ EXCELLENT: Average latency under 1ms - NFS is performing very well")
elif avg_latency < 5:
print("✓ GOOD: Average latency under 5ms - NFS performance is acceptable")
elif avg_latency < 10:
print("⚠ WARNING: Average latency 5-10ms - Some performance degradation")
else:
print("✗ POOR: Average latency over 10ms - Significant NFS performance issues")
if slow_reads > 100 or slow_writes > 100:
print("\n⚠ HIGH SLOW OPERATION COUNT:")
if slow_reads > 100:
print(f" - {slow_reads} slow read operations detected")
if slow_writes > 100:
print(f" - {slow_writes} slow write operations detected")
print(" Consider investigating network latency or storage performance")
def print_slow_operations(slow_calls, limit=10):
"""Print slowest operations with context."""
print("\n" + "="*80)
print(f"TOP {limit} SLOWEST OPERATIONS")
print("="*80)
# Filter out wait4 and similar non-NFS operations
nfs_related = ['open', 'openat', 'close', 'read', 'write', 'pread64',
'pwrite64', 'stat', 'fstat', 'lstat', 'newfstatat',
'fsync', 'fdatasync', 'mkdir', 'unlink', 'rename',
'getdents', 'getdents64']
nfs_slow_calls = [c for c in slow_calls if c['syscall'] in nfs_related]
if nfs_slow_calls:
print("\nNFS-Related Slow Operations:")
print("-" * 80)
print(f"{'Duration':>10} {'Syscall':<12} {'Details'}")
print("-" * 80)
for call in nfs_slow_calls[:limit]:
args_preview = call['args'][:50] + '...' if len(call['args']) > 50 else call['args']
print(f"{format_duration(call['duration']):>10} {call['syscall']:<12} {args_preview}")
# Also show non-NFS slow calls for context
other_slow = [c for c in slow_calls if c['syscall'] not in nfs_related][:5]
if other_slow:
print("\nOther Slow Operations (for context):")
print("-" * 80)
for call in other_slow:
print(f"{format_duration(call['duration']):>10} {call['syscall']:<12}")
def main():
if len(sys.argv) != 2:
print("Usage: python strace_analyzer_improved.py <strace_output_file>")
print("\nGenerate strace output with:")
print(" strace -f -T -tt -o output.txt command")
sys.exit(1)
filename = sys.argv[1]
try:
syscall_stats, slow_calls, error_calls = analyze_strace_file(filename)
# Print comprehensive report
print_performance_summary(syscall_stats)
print_nfs_insights(syscall_stats, slow_calls)
print_slow_operations(slow_calls)
# Print errors if any
if error_calls:
print("\n" + "="*80)
print("FAILED OPERATIONS (First 10)")
print("="*80)
for call in error_calls[:10]:
print(f"{call['syscall']}: {call['args'][:60]}...")
except FileNotFoundError:
print(f"Error: File '{filename}' not found.")
sys.exit(1)
except KeyboardInterrupt:
print("\n\nAnalysis interrupted.")
sys.exit(1)
if __name__ == "__main__":
main()Example Output:
$ python3 ./strace_analyzer.py git_clone_trace.txt
Analyzing: git_clone_trace.txt
File size: 420,631,345 bytes (401.15 MB)
------------------------------------------------------------
Progress: 100.0% | Lines: 4,290,000 | Valid syscalls: 577,251 | Rate: 3.0 MB/s
Analysis complete in 131.55 seconds
Valid syscalls: 577,708 | Parse errors: 1,904,059
================================================================================
NFS OPERATION PERFORMANCE
================================================================================
Operation Performance:
----------------------------------------------------------------------
Category Count Total Time Avg Latency Max Latency
----------------------------------------------------------------------
open/close 6,638 5.207s 784.5µs 24.030ms
read/write 492,432 31.526s 64.0µs 234.538ms
metadata 7,436 3.277s 440.6µs 9.441ms
sync 3 21.175ms 7.058ms 13.304ms
directory 311 536.980ms 1.727ms 6.502ms
================================================================================
PERFORMANCE INDICATORS
================================================================================
Metric Value
------------------------------------------------------------
Average NFS operation latency 0.080ms
Total NFS operations 506,820
Total time in NFS operations 40.568s
Operations slower than 10ms 149
- Slow reads 46
- Slow writes 28
- Slow metadata ops 0
================================================================================
PERFORMANCE ASSESSMENT
================================================================================
✓ EXCELLENT: Average latency under 1ms - NFS is performing very well
================================================================================
BREAKDOWN BY OPERATION TYPE
================================================================================
File Operations:
Total time: 36.735s (68.2% of total)
Operations: 499,238
Average: 73.6µs
Errors: 885 (0.2%)
Slowest: read took 234.538ms
Process/Thread:
Total time: 11.877s (22.1% of total)
Operations: 61,980
Average: 191.6µs
Errors: 553 (0.9%)
Slowest: futex took 28.602ms
Metadata Operations:
Total time: 3.286s (6.1% of total)
Operations: 7,525
Average: 436.6µs
Errors: 3951 (52.5%)
Slowest: fstat took 9.441ms
Directory Operations:
Total time: 479.873ms (0.9% of total)
Operations: 309
Average: 1.553ms
Errors: 2 (0.6%)
Slowest: mkdir took 6.502ms
Path Operations:
Total time: 68.379ms (0.1% of total)
Operations: 32
Average: 2.137ms
Slowest: rename took 3.716ms
Memory Mapping:
Total time: 35.357ms (0.1% of total)
Operations: 291
Average: 121.5µs
Slowest: munmap took 14.876ms
Sync Operations:
Total time: 21.175ms (0.0% of total)
Operations: 3
Average: 7.058ms
Slowest: fsync took 13.304ms
================================================================================
TOP 10 SLOWEST OPERATIONS
================================================================================
NFS-Related Slow Operations:
--------------------------------------------------------------------------------
Duration Syscall Details
--------------------------------------------------------------------------------
11.317ms read 6, "rust Root Certification Authorit"..., 4096
26.012ms read 5, "\343/\345\313\301\267\374<\300\202\34\fw\257D\...
23.879ms read 5, "\313ID/\355<\243w\36\27N)\204\361\373Z\300\301...
20.824ms read 5, "\27\3\3\5k", 5
23.476ms read 5, "\27\3\3\5k", 5
21.087ms read 5, "\333\345\333\203M\247?\326\371 +[\326\341\16Er...
23.844ms read 5, "\27\3\3\5k", 5
20.906ms read 5, "\27\3\3\5k", 5
23.917ms read 5, "\27\3\3\5k", 5
24.854ms read 5, "\348\300\16\363\330F\27\10\262\177\5;\374\213\...
Other Slow Operations (for context):
--------------------------------------------------------------------------------
23.201ms pselect6
11.997ms setsockopt
36.037ms pselect6
35.509ms pselect6
50.067ms pselect6
================================================================================
FAILED OPERATIONS (First 10)
================================================================================
access: "/etc/ld.so.preload", R_OK...
access: "/etc/gitconfig", R_OK...
access: "/home/foobarbaz/.config/git/config", R_OK...
access: "/home/foobarbaz/.gitconfig", R_OK...
access: "/etc/gitconfig", R_OK...
access: "/home/foobarbaz/.config/git/config", R_OK...
access: "/home/foobarbaz/.gitconfig", R_OK...
openat: AT_FDCWD, "/usr/share/locale/en_US.UTF-8/LC_MESSAGES/git.mo"...
openat: AT_FDCWD, "/usr/share/locale/en_US.utf8/LC_MESSAGES/git.mo",...
openat: AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/git.mo", O_RD...
$