Scope
This document provides a practical guide to tuning Linux systems for optimal performance with modern clients, focusing on I/O strategies and virtual memory (VM) configuration. It examines the distinctions between buffer-cached I/O and direct I/O, with a focus on optimizing workloads that utilize buffer-cached I/O. The document also presents key vm tuning parameters and recommended values to enhance I/O efficiency and overall system responsiveness.
This guide extends our NFS Tuning document, building upon core NFS best practices with additional system-level optimizations tailored for modern workloads.
Issue: Linux default settings are no longer optimized for modern clients
Large memory footprint:
Modern client servers often have large memory capacities (e.g., 1.5 TB or more), and may use default cache parameter settings based on a fixed ratio of total memory, resulting in very large cache sizes.
Performance can degrade when the page cache buffers and flushes data to storage inefficiently, introducing unnecessary latency and throttling bandwidth.
Faster networking capabilities:
Linux networking defaults are often optimized for lower-speed interfaces, such as 10 GbE or 25 GbE. However, modern high-end systems commonly use 100/200/400 GbE networks, where default tuning parameters can become a performance bottleneck. Tuning is often required to fully leverage the available bandwidth.
Therefore, manual tuning of vm.* parameters and networking settings is recommended to avoid bottlenecks and ensure optimal performance on modern high-speed systems.
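For example, one common starting point on 100 GbE and faster links is to raise the kernel's socket buffer ceilings. The commands below are a sketch using illustrative 256 MB values — assumptions to validate against your NIC and storage vendors' guidance, not prescriptive settings:
sysctl -w net.core.rmem_max=268435456
sysctl -w net.core.wmem_max=268435456
sysctl -w net.ipv4.tcp_rmem="4096 131072 268435456"
sysctl -w net.ipv4.tcp_wmem="4096 131072 268435456"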
Intro to buffered I/O vs direct I/O on NFS
Buffered I/O (the default on Linux systems)
- All reads/writes go through the Linux page cache
- Data is cached in RAM before being written to the NFS server
- Uses the kernel's vm.dirty_* parameters to control write-back
| Advantage | Disadvantage |
|---|---|
| Better performance for repeated reads (cache hits) | Double caching (client page cache + NFS server cache) |
| Write coalescing reduces small I/O operations | Memory pressure on the client with large datasets |
| Read-ahead improves sequential read performance | Stale data risk if other clients modify files (actimeo tuning) |
| Works well for general-purpose workloads | Unpredictable latency during write-back flushes (especially on large-RAM systems) |
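As a quick illustration (assuming an NFS mount at the placeholder path /mnt/nfs), a buffered write returns as soon as the data lands in the page cache, and the Dirty counter in /proc/meminfo shows how much data is waiting for write-back:
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024
grep -E 'Dirty|Writeback' /proc/meminfo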
Direct I/O (Bypassing Linux Buffer Cache)
- Bypasses the Linux page cache entirely
- Reads/writes go directly to the NFS server
- Enabled via the O_DIRECT flag
| Advantage | Disadvantage |
|---|---|
| No client memory overhead for large files | No read caching (every I/O request requires an NFS request over the wire) |
| No stale cache | Worse small I/O performance |
| Predictable latency (no write-back spikes) | |
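For comparison, GNU dd can request O_DIRECT through its direct flags (again assuming the placeholder mount /mnt/nfs). Every request goes straight over the wire, and the Dirty counter stays flat:
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 oflag=direct
dd if=/mnt/nfs/testfile of=/dev/null bs=1M iflag=direct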
vm.* parameter tuning
1. Key vm.dirty Parameters Explained
| Parameter | Default (Most Distros) | Description |
|---|---|---|
| vm.dirty_background_ratio | 10% | Percentage of RAM at which the kernel starts background writeback |
| vm.dirty_ratio | 20% | Percentage of RAM at which processes block on writes |
| vm.dirty_expire_centisecs | 3000 (30 s) | How old dirty data must be before it is eligible for writeback |
| vm.dirty_writeback_centisecs | 500 (5 s) | Interval between periodic wake-ups of the flusher threads, in hundredths of a second |
| vm.dirty_background_bytes | 0 (disabled) | Absolute byte limit at which background writeback starts; overrides vm.dirty_background_ratio |
| vm.dirty_bytes | 0 (disabled) | Absolute byte limit at which processes block on writes; overrides vm.dirty_ratio |
Note: Most systems use the ratio-based parameters by default, expressed as a percentage of total memory.
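You can confirm the values currently in effect on a client with sysctl:
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs vm.dirty_background_bytes vm.dirty_bytes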
How does it work?
- Data read from disk or remote storage is stored in RAM and is referred to as pagecache.
- Modified data that has not yet been flushed to storage and still resides in RAM is known as dirty pagecache.
- Moving dirty pagecache to storage is known as flushing or dirty writeback.
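To watch this in practice, the client's dirty pagecache and in-flight writeback can be observed live:
watch -n1 "grep -E 'Dirty|Writeback' /proc/meminfo"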
Flushing is typically triggered by the following conditions:
- Set time: defined by dirty_writeback_centisecs
- Background (size): defined by dirty_background_bytes
- Active: rate of change, governed by vm.dirty_bytes and dirty_background_bytes

The kernel periodically flushes dirty pages to disk via:
- Background writes (starting at the dirty_background_* thresholds)
- Blocking writes (when dirty_ratio is hit)
Two tuning approaches:
- Ratio-based (% of total RAM)
- Byte-based (absolute values)
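Note that the two approaches are mutually exclusive: when one of the *_bytes parameters is written, the kernel zeroes its *_ratio counterpart (and vice versa), so reading the values back shows which mode is active:
sysctl -w vm.dirty_bytes=629145600
sysctl vm.dirty_ratio
After the first command, vm.dirty_ratio reads back as 0.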
Modern clients and servers have large RAM, which can cause low NFS performance
As shown in the table above, most Linux distributions set the default limit for dirty pages to 20% of total RAM. For example, on a system with 1.5 TB of memory, this allows up to 300 GB of dirty data to be cached in RAM. While this can boost performance by delaying writes, the data must eventually be flushed to disk. This flushing can introduce noticeable latency and create I/O bottlenecks, particularly under heavy workloads.
At first glance, reducing the page cache may seem counterproductive. However, tuning the vm.dirty_bytes and vm.dirty_background_bytes parameters provides precise control over how much data is cached before being flushed. This helps manage write bursts more effectively and reduces the risk of I/O stalls during flushes to the NFS server.
For example, testing vm.dirty_background_bytes values between 300 MB and 600 MB can help strike the right balance between caching efficiency and flush responsiveness:
sysctl -w vm.dirty_background_bytes=314572800
sysctl -w vm.dirty_bytes=629145600
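Settings applied with sysctl -w do not persist across reboots. To make them permanent, place them in a file under /etc/sysctl.d/ (the file name below is arbitrary) and reload:
cat <<'EOF' > /etc/sysctl.d/90-nfs-tuning.conf
vm.dirty_background_bytes = 314572800
vm.dirty_bytes = 629145600
EOF
sysctl --system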