Scope
This document provides a practical guide to tuning Linux systems for optimal performance with modern clients, focusing on I/O strategies and virtual memory (VM) configuration. It examines the distinctions between buffer-cached I/O and direct I/O, with a focus on optimizing workloads that utilize buffer-cached I/O. The document also presents key vm tuning parameters and recommended values to enhance I/O efficiency and overall system responsiveness.
This guide extends our NFS Tuning document, building upon core NFS best practices with additional system-level optimizations tailored for modern workloads.
Issue: Linux default settings are no longer optimized for modern clients
Large memory footprint:
Modern client servers often have large memory capacities (e.g., 1.5 TB or more), and may use default cache parameter settings based on a fixed ratio of total memory, resulting in very large cache sizes.
Performance can degrade when the page cache buffers and flushes data to storage inefficiently, introducing unnecessary latency and throttling bandwidth.
Faster networking capabilities:
Linux networking defaults are often optimized for lower-speed interfaces, such as 10 GbE or 25 GbE. However, modern high-end systems commonly use 100/200/400 GbE networks, where default tuning parameters can become a performance bottleneck. Tuning is often required to fully leverage the available bandwidth.
Therefore, manual tuning of vm.* parameters and networking settings is recommended to avoid bottlenecks and ensure optimal performance on modern high-speed systems.
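For example, one common starting point on 100 GbE and faster links is to raise the kernel's socket buffer ceilings. The commands below are a sketch using illustrative 256 MB values — assumptions to validate against your NIC and storage vendors' guidance, not prescriptive settings:
sysctl -w net.core.rmem_max=268435456
sysctl -w net.core.wmem_max=268435456
sysctl -w net.ipv4.tcp_rmem="4096 131072 268435456"
sysctl -w net.ipv4.tcp_wmem="4096 131072 268435456"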
Intro to buffered I/O vs direct I/O on NFS
Buffered I/O (the default on Linux systems)
- All reads/writes go through the Linux page cache
- Data is cached in RAM before being written to the NFS server
- Uses the kernel's vm.dirty_* parameters to control write-back
| Advantage | Disadvantage |
|---|---|
| Better performance for repeated reads (cache hits) | Double caching (client page cache + NFS server cache) |
| Write coalescing reduces small I/O operations | Memory pressure on the client with large datasets |
| Read-ahead improves sequential read performance | Stale data risk if other clients modify files (actimeo tuning) |
| Works well for general-purpose workloads | Unpredictable latency during write-back flushes (especially on large-RAM systems) |
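As a quick illustration (assuming an NFS mount at the placeholder path /mnt/nfs), a buffered write returns as soon as the data lands in the page cache, and the Dirty counter in /proc/meminfo shows how much data is waiting for write-back:
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024
grep -E 'Dirty|Writeback' /proc/meminfo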
Direct I/O (Bypassing Linux Buffer Cache)
- Bypasses the Linux page cache entirely
- Reads/writes go directly to the NFS server
- Enabled via the O_DIRECT flag
| Advantage | Disadvantage |
|---|---|
| No client memory overhead for large files | No read caching (every I/O request requires an NFS request over the wire) |
| No stale cache | Worse small I/O performance |
| Predictable latency (no write-back spikes) | |
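For comparison, GNU dd can request O_DIRECT through its direct flags (again assuming the placeholder mount /mnt/nfs). Every request goes straight over the wire, and the Dirty counter stays flat:
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024 oflag=direct
dd if=/mnt/nfs/testfile of=/dev/null bs=1M iflag=direct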
vm.* parameter tuning
1. Key vm.dirty Parameters Explained
| Parameter | Default (Most Distros) | Description |
|---|---|---|
| vm.dirty_background_ratio | 10% | Percentage of RAM at which the kernel starts background writeback |
| vm.dirty_ratio | 20% | Percentage of RAM at which processes block on writes |
| vm.dirty_expire_centisecs | 3000 (30 s) | How old dirty data must be before it is eligible for writeback |
| vm.dirty_writeback_centisecs | 500 (5 s) | Interval between periodic wake-ups of the flusher threads, in hundredths of a second |
| vm.dirty_background_bytes | 0 (disabled) | Absolute byte limit at which background writeback starts; overrides vm.dirty_background_ratio |
| vm.dirty_bytes | 0 (disabled) | Absolute byte limit at which processes block on writes; overrides vm.dirty_ratio |
Note: Most systems use the ratio-based parameters by default, expressed as a percentage of total memory.
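You can confirm the values currently in effect on a client with sysctl:
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs vm.dirty_background_bytes vm.dirty_bytes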
How does it work?
- Data read from disk or remote storage is stored in RAM and is referred to as pagecache.
- Modified data that has not yet been flushed to storage and still resides in RAM is known as dirty pagecache.
- Moving dirty pagecache to storage is known as flushing or dirty writeback.
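To watch this in practice, the client's dirty pagecache and in-flight writeback can be observed live:
watch -n1 "grep -E 'Dirty|Writeback' /proc/meminfo"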
Flushing is typically triggered by the following conditions:
- Set time: defined by dirty_writeback_centisecs
- Background (size): defined by dirty_background_bytes
- Active: rate of change, governed by vm.dirty_bytes and dirty_background_bytes

The kernel periodically flushes dirty pages to disk via:
- Background writes (starting at the dirty_background_* thresholds)
- Blocking writes (when dirty_ratio is hit)
Two tuning approaches:
- Ratio-based (% of total RAM)
- Byte-based (absolute values)
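Note that the two approaches are mutually exclusive: when one of the *_bytes parameters is written, the kernel zeroes its *_ratio counterpart (and vice versa), so reading the values back shows which mode is active:
sysctl -w vm.dirty_bytes=629145600
sysctl vm.dirty_ratio
After the first command, vm.dirty_ratio reads back as 0.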
Modern clients and servers have large RAM, which can cause low NFS performance
As shown in the table above, most Linux distributions set the default limit for dirty pages to 20% of total RAM. For example, on a system with 1.5 TB of memory, this allows up to 300 GB of dirty data to be cached in RAM. While this can boost performance by delaying writes, the data must eventually be flushed to disk. This flushing can introduce noticeable latency and create I/O bottlenecks, particularly under heavy workloads.
At first glance, reducing the page cache may seem counterproductive. However, tuning the vm.dirty_bytes and vm.dirty_background_bytes parameters provides precise control over how much data is cached before being flushed. This helps manage write bursts more effectively and reduces the risk of I/O stalls during flushes to the NFS server.
For example, testing vm.dirty_background_bytes values between 300 MB and 600 MB can help strike the right balance between caching efficiency and flush responsiveness:
sysctl -w vm.dirty_background_bytes=314572800
sysctl -w vm.dirty_bytes=629145600
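Settings applied with sysctl -w do not persist across reboots. To make them permanent, place them in a file under /etc/sysctl.d/ (the file name below is arbitrary) and reload:
cat <<'EOF' > /etc/sysctl.d/90-nfs-tuning.conf
vm.dirty_background_bytes = 314572800
vm.dirty_bytes = 629145600
EOF
sysctl --system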