Linux Optimization for Modern Clients


Scope

This document provides a practical guide to tuning Linux systems for optimal performance with modern clients, focusing on I/O strategies and virtual memory (VM) configuration. It examines the distinction between buffered (page-cache) I/O and direct I/O, with emphasis on optimizing workloads that use buffered I/O. The document also presents key vm.* tuning parameters and recommended values to improve I/O efficiency and overall system responsiveness.

This guide extends our NFS Tuning document, building upon core NFS best practices with additional system-level optimizations tailored for modern workloads.

Issue: Linux default settings are no longer optimized for modern clients

  • Large memory footprint:

    • Modern client servers often have large memory capacities (e.g., 1.5 TB or more), and may use default cache parameter settings based on a fixed ratio of total memory, resulting in very large cache sizes.

    • Low performance may occur when the page cache system buffers and flushes data to storage inefficiently, potentially causing unnecessary latency and bandwidth throttling.

  • Faster networking capabilities:

    • Linux networking defaults are often optimized for lower-speed interfaces, such as 10 GbE or 25 GbE. However, modern high-end systems commonly use 100/200/400 GbE networks, where default tuning parameters can become a performance bottleneck. Tuning is often required to fully leverage the available bandwidth.

Therefore, manual tuning of vm.* parameters and networking settings is recommended to avoid bottlenecks and ensure optimal performance on modern high-speed systems.
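
Before changing anything, it is worth reviewing the current defaults on the client. The sketch below uses only standard utilities and makes no changes:

# Total and available memory on the client
free -h

# Current dirty-page writeback defaults (explained in the sections below)
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs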

Intro to buffered I/O vs direct I/O on NFS

Buffered I/O (Default on Linux)

  • All reads/writes go through the Linux page cache

  • Data is cached in RAM before being written to the NFS server

  • Uses the kernel's vm.dirty_* parameters to control write-back (a short example follows the table below)

Advantage | Disadvantage
Better performance for repeated reads (cache hits) | Double caching (client page cache + NFS server cache)
Write coalescing reduces small I/O operations | Memory pressure on the client with large datasets
Read-ahead improves sequential read performance | Stale data risk if other clients modify files (actimeo tuning)
Works well for general-purpose workloads | Unpredictable latency during write-back flushes (especially on systems with large RAM)
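
A quick way to observe buffered I/O is to write a file through the page cache and watch the dirty counters. This is an illustrative sketch only; /mnt/nfs/testfile is a placeholder path and the sizes are arbitrary:

# Buffered write: data lands in the client page cache before reaching the NFS server
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=4096

# Dirty and Writeback show how much modified data is still held in client RAM
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Force an immediate flush of dirty pages to the server
sync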

Direct I/O (Bypassing Linux Buffer Cache)

  • Bypasses the Linux page cache entirely

  • Reads/writes go directly to the NFS server

  • Enabled via the O_DIRECT flag (a dd example follows the table below)

Advantage | Disadvantage
No client memory overhead for large files | No read caching (every I/O request requires an NFS request over the wire)
No stale cache | Worse small I/O performance
Predictable latency (no write-back spikes) |
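
For comparison, dd can request O_DIRECT with its oflag=direct and iflag=direct options, so the same transfer bypasses the client page cache entirely. The mount point and sizes below are placeholders:

# Direct write: each block goes straight to the NFS server, no client caching
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=4096 oflag=direct

# Direct read: every request goes over the wire, nothing is served from local RAM
dd if=/mnt/nfs/testfile of=/dev/null bs=1M iflag=direct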

vm.* parameter tuning

1. Key vm.dirty Parameters Explained

Parameter | Default (Most Distros) | Description
vm.dirty_background_ratio | 10% | Percentage of RAM at which the kernel starts background writeback
vm.dirty_ratio | 20% | Percentage of RAM at which writing processes block until data is flushed
vm.dirty_expire_centisecs | 3000 (30 s) | How old dirty data must be before it is eligible for writeback
vm.dirty_writeback_centisecs | 500 (5 s) | Interval between periodic wake-ups of the flusher threads, in hundredths of a second
vm.dirty_bytes | 0 (disabled) | Absolute dirty-data limit in bytes; overrides vm.dirty_ratio when set
vm.dirty_background_bytes | 0 (disabled) | Absolute amount of dirty data, in bytes, at which background writeback starts; overrides vm.dirty_background_ratio when set

Note: Most systems use the ratio-based settings by default (a percentage of total memory); the byte-based variants are set to 0 (disabled).
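
The ratio-based and byte-based variants are mutually exclusive: writing one of them causes the kernel to report the other as 0. A quick check of which form is currently in effect:

# Whichever member of each pair is non-zero is the one the kernel is using
sysctl vm.dirty_ratio vm.dirty_bytes
sysctl vm.dirty_background_ratio vm.dirty_background_bytes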

How does it work?

  • Data read from disk or remote storage is stored in RAM and is referred to as pagecache.

  • When data is modified, not yet flushed to storage, and still remains in RAM, it is known as dirty pagecache.

  • Moving modified dirty pagecache to storage is known as flushing or dirty writeback.

Flushing is typically triggered by the following conditions (a monitoring example follows this section):

  • Periodic: the flusher threads wake up at the interval defined by vm.dirty_writeback_centisecs and write out dirty pages older than vm.dirty_expire_centisecs.

  • Background: once the amount of dirty data exceeds vm.dirty_background_bytes (or vm.dirty_background_ratio), asynchronous background writeback begins.

  • Foreground: once dirty data exceeds vm.dirty_bytes (or vm.dirty_ratio), processes performing writes are blocked until enough data has been flushed.

  1. The kernel periodically flushes dirty pages to storage via:

    • Background writes (start at dirty_background_*)

    • Blocking writes (when dirty_ratio is hit)

  2. Two tuning approaches:

    • Ratio-based (% of total RAM)

    • Byte-based (absolute values)
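
To watch these mechanisms in practice, monitor the dirty and writeback counters while a workload runs. watch and /proc/meminfo are standard; the one-second interval is arbitrary:

# Dirty = modified pages waiting to be flushed; Writeback = pages being written out now
watch -n 1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"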

Modern clients and servers have large RAM, which can cause low NFS performance

As shown in the table above, most Linux distributions set the default limit for dirty pages to 20% of total RAM. For example, on a system with 1.5 TB of memory, this allows up to 300 GB of dirty data to be cached in RAM. While this can boost performance by delaying writes, the data must eventually be flushed to disk. This flushing can introduce noticeable latency and create I/O bottlenecks, particularly under heavy workloads.
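
To estimate what the default ratio means on a given host, the limit can be approximated from MemTotal and vm.dirty_ratio. This is a rough sketch; the kernel actually computes the threshold from dirtyable memory, so treat the result as an estimate:

# Approximate dirty-data ceiling implied by the current vm.dirty_ratio
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
ratio=$(sysctl -n vm.dirty_ratio)
echo "Dirty limit is roughly $(( mem_kb * ratio / 100 / 1024 / 1024 )) GiB"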

At first glance, reducing the page cache may seem counterproductive.

  • However, tuning the vm.dirty_bytes and vm.dirty_background_bytes parameters provides precise control over how much data is cached before being flushed.

  • This helps manage write bursts more effectively and reduces the risk of I/O stalls during flushes to the NFS server.

  • For example, testing vm.dirty_background_bytes values between 300 MB and 600 MB can help strike the right balance between caching efficiency and flush responsiveness.

sysctl -w vm.dirty_background_bytes=314572800   # 300 MiB: start background writeback
sysctl -w vm.dirty_bytes=629145600              # 600 MiB: block writers above this amount
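
Values set with sysctl -w are lost at reboot. To make them persistent, the usual approach is a drop-in file under /etc/sysctl.d/; the file name below is only an example:

cat <<'EOF' > /etc/sysctl.d/90-nfs-dirty.conf
# Start background writeback at roughly 300 MiB of dirty data
vm.dirty_background_bytes = 314572800
# Block writers once roughly 600 MiB of dirty data has accumulated
vm.dirty_bytes = 629145600
EOF

# Apply all sysctl configuration files immediately
sysctl --system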