ℹ️ Info
TLDR:
If you aren’t sure where to start with NFS readahead, VAST recommends setting it to 4MB rather than the default 128KB. We have found this to be a good starting point. Details on configuring readahead can be found at the end of the article.
NFS Read Ahead / Prefetch / Read Size
NFS readahead predicts and requests blocks from a file before the application issues I/O requests. It is designed to improve client sequential read throughput. Recent kernels, however, changed the default readahead to 128KB, which can severely limit read performance.
We recommend setting the value higher (VAST suggests 4MB as a starting point) for better read performance. These settings are very application-specific, and there isn't a one-size-fits-all solution. If you want to optimize performance for a specific workload, test with different readahead values.
NFS has a few optimizations that improve read performance by anticipating what will be read. NFS has a maximum per-request read size (rsize in the NFS mount options; VAST sets it to 1MB by default) and a prefetch or readahead size. The default readahead varies depending on the kernel and distribution version; we've seen it as low as 128KB and as high as 15MB. High values yield better performance when applications ultimately read most of the prefetched data. Low values work better when the client lacks the required buffering, or when IO is highly random and prefetched data is unlikely to be read.
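To see the rsize actually in effect on a client, you can list each NFS mount's negotiated options with nfsstat -m from nfs-utils (the mount point and export in the sample output are only illustrative):
# show the options each NFS filesystem was mounted with, including rsize/wsize
nfsstat -m
# sample output (illustrative):
# /mnt/scratch from 172.200.203.1:/meta
#  Flags: rw,vers=3,rsize=1048576,wsize=1048576,hard,proto=tcp,...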
Important note: RHEL 8.3 and Ubuntu 20.04 default to a much smaller NFS readahead, which limits performance. More details here https://access.redhat.com/solutions/5953561
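To check what your client is currently using, the effective readahead of every NFS mount can be read from sysfs. A quick sketch (run on the client; mount points containing spaces are not handled):
# print the readahead (in KB) currently applied to each mounted NFS filesystem
for m in $(awk '$3 == "nfs" || $3 == "nfs4" {print $2}' /proc/mounts); do
    echo "$m: $(cat /sys/class/bdi/$(mountpoint -d "$m")/read_ahead_kb) KB"
done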
Prefetch is too Large
A sure sign of unnecessary prefetching is a large gap between the throughput reported by the client application and the throughput reported by VAST. For example, if IOR is being run and it reports a throughput much lower than the VAST-reported throughput, the NFS client is fetching data that the client application (IOR in this example) never needed. Another example might be an application that reads only a few bytes from a file and then moves on to another file.
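One way to observe this is to watch the NFS read traffic on the wire while the application runs and compare it to the throughput the application itself reports, for example with nfsiostat from nfs-utils (the mount point below is just an example):
# report per-mount NFS read/write throughput every 5 seconds
nfsiostat 5 /mnt/scratch
# If the read kB/s shown here is much higher than the application's own
# reported read rate, the client is prefetching data that is never used.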
Remember that reducing the prefetch will help random IO workloads but will hurt sequential IO workloads.
Prefetch is too Small
If you know your application’s IO patterns are sequential, increasing the prefetch value can be highly beneficial. We've found that since newer Linux versions have changed the default prefetch size to 128K, tuning prefetch to a larger value is often very useful.
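If you want to measure the effect directly, one approach is to repeat a sequential read at several readahead sizes and compare throughput. Below is a minimal sketch, assuming a large pre-existing test file at /mnt/scratch/testfile (hypothetical path) and the set-ra.sh script shown in the next section; run it as root so the cache drop and sysfs write succeed:
#!/bin/sh
# Compare sequential-read throughput at several readahead sizes.
for kb in 128 1024 4096 15360; do
    ./set-ra.sh /mnt/scratch "$kb"          # set readahead for this mount point
    echo 3 > /proc/sys/vm/drop_caches       # drop the client page cache between runs
    echo "readahead ${kb}KB:"
    dd if=/mnt/scratch/testfile of=/dev/null bs=1M count=4096 2>&1 | tail -1
done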
Tuning NFS readahead
Readahead can be tuned to optimize for specific workloads. There is a tunable system value that controls the amount of NFS readahead. This is well explained in Tuning NFS Client Read Ahead. That article is for SUSE, but as of this writing, it applies to CentOS and Ubuntu. Here is an example script that supports changing the tunable values and printing the current value. All workloads vary, but we recommend a starting point of 4MB over the default 128KB.
#!/bin/sh
# usage: To display current value: set-ra.sh </mount/point>
# To set a new value: set-ra.sh </mount/point> <new_value>
case $# in
1) cat /sys/class/bdi/$(mountpoint -d "$1")/read_ahead_kb
;;
2) echo "$2" > /sys/class/bdi/$(mountpoint -d "$1")/read_ahead_kb
;;
esac

Changing readahead Defaults
ℹ️ Note
Note: Keep in mind that NFS readahead will only benefit sequential streaming read IO.
Note: We have found that a 4MB readahead size is a good general starting point. There is a balance between reading just enough and prefetching so much that the application may never use it. We settled on 4MB because it optimizes performance for general workloads. If you want to tune for a specific workload, readahead can be changed live and per mount point (using the set-ra.sh script above), with no need to re-mount. We encourage you to experiment with sizes up to 15360KB (15MB). Some latency-sensitive database applications (e.g. kdb) have been found to benefit from read_ahead_kb=16, while other large-file sequential streaming applications benefit from values up to a maximum of 65536 (64MB).
ℹ️ Note
Note: You cannot have two different client mountpoints to the same vip:/export and still have individual readahead tunings for each. But you can mount it to two different VIPs and still control the readahead.
E.g.:
Filesystem 1K-blocks Used Available Use% Mounted on
172.200.203.1:/meta 280854167552 8784133120 272070034432 4% /mnt/scratch_nfs_ra_4m
172.200.203.2:/meta 280854167552 8784133120 272070034432 4% /mnt/scratch_nfs_ra_15m
[vastdata@se-sjc-cb2-c1 linux-4.18]$ sudo nfs_readahead /mnt/scratch_nfs_ra_4m
4096
[vastdata@se-sjc-cb2-c1 linux-4.18]$ sudo nfs_readahead /mnt/scratch_nfs_ra_15m
15360
How to persistently set readahead for NFS mounts using nfs.conf
On RHEL 8.7 and above (nfs-utils 2.6.2 and later), you can use /etc/nfs.conf to set the readahead. See https://man7.org/linux/man-pages/man5/nfsrahead.5.html.
[nfsrahead]
# readahead will be set to 4MB (4096KB) for both NFSv3 and NFSv4 instead of the default 128KB
nfs=4096

How to persistently set readahead for NFS mounts using udev (for RHEL 8.6, Ubuntu 22.04 and older)
# create /etc/udev/rules.d/99-nfs.rules with the following content:
SUBSYSTEM=="bdi", ACTION=="add", PROGRAM="/bin/awk -v bdi=$kernel 'BEGIN{ret=1} {if ($4 == bdi) {ret=0}} END{exit ret}' /proc/fs/nfsfs/volumes", ATTR{read_ahead_kb}="4096"
# apply the udev rule:
udevadm control --reload
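The rule only applies to BDI devices created after it is loaded, so existing NFS mounts keep their old readahead until they are remounted (or changed live with the script above). A quick way to verify on a fresh mount (the mount point is illustrative):
# remount the export (assuming an fstab entry), then confirm the readahead picked up the udev value
umount /mnt/scratch && mount /mnt/scratch
cat /sys/class/bdi/$(mountpoint -d /mnt/scratch)/read_ahead_kb
# expected output: 4096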