How To Guide for Benchmarking VAST Data


This document will guide you through best practices for benchmarking VAST Data. It is meant as a starting point for understanding how to drive load against a VAST Data cluster and how to interpret the results. While this document isn’t exhaustive on the ways to benchmark a VAST Data cluster or all the things to consider, it’s a helpful place to start.

Quick Start (tl;dr)

For the impatient wanting to drive the most load they can on a single client:

  1. Download & extract elbencho from GitHub.

  2. Mount each cluster VIP to its own directory on the client.

    # update the IPs and the seq to match your VIPs -- this example uses 8 VIPs
    for i in $(seq 1 8); do
        sudo mkdir -p /mnt/172.200.201.$i
        sudo mount -t nfs -o vers=3,nconnect=16 172.200.201.$i:/ /mnt/172.200.201.$i
        mkdir -p /mnt/172.200.201.$i/elbencho-files
    done

    If the mount fails, your client may not support nconnect, which will limit your ability to drive load from that client.

  3. Run elbencho to write and then read files across all mounts using all client CPUs.

    # update the IPs to match your VIPs -- this example uses 8 VIPs
    elbencho --write --read --size 20G --direct --threads $(nproc) --iodepth 16 \
        "/mnt/172.200.201.[1-8]/elbencho-files/file[1-16]"

    The write speed will depend on your cluster size and VAST Data version. The read speed should saturate a 100Gbps NIC.

  4. Delete the files and unmount.

    # update the IP to match your VIP
    rm /mnt/172.200.201.1/elbencho-files/file*
    rmdir /mnt/172.200.201.1/elbencho-files
    
    # update the IPs and the seq to match your VIPs
    for i in $(seq 1 8); do
        sudo umount /mnt/172.200.201.$i
    done

The context for running & understanding benchmarks

You can download a tool, run it, and get numbers – but what do the numbers mean?

This section provides some background on the factors influencing your benchmarking efforts. How you set up your benchmark is key to understanding why you got the results that you did.

VAST Data Architecture

We're not going to cover the full VAST Data architecture; see The VAST Data Platform whitepaper for the full details. However, we need to cover some key aspects that impact most benchmarking.

CNodes & VIPs

Data enters a VAST Data cluster through the CNodes – whether via NFS, S3, SMB, etc. Each CNode has one or more virtual IPs (VIPs). The VIPs can be moved between CNodes if one goes offline, such as due to an upgrade or failure. When doing benchmarking, we need to be aware of:

  • How many CNodes are in the cluster?

  • What are the VIPs of those CNodes? (This is controlled by the VIP Pool.)

Clusters provide a DNS server that clients can use to connect to a CNode, but any given lookup of that DNS name resolves to a single VIP on a single CNode.

ℹ️ Info

When benchmarking a VAST Data cluster, it’s important to understand how the test clients are connected to the CNodes. If only half of the CNodes are used, you’ll get half the expected cluster performance at most. If your mount points use the VAST DNS, your clients may not be evenly distributed across all CNodes – indeed, some CNodes might not be mounted!
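
If you’re unsure which CNode a client is actually talking to, a quick sanity check is to resolve the DNS name and inspect the existing mounts. This is only a sketch; vast.example.com is a hypothetical name, so substitute your cluster’s VIP pool DNS name and mount points.

# hypothetical DNS name -- each lookup may return a different VIP from the pool
dig +short vast.example.com

# show existing NFS mounts and the server IP (VIP) each one is using
mount -t nfs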

Similarity and Encryption

VAST Data clusters provide robust data reduction through Similarity. This data reduction is done in the write path as data is migrated from Storage Class Memory (SCM) to flash. Like many data reduction methods, Similarity may impact benchmarking because many performance tools write unrealistic, highly compressible data. This can hurt read and write performance due to extreme hot spots – imagine reading a 100GB file full of zeros that was reduced to a single block on a single SSD. We recommend writing incompressible data or disabling Similarity for most benchmarking exercises.

VAST Data clusters support optional encryption at rest. Encryption has a small but measurable performance impact. Keep this in mind when comparing performance results between different VAST Data clusters.

SCM, SSDs, and Performance

When writes are made to a VAST Data cluster, they are written to SCM. When enough data is written (somewhere around 40% of the system’s SCM capacity), an asynchronous process begins migrating that data to SSDs. Because the data is already on persistent, stable storage in the SCM, it is not proactively migrated. From an end-user perspective, this is largely irrelevant: the data can be read and written to the SCM just as it can to the SSDs. On a brand-new cluster, the SCM is empty, and the initial write performance will be faster than expected until the async migration processes kick in during steady state.

In typical usage, the cluster regularly gets data written to it, and older data is migrated to SSDs. But in benchmarking, it is not uncommon to write a relatively small amount of data and read it back. If the data has not been migrated down to the SSDs, those reads come from the SCM, and read performance is thus limited by the number of SCM devices in the system (few) rather than the number of SSDs (many). Therefore, to get peak read performance from a cluster, you need to write enough data that most of it is being read from the SSDs. Alternatively, write the initial data, follow it with a second large write to cause the first set to be migrated, and then do the reads against the initial data that is now on the SSDs.

While the amount of SCM varies per cluster, the current rule of thumb is to write 4TB per DBox to ensure the data has been migrated from SCM to the SSDs before doing reads against that data.
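
As a sketch of the two-pass approach described above (the mount point, file counts, and sizes are illustrative; size the filler pass to roughly 4TB per DBox):

mkdir -p /mnt/vast/elbencho-files /mnt/vast/filler-files

# 1) write the files you intend to read back later
elbencho --write --size 100G --direct --threads $(nproc) --iodepth 16 \
    "/mnt/vast/elbencho-files/file[1-16]"

# 2) write filler data (~4TB per DBox in this example) so the first set is
#    migrated from SCM down to the SSDs
elbencho --write --size 500G --direct --threads $(nproc) --iodepth 16 \
    "/mnt/vast/filler-files/file[1-8]"

# 3) read the original files -- these reads should now be served from the SSDs
elbencho --read --size 100G --direct --threads $(nproc) --iodepth 16 \
    "/mnt/vast/elbencho-files/file[1-16]"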

Data Flow & Analytics UIs

When performing tests against a VAST Data cluster, the Data Flow and Analytics UIs are incredibly helpful in ensuring that data goes where you expect it.

  • The Data Flow UI is useful for validating that data is being written to or read from all the expected CNodes.

  • The Analytics UI is useful for seeing the overall cluster performance as well as a breakdown by CNode.

Client & Protocol Considerations

Caching

One of the biggest client-side challenges in benchmarking is caching. Operating systems use free client memory as both a write-back cache and a read cache. If the benchmarking workload is small enough and the client RAM is large enough, the test might be operating entirely against the cache, making the test results misleading and invalid.

Writes to a VAST Data cluster are always stable; that is, when a write is acknowledged back to the client, the data written is on stable storage. The VAST Data architecture does not use an unstable, memory-backed write buffer. We use this to our advantage during benchmarking by having the benchmarking tool always do direct (e.g., O_DIRECT) reads and writes, thereby bypassing the client cache altogether.

Using --direct (elbencho) or --direct=1 (fio) will remove the client cache as a factor for almost all read & write tests, and is recommended in all tests.
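
If you do need to run buffered (non-direct) tests for comparison, a common approach is to flush the client page cache between runs. This is a general Linux technique, not a VAST-specific requirement:

# flush dirty pages, then drop the page cache, dentries, and inodes on the client
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches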

NFSv3 vs NFSv4

VAST supports both NFS v3 and v4.1. Because of the stateful nature of v4.1, it is slower than v3. You should test with whatever protocol version you intend to use in production, but don’t assume v4 is faster just because it’s newer. If you do not require features in v4.1, you will get better performance using v3.

VAST does not support pNFS in NFS v4.1.
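
For comparison testing between protocol versions, the only change is the vers= mount option. A minimal sketch (the VIP and mount point are illustrative; unmount between tests):

# NFSv3 mount (generally faster on VAST)
sudo mount -t nfs -o vers=3,nconnect=16 172.200.201.1:/ /mnt/vast

# NFSv4.1 mount for comparison
sudo mount -t nfs -o vers=4.1,nconnect=16 172.200.201.1:/ /mnt/vast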

nconnect & VAST NFS Client

By default, traffic for an NFS mount point transits over a single TCP connection from the client to the server. This single TCP connection has an inherent performance limit of around 2.5 GB/s on a single 100 Gbps NIC.

In modern Linux kernels, the number of TCP connections per mount point can be increased by setting the nconnect option on the mount, up to 16. This can improve speeds on that mount point to up to 11 GB/s on a single 100 Gbps NIC.

ℹ️ Info

nconnect is a per-IP setting. If the same IP is mounted twice, both mounts will use the nconnect value specified by whichever was mounted first. Use mount -t nfs to check the nconnect value for a mount point; if it is not shown in the options, the mount is using an implicit nconnect=1.

In both cases, the Linux client is still connected to a single CNode, because the mount point is going to a single IP address.
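
To confirm what a given mount is actually using, list the NFS mounts and look for nconnect in the options; if it isn’t shown, the mount is using a single TCP connection:

# show mounted NFS filesystems and their options (look for nconnect=)
mount -t nfs

# nfsstat -m also lists each NFS mount with its effective options
nfsstat -m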

The VAST NFS client is a drop-in replacement for the Linux NFS client. This client adds additional NFS mount options to allow a single NFS mount to connect to multiple CNodes (remoteports) and spread traffic across all of them. In addition, the VAST NFS client supports nconnect values up to 64, and because it can be installed on older Linux kernels, it enables nconnect on kernels that otherwise lack it.

The VAST NFS client also allows traffic for a single mount point to use multiple client NICs with localports.

ℹ️ Note

Depending on your client hardware and workload, you may get better performance with the VAST NFS client. However, if your Linux kernel supports nconnect, we generally recommend doing an initial benchmarking pass with the stock client, as it’s easy and works out of the box.

TCP vs RDMA

VAST Data clusters support NFS over TCP and also over RDMA (RoCE v2). While RDMA will give you the highest performance with the lowest client CPU usage, you can achieve very high performance over TCP.

It’s generally recommended to do initial benchmarks over TCP before wading into RDMA, as using RDMA requires special network settings to be properly configured.

ℹ️ Info

To successfully use RDMA, your entire network must be correctly configured to be lossless. Failing to do so will result in unexpected mount behavior and application hangs.
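
If your network is configured for RoCE v2, an NFS-over-RDMA mount on a standard Linux client typically looks like the following sketch. The VIP and mount point are illustrative, and 20049 is the conventional NFS/RDMA port; confirm the exact options for your environment.

# NFS over RDMA (RoCE v2); requires a lossless network and RDMA-capable NICs
sudo mount -t nfs -o vers=3,proto=rdma,port=20049 172.200.201.1:/ /mnt/vast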

Benchmarking tools

There are many filesystem and protocol benchmarking tools out there. We specifically want to call out some that we recommend and some that we discourage.

The most critical part of a good benchmarking tool for a VAST Data cluster is concurrency. Tools need to be both multi-threaded and able to drive enough concurrency – usually with async I/O – to not be the limiting factor.

ℹ️ Info

The most common problem we see during benchmarking exercises is not using enough concurrency, either because the tool does not support it or because the flags to enable high concurrency weren’t used.

Recommended tools

These are recommended benchmarking tools as they’re able to drive enough concurrency if configured correctly and can produce incompressible data to bypass similarity. This is not an exhaustive list.

elbencho

elbencho is a good all-around benchmarking tool. It can drive high concurrency, generate large and small files, run metadata-only workloads, work over NFS, SMB, and S3, and run in a client/server configuration to drive load concurrently across multiple clients. elbencho generates incompressible data by default.

elbencho supports multiple threads via --threads, and the depth of asynchronous IO can be controlled with --iodepth.

ℹ️ Info

Recommended elbencho flags:
--threads=x --iodepth=y --direct

ℹ️ Note

Pro-tip: Use the --dryrun flag in elbencho to have it give you a summary of what it's going to do before it does it! This is great for confirming it will generate the number and size of files that you expect!
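
For example, to preview a run before executing it (the path is illustrative):

# --dryrun summarizes what elbencho would do (file counts, sizes, and paths)
# without performing any I/O
elbencho --dryrun --write --size 20G --threads $(nproc) --iodepth 16 \
    "/mnt/vast/elbencho-files/file[1-16]"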

fio

fio is another common and capable benchmarking tool if configured correctly. It can drive high concurrency for large and small files, work over NFS and SMB, and run in a client/server configuration. fio generates highly compressible data by default, but it has options to generate incompressible data with --refill_buffers and --randrepeat=0.

fio supports multiple threads via --numjobs and asynchronous IO with --ioengine=libaio and --iodepth.

See also the fio documentation.

ℹ️ Info

Recommended fio flags:
--refill_buffers --randrepeat=0 --numjobs=x --ioengine=libaio --iodepth=y --direct=1
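
Putting those flags together, here is a sketch of a large-file fio run roughly equivalent to the elbencho examples in this guide. The job name, path, and sizes are illustrative.

mkdir -p /mnt/vast/fio-files

# write 16 x 20G files with incompressible data and direct I/O
fio --name=vastfile --directory=/mnt/vast/fio-files --rw=write \
    --bs=1m --size=20G --numjobs=16 --ioengine=libaio --iodepth=16 \
    --direct=1 --refill_buffers --randrepeat=0 --group_reporting

# read the same files back (same job name and directory, so fio reuses the files)
fio --name=vastfile --directory=/mnt/vast/fio-files --rw=read \
    --bs=1m --size=20G --numjobs=16 --ioengine=libaio --iodepth=16 \
    --direct=1 --refill_buffers --randrepeat=0 --group_reporting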

Discouraged tools

The following are poor benchmarking tools because they’re single-threaded and can’t drive sufficient concurrency. How these perform is a valid test case, but they aren’t good benchmarks.

  • dd

  • cp / mv / rm

  • rsync - see msrsync for a multi-threaded version; rclone may also be an alternative

  • tar – see Oracle’s partar for a multi-threaded tar

  • git - see the parallel checkout config

The following tools can easily produce misleading results.

  • ior added the ability to write mostly incompressible data with -l random, but the implementation means the data is still somewhat compressible. If not used correctly, stat and read tests might be served from the client-side cache rather than the cluster.

  • mdtest is similar to ior: if not used correctly, stat and read tests might be served from the client-side cache rather than the cluster.

  • iozone by default writes very compressible data (1000:1 DDR), which will generate very skewed results. If you must use iozone:

    • Use -I to do direct reads & writes to bypass the client-side cache.

    • Run version 3.489 or higher and use the following flags to reduce – but not eliminate – the dedup/compression: -+a 0 -+w 0 -+y 0 -+C 0

Mixed workloads

Tools like elbencho and fio offer the ability to run “mixed workloads” that are composed of both reads and writes. These are usually specified by percentages, for example:

  • elbencho’s --rwmixpct=50 will do a 50/50 read/write workload

  • fio’s --readwrite=randrw (or --rw=randrw) does a 50/50 read/write workload by default; the percentages can be changed with --rwmixread or --rwmixwrite

To satisfy the requested read/write percentages, the slower of the two IO types becomes the limiting factor. On a VAST Data cluster, writes are almost always the limiting factor, and so they will govern the maximum read performance reported by the tool. While this may seem self-evident, it’s not uncommon for people to be confused about why “reads are so slow” in mixed workloads when they are very fast in isolation – they’re being throttled by the tool to satisfy the specified mix.
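
For example, a sketch of a 50/50 mixed run with elbencho using the flag described above (the path is illustrative; exact semantics may vary by elbencho version):

# a write phase in which roughly half the block operations are reads
elbencho --write --rwmixpct=50 --size 20G --direct --threads $(nproc) --iodepth 16 \
    "/mnt/vast/elbencho-files/file[1-16]"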

Benchmarking Examples

Here are some paved roads for running benchmarks against VAST. These examples use elbencho, but fio could be used in many cases for a similar effect.

Similarly, clush (cluster shell) is used in these examples to manage multiple load-generating clients, but they could be managed individually or using another tool.

Peak Client Numbers

To drive as much throughput as possible from a single client to a cluster, you need to either:

  1. Mount each CNode’s VIP to a separate NFS mount point and drive traffic across all mounts (as in the Quick Start example above), ideally using nconnect=16 or higher.

  2. Mount the cluster with the VAST NFS client using remoteports to connect to all CNodes and write to several files, eg:

    sudo mkdir -p /mnt/vast
    sudo mount -t nfs \
        -o vers=3,remoteports=172.200.201.1-172.200.201.8,nconnect=32 \
        172.200.201.1:/ /mnt/vast
    
    mkdir -p /mnt/vast/elbencho-files
    
    elbencho --write --read --size 20G --direct --threads $(nproc) --iodepth 16 \
        "/mnt/vast/elbencho-files/file[1-256]"
        
    sudo umount /mnt/vast

    We have to use multiple files because traffic for a single file will only be sent to a single CNode. By using 256 files we’re ensuring there are enough files to get traffic to all CNodes.

  3. Mount the cluster with the VAST NFS client using remoteports to connect to all CNodes and use spread_reads and spread_writes to spread data for a single file across all CNodes:

    sudo mkdir -p /mnt/vast
    sudo mount -t nfs \
        -o vers=3,remoteports=172.200.201.1-172.200.201.8,nconnect=32,spread_reads,spread_writes \
        172.200.201.1:/ /mnt/vast
    
    mkdir /mnt/vast/elbencho-files
    
    elbencho --write --read --size 200G --direct --threads $(nproc) --iodepth 16 \
        /mnt/vast/elbencho-files/file
        
    sudo umount /mnt/vast

    Note that the file size is bigger here because there’s only one file.

ℹ️ Info

If you need the most performance over a single mount point where the number of files might be very small – such as a DGX – vers=3,remoteports=x-y,nconnect=32,spread_writes,spread_reads will give you the best overall performance.

ℹ️ Info

spread_reads and spread_writes are not effective with NFSv4 – the mount will succeed but they are silently ignored.

Peak Cluster Numbers

To generate peak cluster numbers, you’re going to need multiple clients talking to all the CNodes to generate enough traffic. Both elbencho and fio provide a client/server mode where the server runs on every traffic-generating client and is controlled by a single elbencho/fio process when given a list of those clients.

To make sure you are driving load across all CNodes evenly, you can either:

  1. Have a 1:1 relationship between clients and VIPs, where each client is driving load to a single VIP.

  2. Have all clients mount all VIPs and drive load across all of them.

  3. Have clients mount the cluster with the VAST NFS driver and use remoteports.

This example uses the third option as it’s the simplest.

First start elbencho on all of our clients:

clush -a elbencho --service

Now create a hosts.txt file with the set of client hostnames, one per line:

clush -a hostname | cut -d: -f1 > hosts.txt

Now we can mount the cluster with remoteports and generate load:

clush -a sudo mkdir -p /mnt/vast
clush -a sudo mount -t nfs \
    -o vers=3,remoteports=172.200.201.1-172.200.201.8,nconnect=32,spread_reads,spread_writes \
    172.200.201.1:/ /mnt/vast

mkdir /mnt/vast/elbencho-files

elbencho --hostsfile hosts.txt \
    --write --read --size 20G --direct --threads $(nproc) --iodepth 16 \
    "/mnt/vast/elbencho-files/file[1-16]"

clush -a sudo umount /mnt/vast

elbencho will report the aggregate performance across all clients.

We can use elbencho to stop the service that is running on all the clients:

elbencho --hostsfile hosts.txt --quit

Small Block Performance

By default, elbencho will use a 1MiB block size (fio’s default block size is only 4k!). We can specify a smaller block size to simulate other workloads:

# 8k blocks
elbencho --write --read --size 20G --direct --threads $(nproc) --iodepth 16 \
    --block 8k \
    /mnt/vast/elbencho-files/file

Small File Performance

elbencho can be used to generate many small files in different directories as well. Use --dryrun to make sure you will be generating the number of files that you expect before you run it!

# note that we're passing elbencho a directory!
sudo elbencho --mkdirs --write --read --delfiles --deldirs \
    --size 4k --files 100 --dirs 1000 \
    --threads 32 --direct \
    /mnt/vast/elbencho-files/

Metadata Performance

elbencho can be used to generate some high-level metadata numbers by reading and writing zero-length files. To ensure that we bypass the client-side cache, we need to use --sync and --dropcache, which requires us to run it as root.

# note that we're passing elbencho a directory!
sudo elbencho --sync --dropcache \
    --mkdirs --write --stat --read --delfiles --deldirs \
    --block 0k --size 0k --files 100 --dirs 1000 \
    --threads 32 --direct \
    /mnt/vast/elbencho-files/

Troubleshooting

Not getting the performance you expect? Here are some things to check:

  • Are you using enough concurrency? The number of threads can easily go as high as the number of processors in the system. Don’t forget --iodepth.

  • Are you using nconnect? A single TCP connection can only drive ~2.5GB/s.

  • If you’re using remoteports, are you using enough files or spread_reads / spread_writes? Traffic for a single file can only go to a single IP when using remoteports unless spread_reads / spread_writes are used. The latter are only supported with NFSv3.

  • Is your view or user set up with a Quality of Service policy to limit throughput?

  • If you’re using RDMA, is your network lossless? If not, you’ll see hangs and other unexpected behavior.

  • Are you hitting your NIC limit? Use ethtool to confirm the speed of your network connection.

  • Are you experiencing network fragmentation? If your client uses jumbo frames but not every switch between you and the cluster does, you’ll get packet fragmentation. Use tracepath to confirm you aren’t seeing packet fragmentation.

  • Is your NIC dropping packets or seeing errors? Check ifconfig.

  • Are you able to get line rate from the client to the VAST cluster? CNodes have iperf3 installed; run it in server mode on a CNode and use a copy on the client to confirm the maximum throughput between the two, as sketched below.
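
For example (the CNode VIP is illustrative):

# on a CNode: run iperf3 in server mode
iperf3 -s

# on the client: test throughput to that CNode's VIP with 8 parallel streams
iperf3 -c 172.200.201.1 -P 8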