Intro
Elbencho is an open-source benchmarking tool. It can be used with file and object storage systems, which is useful for testing multi-protocol access in a VAST system. Elbencho also makes it easy to run distributed tests across multiple clients, which is typically required to drive the full throughput of a VAST system.
Getting elbencho
Elbencho can be built from source, but that's not necessary. It's easier to just download the static binary, which runs on any Linux distribution without even needing to install an rpm/deb package: elbencho GitHub
Running elbencho with multiple clients
To run coordinated elbencho tests across multiple clients, you would first start it on the clients that you want to use for a test in "service mode" like this:
$ elbencho --service

The elbencho service then just sits idle in the background, waiting for commands from a master instance. In addition to the normal benchmark parameters, the master instance simply specifies the hostnames of the service instances, either directly on the command line, like this:
$ elbencho --hosts "node0[10-12,14],node[115,116]"

...or through a hostsfile, which contains the corresponding hostnames or IP addresses of the service nodes, newline-separated:
$ elbencho --hostsfile /path/to/myhostsfile.txt

All examples below can also be used on a single client by just omitting the --hosts/--hostsfile arguments.
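For illustration, the hostsfile simply contains one client hostname or IP address per line, for example (placeholder names):

$ cat /path/to/myhostsfile.txt
node010
node011
node012
node014

If you have passwordless ssh access to the clients, one possible way to start the service on all of them at once is a small loop like the following (just a sketch; any other mechanism for running a command on all clients works as well):

$ for h in $(cat /path/to/myhostsfile.txt); do ssh "$h" elbencho --service; done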
Simple max bandwidth tests for starters
Choosing the right number of clients
The exact number of clients and threads needed to drive the full bandwidth of a VAST system depends on the client and server hardware. But very roughly speaking, a single VAST protocol server (CNode) can serve data at about 10GB/s (100Gbit), which gives you an indication of the number of clients needed for full bandwidth, depending on their network interconnect.
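As a rough worked example (the numbers are purely illustrative): a cluster with 8 CNodes could serve on the order of 8 x 10GB/s = 80GB/s, so you would need at least eight clients with 100Gbit NICs, or correspondingly fewer clients with faster interconnects, to stand a chance of saturating it.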
While there generally is more overhead in an individual S3 request in comparison to an NFS request (e.g., a single 4KB read via S3 is significantly more "expensive" than a single 4KB read via NFS), VAST sees all protocols as "first-class citizens" of the architecture and thus enables high bandwidth also via S3.
General considerations for max bandwidth
In general, the same rules apply to S3 benchmarking that also apply to other access protocols on VAST: It doesn't make sense to write only zeros (as many benchmarking tools would do by default), because the VAST system would deduplicate all of those writes into a single block. For that reason, elbencho generates non-reducible data by default, so no extra parameter is needed.
Also, the workload needs to be nicely spread across the VAST CNodes, which you do by providing all VAST VIPs to elbencho via the --s3endpoints parameter.
The full read bandwidth of the VAST system will only be unleashed when the data has been migrated over from the SCM write buffer to the data drives. In a production system, this would continuously happen in the background based on new data being ingested into the system. But in a benchmarking environment, it's a good idea to write about 4TB of data per VAST NVMe enclosure (DBox) to ensure the majority of data has been migrated.
The actual bandwidth test
A simple bandwidth test can use a fixed number of large objects so that the same dataset can be used independently of the number of clients and independently of the number of threads per client that will later be used to read the data back. For this, the object names can be provided directly to elbencho as command-line parameters.
The following example assumes that you have already created a bucket named mybucket and that you have generated an S3 access key and secret key pair for a user. Also, the 8 VIPs in this example need to be replaced by the actual range of VIPs of your VAST system. (If needed, you can also add -d to have elbencho create the bucket.)
This command will write (-w) 256 objects in 16MiB blocks (-b 16m), each of the objects being 16GiB in size (-s 16g), for a total of 4TiB. The number of threads (-t 48) is per-client.
$ elbencho --s3endpoints "http://172.200.203.[1-8]" --s3key="..." --s3secret="..." -w -t 48 -s 16g -b 16m --hostsfile myhosts.txt "mybucket/bigobjects/file[1-256]"

The dataset size of 4TiB is appropriate for a single VAST DBox. For more DBoxes, you would just linearly increase the object size (e.g., -s 32g for two DBoxes).
The same command, just with -r (read) instead of -w (write), can be used to read the data back.
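For reference, the corresponding read command would look like this (same placeholder endpoints, credentials, and hostsfile as in the write example above):

$ elbencho --s3endpoints "http://172.200.203.[1-8]" --s3key="..." --s3secret="..." -r -t 48 -s 16g -b 16m --hostsfile myhosts.txt "mybucket/bigobjects/file[1-256]"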
In the S3 world, there isn't really a concept of random writes, but at least reads can be done from random offsets with the --rand parameter. And of course, the dataset could also be read back using different thread counts, different client counts, or different block sizes of interest (e.g., -b 1m for 1MiB reads).
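As one example of such a variation, a random-offset read of the same dataset in 1MiB blocks could look like this (a sketch based on the read command above):

$ elbencho --s3endpoints "http://172.200.203.[1-8]" --s3key="..." --s3secret="..." -r --rand -t 48 -s 16g -b 1m --hostsfile myhosts.txt "mybucket/bigobjects/file[1-256]"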
Non I/O (metadata) Tests
ListObjects
Before a LIST objects test can be run, you must first create the objects.
The --dirs, --files, and --threads options interact to determine the total number of objects created. Each participating client/daemon will create:
numfiles = dirs * files * threads

The example below writes ~2.5M 1-byte objects per client (40 dirs x 1000 files x 64 threads = 2,560,000). Multiply by the total number of clients to calculate the total number of objects created. For S3, the --dirs value controls how objects are grouped using key separators (/). On a VAST cluster, this implicitly creates directories, which are required if NFS clients will access the same bucket.
Step 1: Create the dataset
elbencho --hostsfile hosts.txt \
--s3endpoints "http://172.200.201.[1-8]" \
--s3key=${s3key} --s3secret=${s3secret} \
--s3objprefix smallobjects/ \
--dirs 40 --files 1000 \
--threads 64 --size 1 --block 1 \
--write \
andyperns3bucket

Parallel listing (--s3listobjpar)
Use --s3listobjpar when the thread layout matches how the data was created. Each thread independently lists its own prefix tree in parallel, giving you full multi-thread throughput. This requires that --dirs matches the value used during dataset creation.
elbencho --hostsfile hosts.txt \
--s3endpoints "http://172.200.201.[1-8]" \
--s3key=${s3key} --s3secret=${s3secret} \
--s3objprefix smallobjects/ \
--dirs 40 --files 1000 \
--threads 64 \
--s3listobjpar \
andyperns3bucket

Sequential listing (--s3listobj)
Use --s3listobj NUM when you have no prior knowledge of the bucket's structure -- analogous to a find command. This uses a single thread per bucket to sequentially list everything. Set NUM to -1 to list all objects.
elbencho --hostsfile hosts.txt \
--s3endpoints "http://172.200.201.[1-8]" \
--s3key=${s3key} --s3secret=${s3secret} \
--s3listobj -1 \
andyperns3bucket

When to use which:
| Mode | Flag | Threads | Use case |
|---|---|---|---|
| Parallel list | --s3listobjpar | Multi (matches creation layout) | Benchmark peak listing throughput; requires known prefix structure |
| Sequential list | --s3listobj NUM | 1 per bucket | Unknown bucket contents; simulates a real-world discovery scan |
HEAD Requests (--stat)
The --stat flag tests S3 HEAD request performance. It works the same way as --write/--read: each thread operates on its own objects, and the rate is measured per thread. --stat can be combined with --write and --read in a single command, and elbencho will execute each operation as a separate phase.
Example: write, then HEAD, then read
The following runs 10 threads, each first creating 10 objects of 1 byte each (write phase), then doing a HEAD on each of those objects (stat phase), then reading them back (read phase). The --dirs 1 means each thread has its own unique subdir/prefix; using --dirs 0 would have all threads share the same prefix.
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
elbencho \
--s3endpoints "http://1.2.3.[4-10]" \
--write --read --stat \
-b 1 -s 1 -t 10 \
--dirs 1 --files 10 \
--s3objprefix "test1/" \
mybucket

Combining with listing: You can append either --s3listobjpar or --s3listobj -1 to the same command to add a list phase after the read phase, or run listing separately against an existing dataset.
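For example, appending a parallel list phase to the stat example above could look like this (same placeholder endpoints and prefix, with the access keys still exported as environment variables; just a sketch):

elbencho \
  --s3endpoints "http://1.2.3.[4-10]" \
  --write --read --stat \
  -b 1 -s 1 -t 10 \
  --dirs 1 --files 10 \
  --s3objprefix "test1/" \
  --s3listobjpar \
  mybucket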
Multiprotocol access
Since there is no real difference between a file and an object in a VAST system, you might be interested in trying to read your objects back as files. The corresponding elbencho command for the dataset generated above would look like this, assuming the NFS mountpoint /mnt/vast/mybucket refers to the S3 bucket from the previous section:
$ elbencho -r -t 48 -s 16g -b 16m --hostsfile myhosts.txt "/mnt/vast/mybucket/bigobjects/file[1-256]"

Multiple objects per thread
Specifying object names directly on the command line as elbencho parameters works only for a relatively limited number of objects, such as the 256 in the examples above. For tests with significantly more objects, you would rather specify a certain number of subdirs and objects per subdir for each thread. In this case, you would only provide the bucket name as an argument and add -n (number of subdirs per thread) and -N (number of files per subdir) as parameters, like this:
$ elbencho --s3endpoints "http://172.200.203.[1-8]" --s3key="..." --s3secret="..." -w -t 48 -s 16m -b 16m -n 5 -N 10 --hostsfile myhosts.txt mybucket

This command will make each thread create 5 subdirs (-n 5), inside each of which it will create 10 objects (-N 10) of 16MiB in size (-s 16m), each object uploaded in a single 16MiB request (-b 16m). With e.g. 4 clients, this would mean a total of 9,600 objects: 4 clients x 48 threads_per_client x 5 subdirs_per_thread x 10 objects_per_subdir.
The dataset can be read back by using the same command with -r (read) instead of -w (write), but since it has been created for a certain number of threads, it can only be read back with the same or a lower number of threads.
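The corresponding S3 read command would therefore look like this (same placeholder endpoints and credentials, same thread count as at creation time):

$ elbencho --s3endpoints "http://172.200.203.[1-8]" --s3key="..." --s3secret="..." -r -t 48 -s 16m -b 16m -n 5 -N 10 --hostsfile myhosts.txt mybucket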
And again, the same dataset could also be read back via NFS like this:
$ elbencho -r -t 48 -s 16m -b 16m -n 5 -N 10 --hostsfile myhosts.txt /mnt/vast/mybucket

Keeping results
By using elbencho's service mode, you will get a single aggregate result for all clients, instead of having to gather individual results and verify that all clients actually ran at the same time. Elbencho shows two result sets for each run: The aggregate end result ("last done", referring to the point in time when the slowest thread finished its work) and the aggregate "first done" result, referring to the point in time when the fastest thread/client finished its work. The phase between "first done" and "last done" is called the "tail" and is usually a phase of lower throughput based on the fact that fewer threads are active.
To preserve the human-readable results that are shown on the console, you can use the --resfile /path/to/results.txt parameter.
To write the end results into a CSV file, you can use the --csvfile /path/to/results.csv parameter.
elbencho will append to result files and not overwrite them if they already exist. This can be useful, e.g., to build graphs from throughput results with different object sizes or different block sizes from a CSV file via spreadsheet applications.
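For example, a small sweep over different block sizes that appends all results to the same CSV file could look like this (a sketch reusing the placeholder endpoints and the large-object dataset from above):

$ for bs in 1m 4m 16m; do
    elbencho --s3endpoints "http://172.200.203.[1-8]" --s3key="..." --s3secret="..." \
      -r -t 48 -s 16g -b $bs --hostsfile myhosts.txt \
      --csvfile /path/to/results.csv "mybucket/bigobjects/file[1-256]"
  done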
You might also find the following command useful to view CSV file contents on the console:
$ column -t -s, /path/to/results.csv | less

Additional notes and limitations for S3
Different from the file world, in the S3 world, objects only appear in the namespace when they are completely uploaded. That means during the upload, you won't see the object when you list the bucket via S3 or the corresponding directory via NFS, and if you press CTRL+C in the middle of writing a large object, then elbencho will notify the S3 server to discard the partially uploaded content of the current object.
There are no subdirectories in the S3 world, but there is a concept of "separators" in object names, which can be used to group sets of objects within the same bucket together - conceptually similar to subdirectories. Not surprisingly, the slash ("/") is a commonly used separator in the S3 world. Thus, to bring everything nicely together with the file world, VAST systems interpret elements of object names with a trailing slash as directory names.
In the S3 world, there are simple PUTs (i.e., upload of an object through a single HTTP request, which only makes sense for small or medium-sized objects) and multi-part uploads (i.e., upload through multiple HTTP requests, which makes sense for larger objects). If the given elbencho block size is equal to the given object size (e.g., elbencho -b 16m -s 16m), then elbencho automatically uses a simple single PUT for the upload. If the block size is smaller than the object size (e.g., elbencho -b 16m -s 1g), then elbencho automatically uses a multi-part upload. The block size (-b) needs to be allocated in RAM by each thread, hence it wouldn't be practical to upload, e.g., a 1TB object without using multi-part upload.

Amazon defined that a multi-part upload cannot have more than 10,000 parts. VAST implements this limitation. That means elbencho -w -s 1g -b 4k mybucket/myobj1 --s3endpoints ... would not work, because 1GiB divided by 4KiB is 262,144, far more than 10,000 parts. The same example with -b 1m would work, because 1GiB divided by 1MiB results in only 1,024 parts being uploaded for the object.

Amazon defined that an individual part of a multi-part upload cannot be smaller than 5MiB. VAST does not implement this limitation. That means elbencho -w -s 10m -b 1m would not work on most object stores, but would work on VAST. Consequently, typical S3 applications use at least 5MiB as the block size when writing larger objects.

For reads, the multi-part upload limitations do not apply, so, e.g., reading a 1GiB file in 4KiB blocks would work, but due to the overhead of the S3 protocol for very small requests, it's something that normal S3 applications would typically try to avoid.