Client to Protocol Server (CNode) Balancing

Introduction

VAST provides a high-performance file server implementation supporting the NFS, S3, and SMB protocols. Unlike some file systems, VAST does not provide a client-side driver.  Additionally, VAST is largely self-tuning, requiring very little server-side configuration. As such, most of the performance tuning of a VAST system is directly related to the tuning and configuration of the file system client.

We focus here on client behaviors and assume that the VAST system itself has sufficient total throughput and IOPS. VAST itself is linearly scalable and can support as much throughput and IOPS as needed by adding more storage enclosures (D-boxes) and/or protocol servers (CNodes).

In this article, we focus on how to tune Client-to-Protocol Server (CNode) balancing. To learn more about NFS client tuning, refer to NFS Tuning.

Client Connections to VAST

Clients connect to VAST over an IP-based network. More specifically, the client operating system opens a network connection to a VAST Virtual IP (VIP), which ultimately connects to a single VAST CNode on a single network port - for example, an NFS mount uses a single socket to a single VIP, and an S3 connection points to a single VIP. What this means is that throughput from a single client is limited by the number of client connections and how well those connections are spread among the VAST CNodes that provide the actual file system protocol. Note that VAST provides a custom NFS kernel driver with advanced capabilities, including multiple TCP connections, RDMA support, and multiple connections to different CNodes. This article is focused on the simpler case of a client connecting to a single CNode.

First, we need to define what we are balancing. One could balance throughput, IOPS, or the number of client connections. For simplicity, in this article, we will assume that all clients generate similar load and thus focus on balancing the number of client connections to each CNode. 

The obvious concern, of course, is what if different clients generate different amounts of load? While not a complete solution, creating separate VIP pools for different workloads (and ensuring that clients use the pool matching their workload type) also improves balancing, under the assumption that clients executing similar workloads generate similar amounts of load. For the remainder of this article, we'll assume connection balancing is sufficient.

Is Worrying about Balance Crucial?

Before going too far down this path, it is important to remember that in real-world workloads, if all of the clients choose VAST VIPs using a random mechanism such as DNS round robin (with a time to live as low as possible, ideally 0), as recommended, and there are significantly more clients than VIPs, the random balancing of clients to CNodes is usually just fine and there is nothing further to consider. In production environments, that is usually the case. The challenge arises in highly specialized environments, or early in a proof of concept, where the number of clients is limited.
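
As a quick sanity check of round robin behavior, repeat a lookup against your DNS entry and confirm that the answer rotates and the TTL is low. The hostname below is the example name used later in this article; substitute your own:

# With round robin and a low TTL, the order of the returned A records
# (and therefore the VIP a new client picks) should change between runs.
$ dig +noall +answer vast.bigcorp.com
$ dig +noall +answer vast.bigcorp.com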

It's also important to consider the traffic being balanced. In VAST, write traffic involves both foreground work (the actual writes to Optane) and background work (the migration to flash). The cost of write traffic is dominated by the background work, and the background work is automatically shared by all CNodes in a VAST cluster, regardless of which CNode received the write. As such, the balance of client traffic to CNodes is most important for read traffic and far less important for write traffic or mixed read/write workloads.

CNode and VIP Behaviors

Consider the following diagram showing 4 CNodes (protocol servers) exposing 16 VIPs (4 per node, 2 per port). All of the VIPs are registered in a DNS server, which is expected to provide a very low TTL (Time To Live) so that each lookup retrieves a different IP (aka DNS Round Robin). Sometimes, DNS servers do not meet this requirement well, for a variety of reasons. If you are experiencing that issue, note that VAST, starting with version 3.4, includes a built-in DNS server that supports round robin behavior; see Configuring the VAST Cluster DNS Service.

CNode and VIP Behaviors

This shows one possible state of a VAST cluster. Notice that the VIPs are randomly and evenly spread amongst the CNodes. Some time later, if a CNode restarts, the VIPs will still be spread evenly, but they will be in different locations. For example, they might look like this:

CNode and VIP Behaviors

The key observation is that there is no pattern to VIP movement. The only guarantee is that they are balanced across the CNodes and their ports. While not relevant here, we note that if a CNode stops, all of its VIPs will be evenly distributed to other CNodes.

If there are many clients looking up vast.bigcorp.com, their connections will be spread randomly amongst all of the CNodes. With enough clients, randomness works very well.

Random Balancing Downsides

The challenge with random balancing arises when there is a small number of clients relative to the size of the VAST cluster. Randomly selecting VIPs can lead to an imbalance, since VIPs are also randomly distributed across CNodes. For example, if there are 4 CNodes and 16 VIPs, and 4 VIPs are selected at random, it is possible that all 4 selected VIPs live on the same CNode - a significant imbalance! In this case, the best possible performance is achieved by manually balancing client traffic. Even a simple pattern, such as selecting VIPs in sequence, can result in significant imbalance. For example, look at the previous two diagrams and notice how load is balanced if the clients use V1, V2, V3, and V4. Consider the first diagram with 4 clients added:

Notice that CNode 1 is doing twice the work of CNodes 2 and 4, and CNode 3 is idle.
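
Returning to the random-selection case, a rough back-of-the-envelope calculation shows how likely an uneven spread is (assuming each client picks a VIP independently and uniformly at random). With 4 VIPs per CNode, each client lands on any given CNode with probability 1/4, so the chance that 4 clients land on 4 different CNodes is 4!/4^4 ≈ 9%, while the chance that all 4 land on the same CNode is 4/4^4 ≈ 1.6%. In other words, with only a handful of clients, some degree of imbalance is the common case, not the exception.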

Manual Balancing Solutions (when needed)

Manual Balancing Considerations

When balancing clients to CNodes, there are two levels of balancing to consider: the VIPs and the CNodes. Essentially, one can ensure that the VIPs are spread evenly over all clients without considering the relationship of VIPs to CNodes, or one can explicitly consider the number of connections to each CNode. 

If the number of clients is an exact multiple of the number of VIPs, balancing at the VIP level is sufficient and easiest. This works because VAST automatically balances the VIPs among the CNodes. Consider, for example, 16 clients, each selecting one VIP - the result looks like this, which is just fine:

CNodes are more balanced

Even if the VIPs move around, the system remains balanced because there are always 2 VIPs on each CNode port. Interestingly, in this rare case, it might be better to have fewer VIPs than the usual 4 per CNode. If there are only 8 clients, selecting 8 of the 16 VIPs is likely to result in an imbalance in the previous diagram, but if we instead reduce the number of VIPs to 8 (2 per CNode, 1 per port), we have balance:

VIPs move around, the system remains balanced

We won't show it, but this technique holds for any exact multiple of the number of VIPs. And of course, once the number of clients is significantly larger than the number of VIPs, random balancing is just fine, and no manual steps are needed.

Cautions:

  • Generally, fewer than 2 VIPs per CNode is not recommended, since there would not be enough VIPs to keep all CNode network ports in use in a standard VAST configuration.

  • Having just 2 VIPs per CNode makes traffic imbalance more significant when one CNode fails.

On the other hand, if the number of clients is not an exact multiple of the number of VIPs, manual balancing must explicitly account for CNodes; balancing the VIPs just isn't good enough. Consider the diagram above, modified to show 10 clients and some unlucky VIP movement:

CNode 1 is doing twice the work of every other CNode

Notice that CNode 1 is doing twice the work of every other CNode. The only way to effectively balance load for a small number of clients in this situation is to balance by considering the CNodes explicitly. For example, in this situation, it would be better if client 10 (shown as 0) moved to V4 on CNode 2. Obviously, manually balancing this way is more difficult. 

Caution: VIPs Move

VIPs are generally stable, but if a CNode restarts due to an error condition or an upgrade, all VIPs will be redistributed in an unpredictable manner. Therefore, manually balanced client-to-VIP assignments must be rechecked and rebalanced periodically.

Balancing Techniques

Balancing Clients to VIPs

Manually balancing clients to VIPs essentially comes down to explicitly mounting VAST from each client using a chosen VIP instead of a DNS entry. Some customers automate this by writing a simple script that spreads the VIPs across clients - for example, it might take a list of VIPs as input and then use modulo arithmetic on something unique to each client (a node index, IP address, or similar) to pick the correct VIP from the list, as sketched below. Basically, the clients are spread evenly by VIP.
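
A minimal sketch of such a script follows. The VIP list, the use of the client's last IP octet as its unique index, and the export path and mount options are all assumptions to adapt to your environment:

#!/bin/bash
# Deterministically pick one VIP per client using modulo arithmetic,
# so a fleet of clients spreads itself evenly across the VIP list.
VIPS=(10.101.127.1 10.101.127.2 10.101.127.3 10.101.127.4 \
      10.101.127.5 10.101.127.6 10.101.127.7 10.101.127.8)

# Use the last octet of this client's primary IP address as a unique index.
MY_IP=$(hostname -I | awk '{print $1}')
INDEX=$(( ${MY_IP##*.} % ${#VIPS[@]} ))

# Mount VAST over NFS using the selected VIP (path and options are examples).
mount -t nfs -o vers=3 "${VIPS[$INDEX]}:/export" /mnt/vast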

Balancing Clients to VIPs by Leveraging VIP Pools

In a VAST cluster, there is a single VIP Pool by default, but additional pools can be created that restrict a pool's VIPs to a specific set of CNodes. The primary purpose of VIP Pools is to guarantee certain qualities of service to particular clients or workloads by dedicating specific CNodes to them. The same balancing considerations discussed above apply when using VIP Pools.

Perhaps unexpectedly, VIP Pools can also be used to spread load more evenly if used properly. For example, if a pool of just 4 VIPs is created across 4 CNodes, it is guaranteed that those VIPs will be spread among the 4 CNodes (assuming all 4 are running). This makes it possible for just 4 clients to get guaranteed spreading of load if they use those 4 VIPs (essentially a special case of VIP balancing). By contrast, if there are 16 VIPs and 4 CNodes, 4 clients using 4 VIPs in sequence might end up wildly out of balance. Consider this perfectly legal situation:

Unbalanced CNodes

But if those same 4 clients use 4 VIPs from a pool of 4 VIPs that is assigned to 4 CNodes, they are guaranteed to get better load spreading. The only difference now is that VIPs 1-4 are part of a VIP Pool assigned to CNodes 1-4. 

Balanced CNodes

While we won't show it, since VAST guarantees that VIPs in the same pool are balanced as evenly as possible across the CNodes, the best possible balancing is achieved by creating a VIP Pool with exactly the same number of VIPs as clients. Obviously, this consumes IP addresses, but for specialized scenarios and POCs with small numbers of clients, this approach is quite simple to implement.

Caution:

VAST guarantees that VIPs from a given pool are evenly balanced across CNodes, not across CNode ports. Thus, if two VIP Pools are in use, it is possible that all the VIPs from one pool will be assigned to one port of any particular CNode, and all the VIPs from the other pool will be assigned to the other port (internal feature ORION-44135). This typically doesn't matter, but in extreme load scenarios, you may find that clients bottleneck on a particular port that has a physical throughput limit. As of this writing (VAST 3.4 with typical hardware), a CNode can handle 10GB/sec of reads, but a single port can handle 5GB/sec. 

Balancing Clients to CNodes

To spread clients across CNodes explicitly, you need to first determine which VIPs are assigned to which CNodes. This can be done using the VMS web UI (Configuration -> Virtual IPs) or via VCLI as follows:

#from the CNodes running VMS, start VCLI
$ vcli
vcli: admin> vip list
+-------------+--------------+-----------+-----------+
| VIP-Pool    | Virtual-IP   | CNode     | Cluster   |
+-------------+--------------+-----------+-----------+
| HPC         | 10.101.127.1 | cnode-2   | se-demo-1 |
| HPC         | 10.101.127.2 | cnode-3   | se-demo-1 |
| HPC         | 10.101.127.3 | cnode-3   | se-demo-1 |
| HPC         | 10.101.127.4 | cnode-1   | se-demo-1 |
| HPC         | 10.101.127.5 | cnode-4   | se-demo-1 |
| HPC         | 10.101.127.6 | cnode-2   | se-demo-1 |
| HPC         | 10.101.127.7 | cnode-1   | se-demo-1 |
| HPC         | 10.101.127.8 | cnode-4   | se-demo-1 |
+-------------+--------------+-----------+-----------+

Notice in this example that there are 8 VIPs and 4 CNodes (the more typical configuration is 4 VIPs per CNode). cnode-2 is using VIPs 10.101.127.[1,6], cnode-3 is using 10.101.127.[2,3], and so on. You'll need to mount VAST from your clients to account for this. For example, if there are only four clients, it would be best to use 10.101.127.[1,2,4,5], one per CNode. With 8 clients, all VIPs can be used.
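
With four clients, that might look like the following mount commands, one per client (the export path /export and mount point /mnt/vast are illustrative):

# Client 1 - lands on cnode-2
mount -t nfs 10.101.127.1:/export /mnt/vast
# Client 2 - lands on cnode-3
mount -t nfs 10.101.127.2:/export /mnt/vast
# Client 3 - lands on cnode-1
mount -t nfs 10.101.127.4:/export /mnt/vast
# Client 4 - lands on cnode-4
mount -t nfs 10.101.127.5:/export /mnt/vast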

This process can be automated, and we've published an article that highlights some of the key tools that you will need to incorporate into a complete solution: VIP to CNode mapping via REST API
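
As a rough sketch of what such automation might query, the VMS REST API can return the current VIP-to-CNode mapping. The endpoint, credentials, and field names below are assumptions; consult the linked article and your cluster's API documentation for the exact calls:

# List VIPs and the CNode each is currently assigned to (field names assumed).
curl -sk -u admin:password "https://vms.bigcorp.com/api/vips/" | jq -r '.[] | "\(.ip) \(.cnode)"'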

 

Protocol Specifics

We've covered a lot of ground in this article, and everything has been intentionally generic, as balancing is really a behavior of the network and VAST CNodes and not protocol-specific. To make things a bit more concrete, we'll provide some protocol-specific insights here.

NFS

When an NFS client mounts a file system, the mount command takes an IP address or a DNS entry. If DNS RR is in use, every mount from every client effectively uses a different VIP and thus a different CNode at random. All traffic related to that mount will be directed to that CNode. If random is good enough, you are done. If it's not, you'll need to mount VAST using the IP addresses of the CNode VIPs, spreading traffic as discussed in this article.
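
For reference, the two approaches look like this as /etc/fstab entries (export path, mount points, and options are examples):

# Rely on DNS round robin; whichever VIP the name resolves to picks the CNode:
vast.bigcorp.com:/export  /mnt/vast1  nfs  defaults  0 0
# Pin this client to a specific VIP chosen for balance:
10.101.127.3:/export      /mnt/vast2  nfs  defaults  0 0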

S3

S3 has no concept of mount. Beneath the surface, an S3 client is simply opening a bunch of HTTP connections. If DNS RR is in use, then each client instantiation (typically an operating system process) will get a VIP to use at random, spreading the load randomly. Since S3 clients tend to be shorter-lived than NFS mounts, random behavior is more likely to work with S3 clients, but even here, randomness may not be sufficient. If random load spreading isn't good enough, then VAST VIPs will need to be used explicitly by the S3 clients.
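
As a concrete illustration with the AWS CLI (the VIP and bucket name are placeholders; any S3 client with a configurable endpoint works the same way):

# Point this S3 client at a specific VAST VIP instead of a DNS name:
aws s3 ls s3://my-bucket --endpoint-url http://10.101.127.2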

SMB

SMB clients connect to VAST by specifying the share as something like \\vast.bigcorp.com\share. As with other protocols, this triggers a DNS lookup, and a random VAST VIP will be used if DNS RR has been enabled properly (probably in Active Directory). Since SMB access to VAST uses Kerberos for authentication, it is necessary to have a proper Service Principal Name (SPN) that maps to a proper identity in AD (e.g., HOST/vast.bigcorp.com). To ensure there is a proper SPN, edit the computer object in AD for VAST (created and specified during the initial AD configuration of VAST) and use the advanced settings to specify additional SPNs. By default, there will be an SPN for the VAST machine name (e.g., HOST/vast) and for the machine name in its fully qualified form (e.g., HOST/vast.bigcorp.com).

VAST does not currently support NTLM. Therefore, connecting to a share via IP address is not expected to work - meaning manually mounting different VIPs by IP address will not work - a technique that does work with NFS. Additionally, if the same Windows client mounts the same share using the same FQDN, the DNS RR lookup does not appear to be repeated, resulting in multiple mounts going to the same VAST protocol server. In the unlikely event that you need to mount the same share multiple times from the same client and want the load to be spread across multiple VIPs, you must create multiple DNS names that point to either different VIP pools or individual VIPs (for example v1.vast.bigcorp.com, v2.vast.bigcorp.com), along with matching SPNs. If that is done, each SMB client system can mount the VAST share in a manner very similar to NFS. Further, if there are multiple VAST shares, each can be mounted independently on an SMB client, allowing for load spreading there as well.
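
A sketch of the Active Directory side of that setup, run from an elevated PowerShell prompt (the extra DNS names and the 'VASTCLUSTER$' computer account are hypothetical examples; use the computer object created during your VAST AD configuration):

# Register additional SPNs against the VAST computer account:
setspn -S HOST/v1.vast.bigcorp.com 'VASTCLUSTER$'
setspn -S HOST/v2.vast.bigcorp.com 'VASTCLUSTER$'
# Each client can then map a different name, spreading load across VIPs:
net use Z: \\v1.vast.bigcorp.com\share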