Overview
This article describes the recommended settings for Windows clients using the VAST SMB3 multichannel feature in order to reach the optimal performance the client can provide.
Note that 100GbE-connected CNodes are required for >5GB/sec single-mountpoint performance; standard-configuration systems are constrained by their 50GbE splitter cables. Existing systems can be retrofitted with 100GbE cables.
Note that we have seen varying performance results across clients using the same VAST clusters. The differences came down to client configuration and hardware, so it is important to know which client is used, as it directly impacts the performance we will get.
The first step is to validate network performance using iperf; see the Troubleshooting section below.
Install NVIDIA/Mellanox MOFED for Windows
Download here: Mellanox OFED for Windows - WinOF / WinOF-2
* WinOF-2 is for ConnectX-4 and later
Confirm SMB3 is used
Use Get-SmbConnection to confirm Dialect is 3.0.
PS C:\Windows\system32> Get-SmbConnection
ServerName ShareName UserName Credential Dialect NumOpens
---------- --------- -------- ---------- ------- --------
v134 SMB SLI\adar SLI.VASTDATA.COM\adar 3.0 2
Use Get-SmbMultichannelConnection to confirm multichannel connections are seen and from which interface.
PS C:\Windows\system32> Get-SmbMultichannelConnection
Server Name Selected Client IP Server IP Client Interface Index Server Interface Index Client RSS Capable Client RDMA Capable
----------- -------- --------- --------- ---------------------- ---------------------- ------------------ -------------------
v134 True 172.30.190.179 172.30.134.16 7 2 True False
Use netstat to confirm the active number of TCP sessions.
NOTE: We might need to generate some IO to establish multiple connections; see the last section below on running some test IO.
PS C:\Windows\system32> .\NETSTAT.EXE -an | ? {$_ -match "445"}
TCP 0.0.0.0:445 0.0.0.0:0 LISTENING
TCP 172.30.190.179:49834 172.30.134.16:445 ESTABLISHED
TCP 172.30.190.179:50217 172.30.134.16:445 ESTABLISHED
TCP 172.30.190.179:50218 172.30.134.16:445 ESTABLISHED
TCP 172.30.190.179:50220 172.30.134.16:445 ESTABLISHED
Configure SMB client
NOTE: SMB signing impacts performance by up to 70%! Even after you disable it as shown below, it can be automatically re-enabled by an AD group policy.
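To catch a group policy silently re-enabling signing, you can re-check the effective client settings after a policy refresh. This is a minimal sketch using the built-in SmbShare cmdlets:

```powershell
# Refresh computer group policy, then verify signing stayed disabled
gpupdate /target:computer
Get-SmbClientConfiguration | Select-Object EnableSecuritySignature, RequireSecuritySignature
# Both values should report False for best performance
```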
Disable EnableSecuritySignature
Disable RequireSecuritySignature
Increase ConnectionCountPerRssNetworkInterface to at least 8
Increase MaxCmds to 100
Set-SmbClientConfiguration -RequireSecuritySignature 0 -ConnectionCountPerRssNetworkInterface 12 -EnableSecuritySignature 0 -MaxCmds 100
Example configuration
PS C:\Windows\system32> Get-SmbClientConfiguration
ConnectionCountPerRssNetworkInterface : 12
DirectoryCacheEntriesMax : 16
DirectoryCacheEntrySizeMax : 65536
DirectoryCacheLifetime : 10
DormantFileLimit : 1023
EnableBandwidthThrottling : True
EnableByteRangeLockingOnReadOnlyFiles : True
EnableInsecureGuestLogons : False
EnableLargeMtu : True
EnableLoadBalanceScaleOut : True
EnableMultiChannel : True
EnableSecuritySignature : False
ExtendedSessionTimeout : 1000
FileInfoCacheEntriesMax : 64
FileInfoCacheLifetime : 10
FileNotFoundCacheEntriesMax : 128
FileNotFoundCacheLifetime : 5
KeepConn : 600
MaxCmds : 100
MaximumConnectionCountPerServer : 32
OplocksDisabled : False
RequireSecuritySignature : False
SessionTimeout : 60
UseOpportunisticLocking : True
WindowSizeThreshold : 8
Tune CPU assignment
Use Get-NetAdapterRSS and Set-NetAdapterRSS to confirm that:
The CPU profile is set to Closest
NumberOfReceiveQueues is set to at least 8
The NUMA distance from the adapter to the CPU is 0
Set-NetAdapterRss -Name "Ethernet 2" -MaxProcessors 16 -MaxProcessorNumber 30 -NumberOfReceiveQueues 16
For example, below we see that RssProcessorArray contains cores with a distance of 0 but also cores with a distance of 32767, which means those cores are farther away and the performance we get from them will be suboptimal.
Name : Ethernet 2
InterfaceDescription : Mellanox ConnectX-5 Adapter
Enabled : True
NumberOfReceiveQueues : 40
Profile : Closest
BaseProcessor: [Group:Number] : 0:0
MaxProcessor: [Group:Number] : 0:62
MaxProcessors : 8
RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0 0:2/0 0:4/0 0:6/0 0:8/0 0:10/0 0:12/0 0:14/0
0:16/0 0:18/0 0:20/0 0:22/0 0:24/0 0:26/0 0:28/0 0:30/0
0:32/32767 0:34/32767 0:36/32767 0:38/32767 0:40/32767 0:42/32767 0:44/32767 0:46/32767
0:48/32767 0:50/32767 0:52/32767 0:54/32767 0:56/32767 0:58/32767 0:60/32767 0:62/32767
To correct that, we can set the max processor number and limit which processors are used.
Set-NetAdapterRss -Name "Ethernet 2" -MaxProcessorNumber 30
Name : Ethernet 2
InterfaceDescription : Mellanox ConnectX-5 Adapter
Enabled : True
NumberOfReceiveQueues : 16
Profile : Closest
BaseProcessor: [Group:Number] : 0:0
MaxProcessor: [Group:Number] : 0:30
MaxProcessors : 16
RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0 0:2/0 0:4/0 0:6/0 0:8/0 0:10/0 0:12/0 0:14/0
0:16/0 0:18/0 0:20/0 0:22/0 0:24/0 0:26/0 0:28/0 0:30/0
IndirectionTable: [Group:Number] : 0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
0:16 0:18 0:20 0:22 0:24 0:26 0:28 0:30
Configure the client adapter
Increase the MTU to use jumbo frames, up to the maximum the switches and the client adapter allow.
Increase the receive and transmit ring sizes to the maximum supported by the adapter.
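As a sketch, these settings can be applied with the built-in NetAdapter cmdlets. The registry keywords ("*JumboPacket", "*ReceiveBuffers", "*TransmitBuffers") are standard Windows advanced-property names, but the adapter name "Ethernet 2" and the values shown are assumptions; query your adapter for its supported maximums first.

```powershell
# List tunable properties and their valid ranges for the adapter
Get-NetAdapterAdvancedProperty -Name "Ethernet 2"

# Enable jumbo frames (9014 bytes is typical; must match the switch MTU)
Set-NetAdapterAdvancedProperty -Name "Ethernet 2" -RegistryKeyword "*JumboPacket" -RegistryValue 9014

# Raise ring sizes toward the adapter maximum (values here are illustrative)
Set-NetAdapterAdvancedProperty -Name "Ethernet 2" -RegistryKeyword "*ReceiveBuffers" -RegistryValue 4096
Set-NetAdapterAdvancedProperty -Name "Ethernet 2" -RegistryKeyword "*TransmitBuffers" -RegistryValue 4096
```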



Client-specific added notes
The section below is an add-on to the above recommendations.
HPE client
We found that to reach full performance, we had to disable Receive Segment Coalescing (RSC) when using SMB3 multichannel.
Environment
HPE Gen10 Ice Lake clients with a ConnectX-6 dual-port 100GbE adapter.
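Assuming the standard NetAdapter cmdlets are available, RSC can be checked and disabled per adapter as sketched below; the adapter name is a placeholder.

```powershell
# Check current RSC state (IPv4 and IPv6 are tracked separately)
Get-NetAdapterRsc -Name "Ethernet 2"

# Disable RSC on the interface used for SMB3 multichannel
Disable-NetAdapterRsc -Name "Ethernet 2"
```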

AMD client
For AMD clients with Mellanox adapters, we found that changing PCI_WR_ORDERING is required to achieve the expected performance. (Otherwise, you may be limited to around 2GB/sec read performance.)
Environment
Lenovo P620 with a single-port ConnectX-6 100GbE adapter.
This has to be done using the Mellanox MFT tool for Windows. NVIDIA Firmware Tools (MFT)
C:\WINDOWS\system32> mlxconfig -d /dev/mst/mt4123_pciconf0.1 s PCI_WR_ORDERING=1
Device #1:
Device type: ConnectX6
Name: MCX653106A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
Device: /dev/mst/mt4123_pciconf0.1
Configurations: Next Boot New
PCI_WR_ORDERING per_mkey(0) force_relax(1)
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Additional AMD BIOS settings:
To enable x2APIC (Default XAPIC MSI Interrupts), set the following in the BIOS:
Advanced → AMD CBS → CPU Common Options → Local APIC Mode → x2APIC
Enable “Preferred I/O” in PCI options.
Turn off CPU power savings.
Troubleshooting
Confirm the network is working as expected between the client and the VAST CNode. If iperf cannot reach 8 to 9GB/sec from a Windows client, then SMB3 multichannel won't either.
To validate network performance between Windows clients and VAST CNodes, you can use iperf2 to test the adapter's maximum bandwidth. (Not iperf3; it is not multi-threaded.)
Use the second example with -M 9000 -m -N -e to get more details, including retrans.
Monitor adapter counters
The Mellanox WinOF-2 driver package can be installed to gain extended debug counters from the adapter using Windows perfmon.
iPerf2
Download iperf2 exe file to the Windows client.
Test by simulating a read and a write.
Iperf 2 - Browse Files at SourceForge.net
Read from Vast
Win:
.\iperf-2.1.8-win.exe -s -p 7575 -w 1024KB
cnode:
iperf -c Client_IP -p 7575 -i 5 -P 10 -w 1024KB
iperf -c Client_IP -p 7575 -i 5 -P 10 -w 2m -m -N -e
Write to Vast
cnode:
iperf -s -p 7575 -w 1024KB
Win:
iperf -c Vast_VIP -p 7575 -i 5 -P 10 -w 1024KB
Running IO
FIO and Frametest are tools that can be used here for test IO.
FIO
Windows download: https://bsdio.com/fio/releases/fio-3.27-x86.zip
Layout a file
fio.exe --name=test --ioengine=windowsaio --rw=write --bs=1m --direct=1 --size=50g --numjobs=1 --thread=1 --iodepth=16 --group_reporting --fallocate=none --refill_buffers --randrepeat=0 --create_on_open=1
Read
fio.exe --name=test --ioengine=windowsaio --rw=randread --bs=1m --direct=1 --size=50g --numjobs=1 --thread=1 --iodepth=16 --group_reporting --time_based --fallocate=none --refill_buffers --randrepeat=0 --create_on_open=1 --runtime=600
Frametest
Windows Download: How to use frametest
Create a directory and lay out the files.
frametest.exe -w 4k -t 20 4k_dir
Read
frametest.exe -r 4k -t 20 4k_dir
Other causes of bad performance:
Overheating of the motherboard “chipset chip”, CPU, and Mellanox card.
Check the Windows Event Viewer for system events, specifically mlx5 (Win-Mofed will not say “overheating,” but stock Windows drivers will).
This was found to limit performance to 2.2GB/sec on Windows and 21 Gbit/sec with iperf, without any other warnings or event messages.
The same machine booted Linux and ran iperf at 93Gbit/s.
After improving cooling and swapping cards, the client reached 8GB/sec SMB3 multichannel performance under Windows.
Install WinOF-2 (WinMofed) and the MFT tools, then read the adapter temperature with mget_temp -d mt4123_pciconf0
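As a sketch for scanning the System log for Mellanox driver events, the provider-name match below is an assumption; verify the exact provider name on your system with Get-WinEvent -ListProvider.

```powershell
# Look for recent mlx5 driver events (thermal or link warnings) in the System log
Get-WinEvent -LogName System -MaxEvents 2000 |
    Where-Object { $_.ProviderName -match 'mlx' } |
    Select-Object -First 20 TimeCreated, LevelDisplayName, Message
```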
PCIe cards in PCIe 3.0 x8 slots: these can still drive 7 to 7.5GB/sec on a 100GbE link, but no faster, and there are no warnings or indicators beyond the CX6 interface details in Device Manager.
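To confirm the negotiated PCIe link, the built-in Get-NetAdapterHardwareInfo cmdlet reports the link speed and width; the adapter name is a placeholder.

```powershell
# A CX6 at full speed should report 16 GT/s x16; 8 GT/s x8 caps around 7 to 7.5GB/sec
Get-NetAdapterHardwareInfo -Name "Ethernet 2" |
    Select-Object Name, PcieLinkSpeed, PcieLinkWidth, NumaNode
```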
Note that a firmware update for the CX6 will reset PCI_WR_ORDERING, removing the relaxed-ordering setting (for AMD boxes). Check it again after updating!
There is also a “CX6 Power limit to 25W” UEFI or Firmware setting. This will also kill performance.