SMB3 Multichannel Best Practices


Overview

This article describes the recommended settings for Windows clients using the VAST SMB3 multichannel feature, so that clients can reach the optimal performance the feature provides.

Note that 100GbE-connected CNodes are required for >5GB/sec single-mountpoint performance, which is otherwise constrained by the 50GbE splitter cables in standard-configuration systems. Existing systems can be retrofitted with 100GbE cables.

Note that we have seen varying performance results across clients using the same VAST clusters. The differences came down to client configuration and hardware, so it is important to understand which client is being used, as it has a direct impact on the performance you will get.

The first step is to validate network performance using iperf; see the Troubleshooting section below.

Install NVIDIA/Mellanox MOFED for Windows

Download here: Mellanox OFED for Windows - WinOF / WinOF-2
* WinOF-2 is for ConnectX-4 and later

Confirm SMB3 is used

Use Get-SmbConnection to confirm Dialect is 3.0.

PS C:\Windows\system32> Get-SmbConnection

ServerName ShareName UserName Credential            Dialect NumOpens
---------- --------- -------- ----------            ------- --------
v134       SMB       SLI\adar SLI.VASTDATA.COM\adar 3.0     2

Use Get-SmbMultichannelConnection to confirm multichannel connections are seen and from which interface.

PS C:\Windows\system32> Get-SmbMultichannelConnection

Server Name Selected Client IP      Server IP     Client Interface Index Server Interface Index Client RSS Capable Client RDMA Capable
----------- -------- ---------      ---------     ---------------------- ---------------------- ------------------ -------------------
v134        True     172.30.190.179 172.30.134.16 7                      2                      True               False

Use netstat to confirm the active number of TCP sessions

NOTE: We might need to generate some IO to establish multiple connections; see the last section below on running some test IO.

PS C:\Windows\system32> .\NETSTAT.EXE -an | ? {$_ -match "445"}
  TCP    0.0.0.0:445            0.0.0.0:0              LISTENING
  TCP    172.30.190.179:49834   172.30.134.16:445      ESTABLISHED
  TCP    172.30.190.179:50217   172.30.134.16:445      ESTABLISHED
  TCP    172.30.190.179:50218   172.30.134.16:445      ESTABLISHED
  TCP    172.30.190.179:50220   172.30.134.16:445      ESTABLISHED
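The check above can be condensed into a one-liner that counts the established sessions (a sketch in PowerShell; the port filter is the only assumption):

```shell
# Count established TCP sessions to port 445; with multichannel active this should
# roughly match ConnectionCountPerRssNetworkInterface per client interface
(netstat -an | Select-String ":445\s" | Select-String "ESTABLISHED").Count
```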

Configure SMB client

NOTE: SMB signing can reduce performance by up to 70%! Also note that even after you disable it as shown below, it can be re-enabled automatically by an AD group policy.
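A quick way to check whether a group policy has flipped signing back on is to query both the SMB client configuration and the underlying registry values (a sketch; the key path is the standard LanmanWorkstation location):

```shell
# Effective SMB client signing settings -- both should be False once signing is disabled
Get-SmbClientConfiguration | Select-Object EnableSecuritySignature, RequireSecuritySignature

# The values group policy writes to; 1 here means a GPO has re-enabled signing
Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters" |
    Select-Object EnableSecuritySignature, RequireSecuritySignature
```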

  • Disable EnableSecuritySignature

  • Disable RequireSecuritySignature

  • Increase ConnectionCountPerRssNetworkInterface to at least 8

  • Increase MaxCmds to 100

Set-SmbClientConfiguration -EnableSecuritySignature $false -RequireSecuritySignature $false -ConnectionCountPerRssNetworkInterface 12 -MaxCmds 100

Example configuration

PS C:\Windows\system32> Get-SmbClientConfiguration

ConnectionCountPerRssNetworkInterface : 12
DirectoryCacheEntriesMax              : 16
DirectoryCacheEntrySizeMax            : 65536
DirectoryCacheLifetime                : 10
DormantFileLimit                      : 1023
EnableBandwidthThrottling             : True
EnableByteRangeLockingOnReadOnlyFiles : True
EnableInsecureGuestLogons             : False
EnableLargeMtu                        : True
EnableLoadBalanceScaleOut             : True
EnableMultiChannel                    : True
EnableSecuritySignature               : False
ExtendedSessionTimeout                : 1000
FileInfoCacheEntriesMax               : 64
FileInfoCacheLifetime                 : 10
FileNotFoundCacheEntriesMax           : 128
FileNotFoundCacheLifetime             : 5
KeepConn                              : 600
MaxCmds                               : 100
MaximumConnectionCountPerServer       : 32
OplocksDisabled                       : False
RequireSecuritySignature              : False
SessionTimeout                        : 60
UseOpportunisticLocking               : True
WindowSizeThreshold                   : 8

Tune CPU assignment

Use Get-NetAdapterRss and Set-NetAdapterRss to confirm the RSS CPU assignment:

  • The CPU profile is set to Closest

  • NumberOfReceiveQueues is set to at least 8

  • The NUMA distance from the adapter to the assigned cores is 0

Set-NetAdapterRss -Name "Ethernet 2" -MaxProcessors 16 -MaxProcessorNumber 30 -NumberOfReceiveQueues 16

For example, in the output below, RssProcessorArray contains cores with a NUMA distance of 0 but also cores with a distance of 32767, which means the performance we get from those cores will be suboptimal, as they are farther away from the adapter.

Name                                            : Ethernet 2
InterfaceDescription                            : Mellanox ConnectX-5 Adapter
Enabled                                         : True
NumberOfReceiveQueues                           : 40
Profile                                         : Closest
BaseProcessor: [Group:Number]                   : 0:0
MaxProcessor: [Group:Number]                    : 0:62
MaxProcessors                                   : 8
RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0  0:2/0  0:4/0  0:6/0  0:8/0  0:10/0  0:12/0  0:14/0
                                                  0:16/0  0:18/0  0:20/0  0:22/0  0:24/0  0:26/0  0:28/0  0:30/0
                                                  0:32/32767  0:34/32767  0:36/32767  0:38/32767  0:40/32767  0:42/32767  0:44/32767  0:46/32767
                                                  0:48/32767  0:50/32767  0:52/32767  0:54/32767  0:56/32767  0:58/32767  0:60/32767  0:62/32767

To correct that, we can set the max processor number and limit which processors are used.

Set-NetAdapterRss -Name "Ethernet 2" -MaxProcessorNumber 30

Name                                            : Ethernet 2
InterfaceDescription                            : Mellanox ConnectX-5 Adapter
Enabled                                         : True
NumberOfReceiveQueues                           : 16
Profile                                         : Closest
BaseProcessor: [Group:Number]                   : 0:0
MaxProcessor: [Group:Number]                    : 0:30
MaxProcessors                                   : 16
RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0  0:2/0  0:4/0  0:6/0  0:8/0  0:10/0  0:12/0  0:14/0
                                                  0:16/0  0:18/0  0:20/0  0:22/0  0:24/0  0:26/0  0:28/0  0:30/0
IndirectionTable: [Group:Number]                : 0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30
                                                  0:16  0:18    0:20    0:22    0:24    0:26    0:28    0:30

Configure the client adapter

  • Increase the MTU to the maximum the switches and the client adapter allow (jumbo frames).

  • Increase the receive and send ring sizes to the maximum supported by the adapter.
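These settings can also be applied from PowerShell instead of the adapter properties dialog. A sketch, assuming the adapter is named "Ethernet 2"; keyword names and maximum values vary by driver, so list them first:

```shell
# List the tunable properties and their current values for this adapter
Get-NetAdapterAdvancedProperty -Name "Ethernet 2" | Format-Table DisplayName, DisplayValue

# Jumbo frames (9014 bytes is a common maximum; match the switch MTU)
Set-NetAdapterAdvancedProperty -Name "Ethernet 2" -RegistryKeyword "*JumboPacket" -RegistryValue 9014

# Receive/send ring sizes (4096 is an assumed maximum; use what the adapter reports)
Set-NetAdapterAdvancedProperty -Name "Ethernet 2" -DisplayName "Receive Buffers" -DisplayValue 4096
Set-NetAdapterAdvancedProperty -Name "Ethernet 2" -DisplayName "Send Buffers" -DisplayValue 4096
```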

Mellanox ConnectX-5 Adapter Properties - Jumbo Packet

Mellanox ConnectX-5 Adapter Properties - Receive Buffers

Mellanox ConnectX-5 Adapter Properties - Send Buffers

Client-specific added notes

The section below is an add-on to the above recommendations.

HPE client

We found that to reach full performance with SMB3 multichannel, we had to disable Receive Segment Coalescing (RSC).
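RSC can be checked and disabled per adapter from PowerShell (the adapter name here is an example):

```shell
# Show the current RSC state for IPv4/IPv6, then disable it on this adapter
Get-NetAdapterRsc -Name "Ethernet 2"
Disable-NetAdapterRsc -Name "Ethernet 2"
```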

Environment

HPE Gen 10 Ice Lake clients with a dual-port ConnectX-6 100GbE adapter.

AMD client

For AMD clients with Mellanox adapters, we found that changing pci_wr_order is required to achieve the expected performance. (If not, you may be limited to around 2GB/sec read performance.)

Environment

  • Lenovo P620 with a single-port ConnectX-6 100GbE adapter.

This has to be done using the Mellanox MFT tool for Windows: NVIDIA Firmware Tools (MFT).


C:\WINDOWS\system32> mlxconfig -d /dev/mst/mt4123_pciconf0.1 s PCI_WR_ORDERING=1

Device #1:

Device type:    ConnectX6
Name:           MCX653106A-HDA_Ax
Description:    ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
Device:         /dev/mst/mt4123_pciconf0.1

Configurations:                              Next Boot       New
PCI_WR_ORDERING                     per_mkey(0)     force_relax(1)

Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.

Additional AMD BIOS settings:

  • To enable x2APIC (the default is XAPIC MSI interrupts), set the following in the BIOS:

    • Advanced → AMD CBS → CPU Common Options → Local APIC Mode → x2APIC

  • Enable “Preferred I/O” in PCI options.

  • Turn off CPU power savings.
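On the OS side, the CPU power-savings recommendation usually translates to selecting the High performance power plan; a sketch using the built-in plan GUID:

```shell
# Activate the built-in High performance power plan
powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c

# Confirm the active plan
powercfg /getactivescheme
```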

Troubleshooting

  • Confirm the network is working as expected between the client and the VAST CNode. If iperf cannot reach 8 to 9GB/sec from a Windows client, then SMB3-MC won’t either.

    • To validate network performance between Windows clients and VAST CNodes, use iperf2 to test the adapter's maximum bandwidth. (Not iperf3; it is not multi-threaded.)

    • Use the second example with -M 9000 -m -N -e to get more details, including retrans.

  • Monitor adapter counters

    • The Mellanox WinOF-2 driver package can be installed to gain extended debug counters from the adapter via Windows perfmon.

iPerf2

Download iperf2 exe file to the Windows client.
Test by simulating a read and a write.

Iperf 2 -  Browse Files at SourceForge.net

Read from Vast

Win:
.\iperf-2.1.8-win.exe -s -p 7575 -w 1024KB

cnode:
iperf -c Client_IP -p 7575 -i 5 -P 10 -w 1024KB
iperf -c Client_IP -p 7575 -i 5 -P 10 -w 2m -m -N -e

Write to Vast

cnode:
iperf -s -p 7575 -w 1024KB

Win:
iperf -c Vast_VIP -p 7575 -i 5 -P 10 -w 1024KB

Running IO

FIO and Frametest are tools that can be used here for test IO.

FIO

Windows download: https://bsdio.com/fio/releases/fio-3.27-x86.zip

Lay out a file

fio.exe --name=test --ioengine=windowsaio --rw=write --bs=1m --direct=1 --size=50g --numjobs=1 --thread=1 --iodepth=16 --group_reporting --fallocate=none --refill_buffers --randrepeat=0 --create_on_open=1

Read

fio.exe --name=test --ioengine=windowsaio --rw=randread --bs=1m --direct=1 --size=50g --numjobs=1 --thread=1 --iodepth=16 --group_reporting --time_based --fallocate=none --refill_buffers --randrepeat=0 --create_on_open=1 --runtime=600
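A matching write test can be run with the same flags (this variant is an addition to the runbook above, changing only the IO pattern):

```shell
fio.exe --name=test --ioengine=windowsaio --rw=randwrite --bs=1m --direct=1 --size=50g --numjobs=1 --thread=1 --iodepth=16 --group_reporting --time_based --fallocate=none --refill_buffers --randrepeat=0 --create_on_open=1 --runtime=600
```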

Frametest

Windows Download: How to use frametest

Create a directory and lay out the files.

frametest.exe -w 4k -t 20 4k_dir

Read

frametest.exe -r 4k -t 20 4k_dir

Other causes of bad performance:

  • Overheating of the motherboard chipset, CPU, or Mellanox card.

    • Check the Windows Event Viewer for system events, specifically mlx5 (the WinOF-2 driver will not report “overheating,” but the stock Windows driver will).

    • This was found to limit performance to 2.2GB/sec on Windows and 21 Gbit/sec with iperf, without any other warnings or event messages.

    • The same machine booted Linux and ran iperf at 93Gbit/s.

    • After improving cooling and swapping cards, the same machine reached 8GB/sec SMB3-MC performance under Windows.

    • Install WinOF-2 and the MFT tools, and then read the adapter temperature with mget_temp -d mt4123_pciconf0

  • PCIe cards in PCIe Gen3 x8 slots: an x8 Gen3 slot can still drive 7 to 7.5GB/sec on a 100GbE link, but no faster, and there are no warnings or indicators other than the Device Manager details for the CX6 interface.
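The negotiated PCIe link can also be read from PowerShell instead of Device Manager (assuming the driver exposes hardware info):

```shell
# A 100GbE CX6 needs Gen3 x16 (or a Gen4 slot) to go beyond ~7.5GB/sec
Get-NetAdapterHardwareInfo -Name "Ethernet 2" | Format-List PcieLinkSpeed, PcieLinkWidth
```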

  • Note that a firmware update for the CX6 will remove the pci_wr_order=relaxed setting (for AMD boxes). Check that setting again after any firmware update!
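A quick re-check after a firmware update (same device path as in the mlxconfig example above; q queries without changing anything):

```shell
mlxconfig -d /dev/mst/mt4123_pciconf0.1 q PCI_WR_ORDERING
```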

  • There is also a “CX6 power limit to 25W” UEFI/firmware setting that will likewise kill performance.