SR-IOV, LAGs, and affinity
For use cases that do not currently benefit from vSPU, the best that you can do is balance the load across all CPUs. You achieve this using SR-IOV, LAGs, and CPU affinity settings.
You likely need link aggregation, if not for throughput, then for resiliency. The considerations for a LAG differ when VFs are involved, but the main concepts are the same. The following diagram represents an example LAG-based topology with two NIC cards, each with two ports, in a single NUMA node.
This scenario tolerates the following:
- NIC port/link failure
- NIC card failure
- Switch failure
The design also highlights the need for the trust on setting discussed earlier: the VF must react to the status of the PF, because LACP does not provide the functionality that it would in an appliance-based deployment.
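As a reminder of that host-side setting, trust is granted to a VF on the PF using standard iproute2 tooling on a Linux/KVM host. The PF name and VF index in this sketch are placeholders only; substitute the values for your own host:

# Placeholder PF name and VF index; repeat for each VF that backs a LAG member.
ip link set dev enp59s0f0 vf 0 trust on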
You must configure LACP mode as static in this deployment scenario.
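As a minimal sketch of the corresponding aggregate interface on the FortiGate-VM, assuming the two VF-backed members presented to the VM are named port2 and port6 (interface names and addressing are illustrative only):

config system interface
    edit "lag1"
        set vdom "root"
        set type aggregate
        set member "port2" "port6"
        set lacp-mode static
        set ip 192.0.2.1 255.255.255.0
        set allowaccess ping
    next
end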
In this diagram, the PF uses an external VLAN tag to separate traffic to the respective VFs, and the VM is unaware of this external VLAN.
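On a Linux/KVM host with iproute2, this kind of external, per-VF VLAN is set on the PF. The following is only a sketch; the PF name, VF index, and VLAN ID are placeholders:

# Placeholder PF name, VF index, and VLAN ID; the guest never sees this outer tag.
ip link set dev enp59s0f0 vf 0 vlan 100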
Without vSPU, there is no PMD, and the NIC uses interrupts to signal that there is network traffic for the CPU to process. To get a performant system without vSPU, you must take care to balance the number of interrupts that each CPU receives.
Using the same layout as shown in the diagram, find the relevant system interrupts/queues:
diagnose hardware sysinfo interrupts | grep "CPU\|port"
             CPU0      CPU1   <...>    CPU15
  47:      119912         0   <...>        0   PCI-MSI-edge   iavf-port2-TxRx-0
  48:           0    200309   <...>        0   PCI-MSI-edge   iavf-port2-TxRx-1
  49:           0         0   <...>        0   PCI-MSI-edge   iavf-port2-TxRx-2
  50:           0         0   <...>        0   PCI-MSI-edge   iavf-port2-TxRx-3
<...>
  67:      254849         0   <...>        0   PCI-MSI-edge   iavf-port6-TxRx-0
  68:           0    443186   <...>        0   PCI-MSI-edge   iavf-port6-TxRx-1
  69:           0         0   <...>        0   PCI-MSI-edge   iavf-port6-TxRx-2
  70:           0         0   <...>        0   PCI-MSI-edge   iavf-port6-TxRx-3
<...>
  87:       72971         0   <...>        0   PCI-MSI-edge   iavf-port10-TxRx-0
  88:           0    376044   <...>        0   PCI-MSI-edge   iavf-port10-TxRx-1
  89:           0         0   <...>        0   PCI-MSI-edge   iavf-port10-TxRx-2
  90:           0         0   <...>        0   PCI-MSI-edge   iavf-port10-TxRx-3
<...>
 107:      197132         0   <...>        0   PCI-MSI-edge   iavf-port14-TxRx-0
 108:           0    421851   <...>        0   PCI-MSI-edge   iavf-port14-TxRx-1
 109:           0         0   <...>        0   PCI-MSI-edge   iavf-port14-TxRx-2
 110:           0         0   <...>        0   PCI-MSI-edge   iavf-port14-TxRx-3
<...>
 122:           0         0   <...>        0   PCI-MSI-edge   iavf-port17-TxRx-0
 123:           0         0   <...>        0   PCI-MSI-edge   iavf-port17-TxRx-1
 124:           0         0   <...>        0   PCI-MSI-edge   iavf-port17-TxRx-2
 125:           0         0   <...>   345768   PCI-MSI-edge   iavf-port17-TxRx-3
<...>
The interrupt names can differ. For example, the Mellanox ConnectX-5 NIC card has ten interrupts/queues per port, named port2-0 through port2-9.
The idea is to spread the interrupts across CPUs to balance the load across all system resources. Using the interrupt names shown in the output, you can pin them to particular CPUs:
config system affinity-interrupt
    edit 20
        set interrupt "iavf-port2-TxRx-0"
        set affinity-cpumask "0x0000000000000001"
    next
    edit 21
        set interrupt "iavf-port2-TxRx-1"
        set affinity-cpumask "0x0000000000000002"
    next
    edit 22
        set interrupt "iavf-port2-TxRx-2"
        set affinity-cpumask "0x0000000000000004"
    next
    edit 23
        set interrupt "iavf-port2-TxRx-3"
        set affinity-cpumask "0x0000000000000008"
    next
<...>
    edit 60
        set interrupt "iavf-port6-TxRx-0"
        set affinity-cpumask "0x0000000000000001"
    next
    edit 61
        set interrupt "iavf-port6-TxRx-1"
        set affinity-cpumask "0x0000000000000002"
    next
    edit 62
        set interrupt "iavf-port6-TxRx-2"
        set affinity-cpumask "0x0000000000000004"
    next
    edit 63
        set interrupt "iavf-port6-TxRx-3"
        set affinity-cpumask "0x0000000000000008"
    next
<...>
    edit 170
        set interrupt "iavf-port17-TxRx-0"
        set affinity-cpumask "0x0000000000001000"
    next
    edit 171
        set interrupt "iavf-port17-TxRx-1"
        set affinity-cpumask "0x0000000000002000"
    next
    edit 172
        set interrupt "iavf-port17-TxRx-2"
        set affinity-cpumask "0x0000000000004000"
    next
    edit 173
        set interrupt "iavf-port17-TxRx-3"
        set affinity-cpumask "0x0000000000008000"
    next
end
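Each affinity-cpumask value is a hexadecimal bitmask in which bit N selects CPU N: 0x0000000000000001 is CPU0, 0x0000000000000002 is CPU1, 0x0000000000000008 is CPU3, and 0x0000000000008000 is CPU15. If it helps to double-check a mask, the following generic shell one-liner (not FortiOS CLI, shown only as an illustration) prints the mask for a given CPU number:

# Print the mask that selects only CPU 12 -> 0x0000000000001000
printf "0x%016x\n" $((1 << 12))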
This maps each of an interface's four queues to one of the four CPUs in a group, and it also reuses the same group of four CPUs across four interfaces, as the following diagrams show. This interleaving of the functions produces an even interrupt distribution, which gives the most performant deployment scenario.
In the case of a failure, for example a NIC card failure, this interleaving model ensures that the interfaces expected to carry the most traffic are processed by different CPUs, as the diagram shows, keeping performance as high as possible.
Working out how best to balance the interrupts is the main thing to address in these circumstances. In the example, each port has four queues/interrupts that you can map, which makes a VM16 (16 CPUs) a good match for four PFs. The SR-IOV VLAN filtering and the resultant LAG configuration provide the interleaving, which helps balance the load across all CPUs.
Similarly, a VM32 may be best serviced with eight PFs. The NIC card may also allow you to configure how many PFs it presents. For example, a NIC presenting 4 x 10G ports may be used more effectively across the CPUs than one presenting 1 x 40G port.
Where there is little flexibility in the use of transparent VLANs or in the number of PFs, the best option may be to affine some services, such as IPS, logging, or web filtering, to CPUs that are not used for traffic, which still makes effective use of the CPUs.
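As an illustrative sketch of that approach, several daemons can be pinned with affinity settings under config system global. The available options and the exact value format vary by FortiOS version and model, and the mask shown here (CPUs 12-15) is only an assumption:

config system global
    set av-affinity "0xf000"
    set wad-affinity "0xf000"
    set miglog-affinity "0xf000"
end

This keeps scanning, proxying, and logging work off the CPUs that the traffic interrupts are pinned to.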
In practice, there is significant flexibility, which should allow you to find a performance sweet spot in most scenarios.