CPU pinning
vcpu represents the vCPU allocated to and seen by the guest. cpuset represents the physical CPU thread it is pinned to. Comparing this to the earlier output in Tuned indirect update, only the CPUs from one NUMA node are selected, matching the memory allocation, and this should correlate to the NUMA node of the NIC.
emulatorpin specifies that the emulator threads are pinned to the same set of physical CPUs used by the vcpupin entries in the definition.
<vcpu placement='static'>32</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='5'/>
  <vcpupin vcpu='1' cpuset='41'/>
  <vcpupin vcpu='2' cpuset='7'/>
  <vcpupin vcpu='3' cpuset='43'/>
  <vcpupin vcpu='4' cpuset='9'/>
  <vcpupin vcpu='5' cpuset='45'/>
  <vcpupin vcpu='6' cpuset='11'/>
  <vcpupin vcpu='7' cpuset='47'/>
  <vcpupin vcpu='8' cpuset='13'/>
  <vcpupin vcpu='9' cpuset='49'/>
  <vcpupin vcpu='10' cpuset='15'/>
  <vcpupin vcpu='11' cpuset='51'/>
  <vcpupin vcpu='12' cpuset='17'/>
  <vcpupin vcpu='13' cpuset='53'/>
  <vcpupin vcpu='14' cpuset='19'/>
  <vcpupin vcpu='15' cpuset='55'/>
  <vcpupin vcpu='16' cpuset='21'/>
  <vcpupin vcpu='17' cpuset='57'/>
  <vcpupin vcpu='18' cpuset='23'/>
  <vcpupin vcpu='19' cpuset='59'/>
  <vcpupin vcpu='20' cpuset='25'/>
  <vcpupin vcpu='21' cpuset='61'/>
  <vcpupin vcpu='22' cpuset='27'/>
  <vcpupin vcpu='23' cpuset='63'/>
  <vcpupin vcpu='24' cpuset='29'/>
  <vcpupin vcpu='25' cpuset='65'/>
  <vcpupin vcpu='26' cpuset='31'/>
  <vcpupin vcpu='27' cpuset='67'/>
  <vcpupin vcpu='28' cpuset='33'/>
  <vcpupin vcpu='29' cpuset='69'/>
  <vcpupin vcpu='30' cpuset='35'/>
  <vcpupin vcpu='31' cpuset='71'/>
  <emulatorpin cpuset='5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71'/>
</cputune>
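After the guest is started, the effective pinning can be verified from the host with virsh. The following is a minimal check, assuming a libvirt domain named fortigate-vm and an example interface name; substitute the actual domain and interface names from your deployment:

# Show the vCPU-to-physical-CPU pinning of the running guest;
# this should match the <vcpupin> entries above.
virsh vcpupin fortigate-vm

# Show the physical CPUs the emulator threads are confined to;
# this should match the <emulatorpin> cpuset above.
virsh emulatorpin fortigate-vm

# Confirm which NUMA node the NIC belongs to (ens3f0 is only an
# example name); it should be the same node as the pinned CPUs and memory.
cat /sys/class/net/ens3f0/device/numa_node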
When considering CPU pinning for the best performance, hyperthreading may be a concern. Refer again to the lstopo-no-graphics output:
[root@rhel-tiger-14-6 ~]# lstopo-no-graphics
Machine (187GB total)
  <output omitted for brevity>
  Package L#1
    NUMANode L#1 (P#1 94GB)
    L3 L#1 (25MB)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#1)
        PU L#37 (P#37)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#3)
        PU L#39 (P#39)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#5)
        PU L#41 (P#41)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#7)
        PU L#43 (P#43)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#9)
        PU L#45 (P#45)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#11)
        PU L#47 (P#47)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#13)
        PU L#49 (P#49)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#15)
        PU L#51 (P#51)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#17)
        PU L#53 (P#53)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#19)
        PU L#55 (P#55)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#21)
        PU L#57 (P#57)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#23)
        PU L#59 (P#59)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#25)
        PU L#61 (P#61)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#27)
        PU L#63 (P#63)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
        PU L#64 (P#29)
        PU L#65 (P#65)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
        PU L#66 (P#31)
        PU L#67 (P#67)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
        PU L#68 (P#33)
        PU L#69 (P#69)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
        PU L#70 (P#35)
        PU L#71 (P#71)
<output omitted for brevity>
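The sibling relationships can also be read directly from sysfs or lscpu, which is often quicker than parsing the full topology. For example, using physical CPU 5 from the pinning above:

# List the hyperthread siblings of physical CPU 5; on this host it
# should report 5,41, matching the lstopo-no-graphics output.
cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list

# Alternatively, list core, socket, and NUMA node membership per CPU.
lscpu --extended=CPU,CORE,SOCKET,NODE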
To get the best performance from a vCPU, it may be necessary to avoid using the second thread on a physical core. The lstopo-no-graphics output shows that, with hyperthreading, the two threads of a core share the same L1 and L2 cache. For example, physical CPUs 5 and 41 are two threads of the same core, so vCPU 0 and vCPU 1, which are pinned to them, share those caches. This may lead to contention and performance degradation.
However, using the second thread of each core to increase the number of vCPUs is better than not providing it to the VM at all. The consideration is therefore how to use the FortiGate-VM license effectively across the available hardware.
The L3 cache is shared across the whole NUMA node in this case, so there is potential for contention there as well.
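If testing shows that sibling-thread contention matters more than the vCPU count for a given deployment, the pinning can be narrowed to one thread per core without rebuilding the guest. The following is only a sketch using virsh, assuming a domain named fortigate-vm and a reduced vCPU count so that each remaining vCPU has a whole core to itself:

# Re-pin vCPUs so that only the first thread of each core is used
# (5, 7, 9, ...) and the sibling threads (41, 43, 45, ...) stay idle.
# --live changes the running guest, --config persists it in the XML.
virsh vcpupin fortigate-vm 0 5 --live --config
virsh vcpupin fortigate-vm 1 7 --live --config
virsh vcpupin fortigate-vm 2 9 --live --config
virsh vcpupin fortigate-vm 3 11 --live --config
# ...repeat for the remaining vCPUs, one unshared thread per core.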