If you’re using a virtual machine with multiple vGPUs and plan to use the NVIDIA Collective Communications Library (NCCL) for multi-GPU communication, you may face an error like this when running nccl-tests:
Test NCCL failure common.cu:1279 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. ollama-cb-01 pid 1598: Test failure common.cu:1100
In the detailed log we can see errors like:
init.cc:491 NCCL WARN Cuda failure 'operation not supported'
...
init.cc:491 NCCL WARN Cuda failure 'operation not supported'
...
Test NCCL failure common.cu:1279 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
...
One common reason for this issue in a VMware environment is UVM (Unified Memory), which is disabled by default in the virtual machine.
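To confirm that the guest really does not expose Unified Memory, you can query the managed-memory attribute of each visible GPU through the CUDA runtime. The short sketch below is only an illustration (it is not part of nccl-tests); on a vGPU VM without UVM enabled, the attribute is expected to report 0. Compile it with nvcc and run it inside the VM before and after the configuration change.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < deviceCount; ++dev) {
        int managed = 0;
        // cudaDevAttrManagedMemory reports whether the driver exposes
        // Unified Memory (UVM) for this device inside the guest.
        cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, dev);
        printf("GPU %d: managed memory %s\n", dev,
               managed ? "supported" : "NOT supported");
    }
    return 0;
}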
To enable UVM, power off the VM and add the following advanced parameters, one per vGPU attached to it:
pciPassthru0.cfg.enable_uvm = 1
pciPassthru1.cfg.enable_uvm = 1
The example above is for two vGPUs. For four vGPUs, you need four such parameters (and so on):
pciPassthru0.cfg.enable_uvm = 1
pciPassthru1.cfg.enable_uvm = 1
pciPassthru2.cfg.enable_uvm = 1
pciPassthru3.cfg.enable_uvm = 1
Thereafter, power on the VM, and the NCCL test will likely pass.
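As a quick sanity check after the reboot, and before re-running the full nccl-tests suite, you can try a small Unified Memory allocation directly; it exercises the same kind of call that produced the "operation not supported" failures in the NCCL log. Again, this is just an illustrative sketch:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Try a 1 MiB Unified Memory allocation. With UVM disabled in the
    // vGPU VM, this kind of call fails with "operation not supported".
    void *ptr = nullptr;
    cudaError_t err = cudaMallocManaged(&ptr, 1 << 20);
    if (err != cudaSuccess) {
        printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("cudaMallocManaged succeeded, UVM is available in this guest\n");
    cudaFree(ptr);
    return 0;
}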
Another problem that can prevent the NCCL test from passing is broken peer-to-peer (P2P) communication between vGPUs. To check it, you can run p2pBandwidthLatencyTest from the NVIDIA CUDA Samples package. If P2P is broken, you will see something like this in the output:
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0
If everything is OK, in the output we will see:
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
I faced this issue with the drivers from the NVIDIA AI Enterprise 7.1 package (580.95). The solution in my case was to update both the ESXi host driver and the guest VM driver to version 580.105 (AI Enterprise 7.3 package).
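If you only need the access matrix and not the bandwidth and latency numbers, a minimal peer-access query returns the same CAN/CANNOT information that p2pBandwidthLatencyTest prints. This is a simplified sketch of that check, not a replacement for the sample:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount < 2) {
        printf("Need at least two visible CUDA devices\n");
        return 1;
    }
    // Check every ordered pair of devices, the same relation that
    // p2pBandwidthLatencyTest reports as CAN/CANNOT Access Peer.
    for (int d = 0; d < deviceCount; ++d) {
        for (int p = 0; p < deviceCount; ++p) {
            if (d == p) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, d, p);
            printf("Device=%d %s Access Peer Device=%d\n",
                   d, canAccess ? "CAN" : "CANNOT", p);
        }
    }
    return 0;
}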