Quick Fix: Updating ESXi GPU driver with vLCM – files may still be in use

Recently I needed to update the NVIDIA GPU driver on my ESXi 8.0U3 hosts. Although the process looks simple (upload the new driver to the depot, edit the image, and remediate the hosts), I ran into an unexpected issue during remediation of the cluster:

Remediation of cluster failed
 Remediation failed for Host 'gpu-esxi-01'
gpu-esxi-01 - Failed to remediate host
 Remediation failed for Host 'gpu-esxi-01'
 Failed to remove Component NVD-AIE-800(580.95.02-1OEM.800.1.0.20613240), files may still be in use.

Keeping in mind that the host is in maintenance mode and there are no VMs running on it, the only things that can still be using the GPU driver are services such as Xorg and/or the vGPU Manager. In my case the Xorg service wasn't running, but the vGPU Manager was.
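
To check which of these services is active, you can query their status directly on the host (a quick sketch; the Xorg check is shown here, and the vGPU Manager check appears in step 3 below):

[root@gpu-esxi-01:~] /etc/init.d/xorg status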

Therefore, the workaround in this situation was simple (but I hope there is a better way):
1. Place the ESXi host into maintenance mode and make sure no VMs are running on it;
2. Enable SSH and connect;
3. Stop the vGPU manager service:

[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon status
daemon_nvdGpuMgmtDaemon is running

[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon stop

[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon status
daemon_nvdGpuMgmtDaemon is not running

4. Return to the cluster update section and remediate only this one host (the one with the stopped vGPU Manager). This time the new driver should install without any problems;
5. After the remediation finishes, reboot the host;
6. Take the host out of maintenance mode;
7. Repeat these steps for each remaining host.

After rebooting, you will see that the driver is updated and the host is compliant with the new cluster image.
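
To double-check from the host side, you can list the installed NVIDIA components and query the vGPU Manager (a quick sketch; the exact component name in the output depends on the driver package):

[root@gpu-esxi-01:~] esxcli software vib list | grep -i nvd
[root@gpu-esxi-01:~] nvidia-smi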

Quick Fix: Using NCCL with multi-vGPU VMware VMs

If you're using a virtual machine with multiple vGPUs and considering the NVIDIA Collective Communications Library (NCCL) to implement multi-GPU communication, you may face an error like this when running nccl-tests:

Test NCCL failure common.cu:1279 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. ollama-cb-01 pid 1598: Test failure common.cu:1100

In the detailed log we can see errors like:

init.cc:491 NCCL WARN Cuda failure 'operation not supported'
...
init.cc:491 NCCL WARN Cuda failure 'operation not supported'
...
Test NCCL failure common.cu:1279 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
...

One common cause of this issue in a VMware environment is Unified Memory (UVM), which is disabled by default in the virtual machine.

To enable UVM, power off the VM and add advanced parameters, one per vGPU attached to it:

pciPassthru0.cfg.enable_uvm = 1
pciPassthru1.cfg.enable_uvm = 1

The example above is for two vGPUs. For four vGPUs, you need four parameters, one per vGPU (and so on):

pciPassthru0.cfg.enable_uvm = 1
pciPassthru1.cfg.enable_uvm = 1
pciPassthru2.cfg.enable_uvm = 1
pciPassthru3.cfg.enable_uvm = 1
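
The pattern is one enable_uvm parameter per vGPU index, so if you have many vGPUs you can generate the lines to paste into the advanced settings with a trivial shell loop (a sketch; adjust the indices to match the number of vGPUs):

for i in 0 1 2 3; do echo "pciPassthru${i}.cfg.enable_uvm = 1"; done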

Thereafter, power on the VM, and the NCCL test should pass.
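
To verify, you can rerun the test from the nccl-tests repository with debug output enabled; a minimal sketch, assuming the tests are already built and two vGPUs are attached to the VM:

NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2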

Another problem that can prevent the NCCL test from passing is broken peer-to-peer (P2P) vGPU communication. To check P2P, you can run p2pBandwidthLatencyTest from the NVIDIA cuda-samples package. If P2P is broken, you will see something like this in the output:

Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

If everything is OK, the output will show:

Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
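
If you need to build the test yourself, it is part of the NVIDIA cuda-samples repository; a rough sketch, assuming a recent release (the exact path and build steps vary between cuda-samples versions, with newer ones using CMake):

git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest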

I faced this issue with the drivers from the AI Enterprise 7.1 package (580.95). The solution in my case was to update both the ESXi host and the VM guest drivers to version 580.105 (AI Enterprise 7.3 package).

Using NVIDIA vGPUs with VMware vSphere

Today, AI is everywhere, and everyone wants a VM with a GPU adapter to deploy, test, or play with AI/ML. Although it is not a problem to pass a whole GPU to a VM as a PCI device, sometimes that is overkill.

For example, running a small model requires only a small amount of GPU memory, while our server is equipped with modern NVIDIA H200, B200, or even B300 GPUs that have a large amount of memory.

And this is where vGPU comes into play. vGPU allows us to divide a GPU into smaller pieces and share it among a number of VMs located on the host.

In this article, we will focus on how to configure ESXi hosts to run VMs with vGPU support in vSphere 8.0 Update 3.

Continue reading “Using NVIDIA vGPUs with VMware vSphere”

Quick Fix: Adjusting MMIO values in ESXi 8U3 to use Large GPUs

Recently I was asked to deploy a “Monster VM” with 8 H200 GPUs on board. Although everything looked simple, and there hadn't been any problems with VMs using small vGPUs, the first thing I faced after powering on such a large VM was this error:

Error message from esxi-01: The firmware could not allocate 50331648 KB of PCI MMIO. Increase the size of PCI MMIO and try again.

Luckily, I had read a recent VMware document, “Deploy Distributed LLM Inference with GPUDirect RDMA over InfiniBand in VMware Private AI”, a few weeks earlier, and this exact point was covered there.

I strongly recommend this document to anyone utilizing large GPU servers (HGX, DGX), particularly when cross-server communication is necessary.

Running such a large VM requires adjusting its MMIO configuration by adding two values to the VM's advanced settings:

pciPassthru.use64bitMMIO = TRUE
pciPassthru.64bitMMIOSizeGB = 1024

MMIO size should be calculated based on the number and type of passthrough devices attached to the VM.

According to the doc above, each passthrough NVIDIA H100 (or H200) GPU requires 128 GB of MMIO space.
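
For the VM in question, a quick back-of-the-envelope calculation based on that figure gives:

8 GPUs x 128 GB = 1024 GB  ->  pciPassthru.64bitMMIOSizeGB = 1024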

You can find more information about calculating the MMIO size in KB 323402; please refer to its example, which explains how to calculate the MMIO size based on the GPU size.

After adjusting MMIO settings, the VM will boot successfully.

Preparing for the NVIDIA-Certified Associate: AI Infrastructure and Operations Exam

I am new to the AI field, and this year I decided it was time to sharpen my skills. One of the pillars of AI is infrastructure, which is somewhat different from the traditional infrastructure used for running typical applications and virtual machines.

I started with NVIDIA technologies and solutions, as NVIDIA is the leader in modern AI infrastructure. NVIDIA provides a lot of training materials, documentation, and certifications for its technologies. This looked like a good way to start, because I believe that the best way to learn is to pursue a certification.

Recently I completed the NVIDIA-Certified Associate: AI Infrastructure and Operations certification, and in this post, I want to share the materials I used to successfully pass the exam.

Continue reading “Preparing for the NVIDIA-Certified Associate: AI Infrastructure and Operations Exam”
