Quick Fix: Adjusting MMIO values in ESXi 8U3 to use Large GPUs

Recently I was asked to deploy a “Monster VM” with eight H200 GPUs attached. Everything looked simple, and I had never had problems with VMs using small vGPUs, but the first thing I ran into after powering on such a large VM was an error:

Error message from esxi-01: The firmware could not allocate 50331648 KB of PCI MMIO. Increase the size of PCI MMIO and try again.

Luckily, a few weeks earlier I had read a recent VMware document, “Deploy Distributed LLM Inference with GPUDirect RDMA over InfiniBand in VMware Private AI”, which covers exactly this situation.

I strongly recommend this document to anyone utilizing large GPU servers (HGX, DGX), particularly when cross-server communication is necessary.

Running such a large VM requires adjusting its MMIO configuration by adding two values to the VM’s advanced settings:

pciPassthru.use64bitMMIO = TRUE
pciPassthru.64bitMMIOSizeGB = 1024

MMIO size should be calculated based on the number and type of passthrough devices attached to the VM.

According to the doc above, each passthrough NVIDIA H100 (or H200) GPU requires 128 GB of MMIO space.

KB 323402 has more information about calculating the MMIO size, including an example that explains how to derive it from the GPU size.
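As a rough illustration of the arithmetic, here is a minimal Python sketch. The 128 GB per GPU figure comes from the document above; the function name and the step that rounds the total up to the next power of two are my own assumptions (the rounding follows commonly cited VMware sizing guidance), so please verify them against the KB for your hardware.

```python
def mmio_size_gb(gpu_count: int, per_gpu_mmio_gb: int = 128) -> int:
    """Estimate a value for pciPassthru.64bitMMIOSizeGB.

    per_gpu_mmio_gb defaults to 128, the figure quoted for a passthrough
    H100/H200. Rounding up to a power of two is an assumption, not
    something stated in this post.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    total = gpu_count * per_gpu_mmio_gb
    size = 1
    # Double until the candidate covers the total MMIO requirement.
    while size < total:
        size *= 2
    return size


if __name__ == "__main__":
    # Eight H200 GPUs: 8 * 128 GB = 1024 GB, already a power of two.
    print(f"pciPassthru.64bitMMIOSizeGB = {mmio_size_gb(8)}")
```

For eight GPUs this yields 1024, which matches the value used in the advanced settings above.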

With the MMIO settings adjusted, the VM boots successfully.


Preparing for the NVIDIA-Certified Associate: AI Infrastructure and Operations Exam

I am new to the AI field, and this year I decided it was time to sharpen my skills. One of the pillars of AI is infrastructure, which is somewhat different from the traditional infrastructure used for running typical applications and virtual machines.

I started with NVIDIA’s technologies and solutions, since NVIDIA is the leader in modern AI infrastructure. NVIDIA provides a lot of training materials, documentation, and certifications for its technologies. That looked like a good place to start, because I believe the best way to learn is by pursuing a certification.

Recently I completed the NVIDIA-Certified Associate: AI Infrastructure and Operations certification, and in this post, I want to share the materials I used to successfully pass the exam.

