Quick Fix: Updating ESXi GPU driver with vLCM – files may still be in use

Recently I needed to update the NVIDIA GPU driver on my ESXi 8.0U3 hosts. The procedure looks simple (upload the new driver to the depot, edit the image, and update the hosts), but I ran into an unexpected issue during remediation of the cluster:

Remediation of cluster failed
 Remediation failed for Host 'gpu-esxi-01'
gpu-esxi-01 - Failed to remediate host
 Remediation failed for Host 'gpu-esxi-01'
 Failed to remove Component NVD-AIE-800(580.95.02-1OEM.800.1.0.20613240), files may still be in use.

Since the host is in maintenance mode and no VMs are running on it, the only thing that can still be using the GPU driver is a service such as Xorg and/or the vGPU Manager. In my case the Xorg service wasn’t running, but the vGPU Manager was.

Therefore, the workaround in this situation was simple (but I hope there is a better way):
1. Place the ESXi host into maintenance mode and make sure no VMs are running on it;
2. Enable SSH and connect;
3. Stop the vGPU manager service:

[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon status
daemon_nvdGpuMgmtDaemon is running

[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon stop

[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon status
daemon_nvdGpuMgmtDaemon is not running

4. Return to the cluster update section, and remediate only one host (with a stopped vGPU manager). This time there should not be any problems with installing a new driver;
5. After finishing the remediation, reboot the host;
6. Take the host out of maintenance mode;
7. Repeat these steps for each host.

After rebooting, you will see that the driver is updated and the host is compliant with the new cluster image.
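Step 3 above can be wrapped in a small helper, shown here as a minimal sketch. The function name `stop_vgpu_manager` and its optional path argument are my own assumptions for illustration; only the init script path and the status messages come from the session above.

```shell
#!/bin/sh
# Stop the vGPU manager daemon only if its init script reports it as running.
# The default path is the one used in the session above; the function name
# and the optional path argument are illustrative assumptions.
stop_vgpu_manager() {
    init_script="${1:-/etc/init.d/nvdGpuMgmtDaemon}"

    # The status output reads "... is running" or "... is not running";
    # the pattern below matches only the first form.
    if "$init_script" status | grep -q "is running"; then
        "$init_script" stop
    fi

    # Print the final state so it can be checked before remediating the host.
    "$init_script" status
}
```

Calling `stop_vgpu_manager` without arguments on the host in maintenance mode should end with the "is not running" status line, after which the host can be remediated.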


Quick Fix: Using NCCL with multi-vGPU VMware VMs

If you’re using a virtual machine with multiple vGPUs and considering using NVIDIA Collective Communications Library (NCCL) to implement multi-GPU communications, you may face an error like this during nccl-test:

Test NCCL failure common.cu:1279 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. ollama-cb-01 pid 1598: Test failure common.cu:1100

In the detailed log we can see errors like:

init.cc:491 NCCL WARN Cuda failure 'operation not supported'
...
init.cc:491 NCCL WARN Cuda failure 'operation not supported'
...
Test NCCL failure common.cu:1279 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
...

One common reason for this issue in a VMware environment is that UVM (Unified Memory) is disabled by default for vGPUs in the virtual machine.

To enable UVM, power off the VM and add advanced parameters, based on the number of vGPUs attached to it:

pciPassthru0.cfg.enable_uvm = 1
pciPassthru1.cfg.enable_uvm = 1

The example above is for two vGPUs. For four vGPUs, add four parameters, one per vGPU (and so on):

pciPassthru0.cfg.enable_uvm = 1
pciPassthru1.cfg.enable_uvm = 1
pciPassthru2.cfg.enable_uvm = 1
pciPassthru3.cfg.enable_uvm = 1

After that, power on the VM, and the NCCL test will likely pass.
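For VMs with many vGPUs, the parameter list can be generated instead of typed by hand. This is a small sketch; the function name `uvm_params` is hypothetical, and it assumes the vGPUs occupy the `pciPassthru` indices 0 to N-1, as in the examples above. If other passthrough devices are attached to the VM, check the actual indices first.

```shell
#!/bin/sh
# Print one enable_uvm advanced parameter per vGPU.
# Assumption: the vGPUs use pciPassthru indices 0..count-1.
uvm_params() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "pciPassthru${i}.cfg.enable_uvm = 1"
        i=$((i + 1))
    done
}
```

Running `uvm_params 4` prints the same four lines as the four-vGPU example.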

Another problem that can prevent the NCCL test from passing is broken P2P communication between vGPUs. To check it, you can run p2pBandwidthLatencyTest from the NVIDIA cuda-samples package. If P2P is broken, you will see something like this in the output:

Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

If everything is OK, the output will contain:

Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
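To avoid reading the whole output by eye, the check can be scripted. The sketch below is only an illustration: the function name `p2p_ok` is hypothetical, and it looks for nothing more than the "CANNOT Access Peer" string shown above, so empty input also counts as a pass.

```shell
#!/bin/sh
# Read p2pBandwidthLatencyTest output on stdin.
# Returns 0 if no device pair reports "CANNOT Access Peer", 1 otherwise.
p2p_ok() {
    if grep -q "CANNOT Access Peer"; then
        return 1
    fi
    return 0
}
```

For example, `./p2pBandwidthLatencyTest | p2p_ok && echo "P2P looks fine"` prints the message only when no pair is blocked (the path to the binary depends on where you built the samples).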

I faced this issue with the drivers from the AI Enterprise 7.1 package (580.95). The solution in my case was to update ESXi and VM drivers to version 580.105 (AI Enterprise 7.3 package).


Veeam 13.0.1 GA

Veeam just released version 13.0.1, and we can consider this edition a full v13 launch for all Veeam products, including the Backup Server on Windows. Previously, most of the v13 features were available only on the recently launched Veeam Software Appliance.

Download the newest versions or read the release notes on the portal.

What’s new in v13? – Official and extra-large document. My short (but not full) version.

If you haven’t heard about the Veeam Software Appliance (a Linux-based Backup Server) before, check out my walkthrough article.

Keep in mind: if you are using a Backup Server on Windows and want to update to version 13.0.1, the Backup Server must be running version 12.3.1 or later. Follow this KB for more information.


Using NVIDIA vGPUs with VMware vSphere

Today, AI is everywhere, and everyone wants a VM with a GPU adapter to deploy/test/play with AI/ML. Although it is not a problem to add a PCI device to the VM (whole GPU), sometimes it can be overkill.

For example, running a small model requires only a small amount of GPU memory, while our server is equipped with modern NVIDIA H200, B200, or even B300 GPUs that have a large amount of memory.

And this is where vGPU comes into play. vGPU allows us to divide a GPU into smaller pieces and share it among a number of VMs located on the host.

In this article, we will focus on how to configure ESXi hosts to run VMs with the vGPU support in vSphere 8.0 Update 3.

Continue reading “Using NVIDIA vGPUs with VMware vSphere”


Latest updates in VMware certification programs

In the last few months, many changes happened to VMware certifications, and in this short post, I want to cover those changes.

First and foremost: since October 31, legacy certifications are no longer available, which means we cannot schedule VCP-DCV or VCAP-Design exams anymore.

What about the currently available certifications? As usual, we can get all the information from the VMware certification page.

Nowadays, the certification focuses on two products: primarily on VMware Cloud Foundation and, in addition, on VMware vSphere Foundation.

Three levels of certification are available: Professional, Advanced-Professional, and Expert.

There are three types of professional exams:

  1. Administration – focused on administering and implementing the product;
  2. Support – focused on troubleshooting;
  3. Architect – focused on designing the solution.

While administration and support are available for VCF and VVF products, the architect exam is available only for the VCF solution stack.

In total, we have five professional-level exams:

  1. VMware Certified Professional – VMware vSphere Foundation Support (2V0-18.25), blueprint;
  2. VMware Certified Professional – VMware Cloud Foundation Support (2V0-15.25), blueprint;
  3. VMware Certified Professional – VMware vSphere Foundation Administrator (2V0-16.25), blueprint;
  4. VMware Certified Professional – VMware Cloud Foundation Administrator (2V0-17.25), blueprint;
  5. VMware Certified Professional – VMware Cloud Foundation Architect (2V0-13.25), blueprint.

    If you are using only the vSphere Foundation stack, you can start with 2V0-16.25 and 2V0-18.25. For VCF administrators, the starting point could be the 2V0-17.25 and 2V0-15.25 exams. The next step is the Architect exam, 2V0-13.25; I assume it’s kind of like the VCAP-Design exam.

    The next step is three advanced-professional exams:

    1. VMware Certified Advanced Professional – VMware Cloud Foundation 9.0 vSphere Kubernetes Service (3V0-24.25), blueprint;
    2. VMware Certified Advanced Professional – VMware Cloud Foundation 9.0 Automation (3V0-21.25), blueprint;
    3. VMware Certified Advanced Professional – VMware Cloud Foundation 9.0 Operations (3V0-22.25), blueprint.

    You can see from the names that each advanced-level exam focuses on a specific topic – Kubernetes, Automation, or Operations.

    Each professional and advanced exam costs $250, consists of 60 questions, has a passing score of 300 (as usual), and has a time limit of 135 minutes.

    The last step is VCDX – VMware Certified Distinguished Expert. What? Where is the “Design Expert”? Currently, there is limited information available on the internet, aside from this post. As of writing, the official certification page has not been updated. Therefore, I will add more info about VCDX later.


    Quick Fix: Adjusting MMIO values in ESXi 8U3 to use Large GPUs

    Recently I was asked to deploy a “Monster VM” with 8 H200 GPUs aboard. Although everything looked simple, and there hadn’t been any problems with VMs with small vGPUs, the first thing I faced after powering on such a large VM was an error:

    Error message from esxi-01: The firmware could not allocate 50331648 KB of PCI MMIO. Increase the size of PCI MMIO and try again.

    Luckily, I had read a recent VMware document, “Deploy Distributed LLM Inference with GPUDirect RDMA over InfiniBand in VMware Private AI”, a few weeks before, and this issue was covered there.

    I strongly recommend this document to anyone utilizing large GPU servers (HGX, DGX), particularly when cross-server communication is necessary.

    Running such a large VM requires adjusting its MMIO settings by adding two values to the VM’s advanced settings:

    pciPassthru.use64bitMMIO = TRUE
    pciPassthru.64bitMMIOSizeGB = 1024

    MMIO size should be calculated based on the number and type of passthrough devices attached to the VM.

    According to the doc above, each passthrough NVIDIA H100 (or H200) GPU requires 128 GB of MMIO space.

    You can obtain more information about calculating the MMIO size in KB 323402. Please refer to the example, which explains how to calculate MMIO size based on the GPU size.
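The value of 1024 used above follows from simple multiplication: 8 GPUs × 128 GB each. As a rough sketch, the two settings can be derived like this. The function names are hypothetical, and the 128 GB default applies only to the H100/H200 figure quoted above; for other devices, take the per-device size from the KB.

```shell
#!/bin/sh
# MMIO size in GB = number of passthrough GPUs * MMIO space per GPU.
# The 128 GB default is the H100/H200 figure quoted above (an assumption
# for any other device type).
mmio_size_gb() {
    gpu_count="$1"
    per_gpu_gb="${2:-128}"
    echo $((gpu_count * per_gpu_gb))
}

# Print the two advanced settings for a VM with the given number of GPUs.
mmio_params() {
    echo "pciPassthru.use64bitMMIO = TRUE"
    echo "pciPassthru.64bitMMIOSizeGB = $(mmio_size_gb "$@")"
}
```

For the VM in this post, `mmio_params 8` produces the same two lines shown earlier.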

    After adjusting MMIO settings, the VM will boot successfully.


    What’s new at Nutanix University? NCP-MCA 6.10 and NCP-EUC 6.10 are open for scheduling

    Nutanix Certified Professional – Multicloud Automation (NCP-MCA) 6.10 and Nutanix Certified Professional – End User Computing (NCP-EUC) 6.10 are ready to leave the beta state and are now open for scheduling, with the appointments starting on November 18, 2025.

    Great news: now you can book one exam for free using the NCPMCAEUC610 voucher during checkout. It could be NCP-MCA or NCP-EUC. The voucher is valid only for one exam of your choice and only for the first 250 participants. So, hurry!

    You can schedule both exams from the Nutanix University Certifications page.

    And please remember to check out the updated courses: Nutanix Multicloud Automation Administration (NMCAA) and Nutanix End User Computing Administration (NEUCA). They will help you prepare.

    For more information, please check the official announcement.


    Updating VMware ESXi 9 cluster

    In the previous articles, we updated VMware Cloud Foundation Operations and vCenter Server to version 9.0.1, and just to complete the series, I will add one more post about updating vSphere hosts using a single-cluster image.

    Although the overall update procedure is the same and simple, you may have heard of, or even run into, the new token-based authentication for downloading updates from the Broadcom repositories, and in this article, I will cover that too.

    Continue reading “Updating VMware ESXi 9 cluster”


    Updating VMware vCenter Server 9

    In the previous article, we updated VMware Cloud Foundation Operations to version 9.0.1 and now it is time to update vCenter Server.

    Although the overall update procedure is the same and simple, you may have heard of, or even run into, the new token-based authentication for downloading updates from the Broadcom repositories, and in this article, I will cover that too.

    Continue reading “Updating VMware vCenter Server 9”


    Veeam Backup & Replication 12.3.2.4165 Patch and critical security issues

    Veeam just dropped a new KB covering three security fixes, two of them critical:
    CVE-2025-48983, CVSS v3.1 Score: 9.9:
    A vulnerability in the Mount service of Veeam Backup & Replication, which allows for remote code execution (RCE) on the Backup infrastructure hosts by an authenticated domain user.

    CVE-2025-48984, CVSS v3.1 Score: 9.9:
    A vulnerability allowing remote code execution (RCE) on the Backup Server by an authenticated domain user.

    Both vulnerabilities only impact domain-joined Veeam Backup & Replication v12.

    The vulnerabilities affect Veeam Backup & Replication 12.3.2.3617 and all earlier version 12 builds and were fixed in the latest VBR release, 12.3.2.4165, so consider updating as soon as possible.

    The third vulnerability is in Veeam Agent for Microsoft Windows.
    CVE-2025-48982, CVSS v3.1 Score: 7.3:
    This vulnerability in Veeam Agent for Microsoft Windows allows for Local Privilege Escalation if a system administrator is tricked into restoring a malicious file.

    Consider updating your Veeam Agent for Microsoft Windows to version 6.3.2.1302.
