Recently I needed to update the NVIDIA GPU driver on my ESXi 8.0U3 hosts. Although the process looks simple (upload the new driver to the depot, edit the cluster image, and remediate the hosts), I ran into an unexpected issue while remediating the cluster:
Remediation of cluster failed
Remediation failed for Host 'gpu-esxi-01'
gpu-esxi-01 - Failed to remediate host
Remediation failed for Host 'gpu-esxi-01'
Failed to remove Component NVD-AIE-800(580.95.02-1OEM.800.1.0.20613240), files may still be in use.

Keeping in mind that the host was in maintenance mode and no VMs were running on it, the only things that could still be using the GPU driver were services such as Xorg and/or the vGPU Manager. In my case the Xorg service wasn't running, but the vGPU Manager was.
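Both services can be checked quickly from the ESXi shell. On my hosts they are controlled by the init scripts below; each status call simply reports whether the corresponding service is running:
[root@gpu-esxi-01:~] /etc/init.d/xorg status
[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon status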
Therefore, the workaround in this situation was simple (but I hope there is a better way):
1. Place the ESXi host into maintenance mode and make sure no VMs are running on it (a shell check for this is sketched after the list);
2. Enable SSH and connect;
3. Stop the vGPU manager service:
[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon status
daemon_nvdGpuMgmtDaemon is running
[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon stop
[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon status
daemon_nvdGpuMgmtDaemon is not running
4. Return to the cluster updates section and remediate only this one host (the one with the stopped vGPU Manager). This time there should be no problems installing the new driver;
5. After the remediation finishes, reboot the host (the shell equivalent is sketched after the list);
6. Take the host out of maintenance mode (also sketched after the list);
7. Repeat these steps for each host in the cluster.
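For step 1, if you want to double-check from the shell instead of the vSphere Client, the standard esxcli calls below are a minimal sketch of what I mean: the first one should report Enabled for a host in maintenance mode, and the second one lists running VM worlds (an empty list means no VMs are powered on):
[root@gpu-esxi-01:~] esxcli system maintenanceMode get
[root@gpu-esxi-01:~] esxcli vm process list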
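Steps 5 and 6 can also be done from the same SSH session. Note that esxcli system shutdown reboot requires a reason string and only works while the host is in maintenance mode; after the host comes back up, reconnect (or switch to the vSphere Client) to exit maintenance mode:
[root@gpu-esxi-01:~] esxcli system shutdown reboot --reason "NVIDIA vGPU driver update"
[root@gpu-esxi-01:~] # after the host is back up, reconnect and exit maintenance mode
[root@gpu-esxi-01:~] esxcli system maintenanceMode set --enable false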
After rebooting, you will see that the driver is updated and the host is compliant with the new cluster image.
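If you want to confirm this from the shell as well, the component/VIB listing shows the installed NVIDIA driver version, and nvidia-smi (shipped with the vGPU Manager) reports the driver that is actually loaded; the grep filter is just there to narrow the output:
[root@gpu-esxi-01:~] esxcli software component list | grep -i nvd
[root@gpu-esxi-01:~] esxcli software vib list | grep -i nvd
[root@gpu-esxi-01:~] nvidia-smi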
