Using Prometheus snmp-exporter

Recently, I was asked to create a dashboard in Grafana displaying the power usage of a Sentry3 Smart PDU.

Although most modern systems have a metrics page or a dedicated exporter that Prometheus can simply scrape, many older systems and devices still answer only SNMP queries and expose no metrics or exporter.

In this situation, we can use the Prometheus SNMP exporter, which acts as a proxy between Prometheus and the SNMP device: it exposes metrics to the Prometheus server based on the SNMP queries it sends to the requested device.
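
For a quick illustration of the proxy idea, you can query the exporter manually (a sketch: 9116 is the exporter's default port, while the PDU address and module name here are placeholders that depend on your snmp.yml):

# Ask the exporter to poll the device over SNMP and return Prometheus metrics
curl "http://localhost:9116/snmp?target=192.168.1.10&module=sentry3"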

Continue reading “Using Prometheus snmp-exporter”


What’s new at Nutanix University? New Nutanix Certified Professional – Network and Security certification

Good news for all Flow users: Nutanix has announced a new certification and training related to this product.

First is a new course called Nutanix Network and Security Administration (NNSA), available for free at Nutanix University. In this course, you will dive into the concepts of Flow Virtual Networking and Flow Network Security, which is a great way to start preparing for the brand-new NCP certification.

The Nutanix Certified Professional – Network and Security (NCP‑NS) 6.10 exam validates your ability to deploy, manage, and troubleshoot network virtualization and network security using Nutanix Flow.

As always, you can book the exam for free using the NCPNS610BETA voucher during checkout. The voucher is valid only for the first 250 participants, and the last day to take this exam is March 1st. So, hurry!

You can schedule this exam from the Nutanix University Certifications page.

For more information, please check the official announcement.


Preparing for the Nutanix Certified Professional – Artificial Intelligence 6.10 (NCP-AI 6.10) Exam

Last year, I took the Nutanix Certified Professional – Artificial Intelligence 6.10 exam, and in this article, I want to share a few tips that helped me prepare for it.

For those who have not heard about this certification, NCP-AI measures your ability to install, configure, optimize, and troubleshoot Nutanix Enterprise AI (NAI), as well as to integrate GenAI applications and agents with NAI.

In this article, we will look at the main topics of the Nutanix Certified Professional – Artificial Intelligence (NCP-AI) exam, version 6.10, and I will share materials you can use to prepare.

This guide describes the methodology I use when preparing for certification exams, and it can be applied to other Nutanix exams as well.

Continue reading “Preparing for the Nutanix Certified Professional – Artificial Intelligence 6.10 (NCP-AI 6.10) Exam”


Deploying VCF Operations for Logs 9

VCF Operations for Logs (formerly Log Insight) is a powerful tool that allows us to collect, retain, and explore logs from VMware Cloud Foundation components, and it is one of the critical pieces of any virtual infrastructure.

In this short article, we will deploy VCF Operations for Logs and configure vCenter Server, ESXi hosts, and VCF Operations to send logs to the newly deployed appliance.
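
As a quick preview, pointing an ESXi host at a syslog target from the shell looks roughly like this (the appliance hostname and port are placeholders; the same setting is also available in the vSphere Client):

# Send host logs to the new appliance over UDP, then apply the change
esxcli system syslog config set --loghost='udp://vcf-ops-logs.example.com:514'
esxcli system syslog reload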

Continue reading “Deploying VCF Operations for Logs 9”


Nutanix AOS 7.5, AHV 11.0, and Prism Central 7.5 were released

Nutanix has released new versions of its products, including AOS 7.5, AHV 11.0, and Prism Central pc.7.5.

This is a really large release, so click “Read More” to see all the features that were added.

Continue reading “Nutanix AOS 7.5, AHV 11.0, and Prism Central 7.5 were released”


Using Veeam Infrastructure Appliance v13

Previously, I wrote about the newest Veeam feature, the Veeam Software Appliance: a pre-built and pre-hardened image containing all the packages needed to run a fully functional backup server on Linux.

In this article, we will look at the Veeam Infrastructure Appliance: an infrastructure component also based on Veeam JeOS, which can hold different roles such as Proxy, Repository (including Hardened), Mount Server, and so on. Do you remember the Veeam Hardened Repository ISO? This release is a massive evolution of that idea.

In addition, you can consider this article a short walkthrough of the basic Veeam Software Appliance Web UI.

Continue reading “Using Veeam Infrastructure Appliance v13”


Quick Fix: Updating ESXi GPU driver with vLCM – files may still be in use

Recently, I needed to update the NVIDIA GPU driver on my ESXi 8.0U3 hosts. Although the process looks simple (upload the new driver to the depot, edit the image, and update the hosts), I faced an unexpected issue during cluster remediation:

Remediation of cluster failed
 Remediation failed for Host 'gpu-esxi-01'
gpu-esxi-01 - Failed to remediate host
 Remediation failed for Host 'gpu-esxi-01'
 Failed to remove Component NVD-AIE-800(580.95.02-1OEM.800.1.0.20613240), files may still be in use.

Keeping in mind that the host is in maintenance mode and there are no VMs running on it, the only thing still using the GPU driver can be a service such as Xorg and/or the vGPU Manager. In my case, the Xorg service wasn’t running, but the vGPU Manager was.

Therefore, the workaround in this situation was simple (but I hope there is a better way):
1. Place the ESXi host into maintenance mode and make sure no VMs are running on it;
2. Enable SSH and connect;
3. Stop the vGPU manager service:

[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon status
daemon_nvdGpuMgmtDaemon is running

[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon stop

[root@gpu-esxi-01:~] /etc/init.d/nvdGpuMgmtDaemon status
daemon_nvdGpuMgmtDaemon is not running

4. Return to the cluster update section and remediate only this host (the one with the stopped vGPU Manager). This time, the new driver should install without any problems;
5. After the remediation finishes, reboot the host;
6. Take the host out of maintenance mode;
7. Repeat these steps for each remaining host.

After rebooting, you will see that the driver is updated and the host is compliant with the new cluster image.
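
If you want to double-check from the shell while SSH is still enabled, you can list the installed NVIDIA component (a quick sanity check; the component name NVD-AIE-800 is from my environment and may differ in yours):

# Confirm the new driver version is installed on the host
esxcli software vib list | grep -i nvd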


Quick Fix: Using NCCL with multi-vGPU VMware VMs

If you’re using a virtual machine with multiple vGPUs and considering the NVIDIA Collective Communications Library (NCCL) to implement multi-GPU communication, you may face an error like this during an nccl-tests run:

Test NCCL failure common.cu:1279 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. ollama-cb-01 pid 1598: Test failure common.cu:1100

In the detailed log we can see errors like:

init.cc:491 NCCL WARN Cuda failure 'operation not supported'
...
init.cc:491 NCCL WARN Cuda failure 'operation not supported'
...
Test NCCL failure common.cu:1279 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
...

One common reason for this issue in a VMware environment is UVM (Unified Memory), which is disabled by default in the virtual machine.

To enable UVM, power off the VM and add advanced configuration parameters, one per vGPU attached to it:

pciPassthru0.cfg.enable_uvm = 1
pciPassthru1.cfg.enable_uvm = 1

The example above is for two vGPUs. For four vGPUs, you need four such parameters (and so on):

pciPassthru0.cfg.enable_uvm = 1
pciPassthru1.cfg.enable_uvm = 1
pciPassthru2.cfg.enable_uvm = 1
pciPassthru3.cfg.enable_uvm = 1

Thereafter, power on the VM, and the NCCL test will likely pass.
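
For reference, a typical test run looks like this (a sketch assuming the official nccl-tests suite is built inside the VM; the byte range and GPU count here are illustrative):

# Build the tests and run an all-reduce benchmark across 2 GPUs
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2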

Keep in mind that enabling this feature will prevent future live vMotion of the VM, failing with the following error:

A required migration feature is not supported on the "Source" host 'esxi-01'.
vGPU migration is not supported on this VM.

Another problem that can prevent the NCCL test from passing is broken peer-to-peer (P2P) vGPU communication. To check it, you can run p2pBandwidthLatencyTest from the NVIDIA cuda-samples package. If you have a P2P problem, you will see something like this in the output:

Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

If everything is OK, we will see the following in the log:

Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
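
If you want to reproduce the check, here is a rough sketch (the path inside the cuda-samples repository and the build system vary between releases):

# Build and run the P2P bandwidth/latency test inside the VM
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest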

I faced this issue with the drivers from the NVIDIA AI Enterprise 7.1 package (580.95). The solution in my case was to update the ESXi and VM drivers to version 580.105 (the AI Enterprise 7.3 package).


Veeam 13.0.1 GA

Veeam just released version 13.0.1, and we can consider this edition a full v13 launch for all Veeam products, including the Backup Server on Windows. Previously, most of the v13 features were available only on the recently launched Veeam Software Appliance.

Download the newest versions or read the release notes on the portal.

What’s new in v13? There is the official (and extra-large) document, as well as my short (but not complete) version.

If you haven’t heard about the Veeam Software Appliance (a Linux-based Backup Server) before, check out my walkthrough article.

Keep in mind: if you are using a Backup Server on Windows and want to update to version 13.0.1, the Backup Server must already be running version 12.3.1 or later. Follow this KB for more information.


Using NVIDIA vGPUs with VMware vSphere

Today, AI is everywhere, and everyone wants a VM with a GPU to deploy, test, or play with AI/ML. Although it is not a problem to pass a whole GPU to a VM as a PCI device, sometimes that is overkill.

For example, running a small model requires only a small amount of GPU memory, while our server is equipped with modern NVIDIA H200, B200, or even B300 GPUs that have a large amount of memory.

This is where vGPU comes into play: it allows us to divide a GPU into smaller pieces and share it among multiple VMs on the host.

In this article, we will focus on how to configure ESXi hosts to run VMs with vGPU support in vSphere 8.0 Update 3.
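
As a small preview, once the NVIDIA vGPU Manager is installed on the host, you can explore the available profiles from the ESXi shell (a sketch; the output depends on your GPU model):

# List the vGPU types the installed GPUs support, and those that can still be created
nvidia-smi vgpu -s
nvidia-smi vgpu -c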

Continue reading “Using NVIDIA vGPUs with VMware vSphere”
