
This year I took the NVIDIA-Certified Professional: AI Infrastructure exam, and in this post I want to share the materials I used for my preparation.
For those who do not know, this certification validates a candidate’s ability to deploy, configure, and validate advanced NVIDIA AI infrastructure. I think it is the right next step after passing the NVIDIA-Certified Associate: AI Infrastructure and Operations exam (unless you are a network engineer).
Now, let us briefly talk about the exam.
This exam includes about 70-75 questions (in my case 71) and has a 120-minute time limit.
The price is high, $400, but you can sometimes find a promotion online that gives a solid discount. For example, I saw a 50% discount twice: at Christmas and during a certification-related webinar in January. So stay tuned to the NVIDIA channels on X or LinkedIn.
The exam is taken remotely on the Certiverse platform, which I consider the best exam platform so far.
As always, you need:
- A laptop or PC with a single display (only one is allowed) and a stable internet connection;
- A webcam and a microphone;
- An ID that shows your first and last name in Latin characters;
- A quiet place with no one else in the room.
The landing page for the exam is located here.
Prerequisites
Although I am not sure whether the NVIDIA-Certified Associate: AI Infrastructure and Operations certification is required before taking this exam, I highly recommend starting with the Associate exam before any professional-level exam, and I have a prep guide for it on this blog.
NVIDIA recommends taking this exam when you have two to three years of operational experience working in a data center with NVIDIA hardware solutions. The candidate should be able to deploy all the parts of a data center infrastructure in support of AI workloads.
In my case, I had less than a year of overall AI infrastructure experience, but I had the opportunity to work on a large project that included most parts of a modern AI factory.
Preparation
Strong hands-on experience is the key to this exam. It expects you to know how to work with DGX and HGX hardware and infrastructure software, including all day 0/1 operations: basic configuration, hardware validation, infrastructure preparation, firmware management, testing, troubleshooting, and so on.
The first thing to start with is the Exam Preparation Guide, sometimes called the blueprint.
This document lists all the topics we need to master to pass the exam, along with recommended training and recommended reading.
The exam consists of five large sections:
- System and Server Bring-up (31%) – the “hardware” part, including knowledge of AI factory designs and topologies and physical hardware management: servers, cables, transceivers, GPUs, firmware management, and so on;
- Physical Layer Management (5%) – BlueField network platform and MIG;
- Control Plane Installation and Configuration (19%) – Base Command Manager, NVIDIA GPU and DOCA Drivers, NVIDIA Container Toolkit, and NGC;
- Cluster Test and Verification (33%) – Cluster validation and testing, including NCCL, HPL, NeMo burn-in;
- Troubleshoot and Optimize (12%) – Server optimization and troubleshooting of different components.
Based on these weights, we can expect most of the exam to cover hardware bring-up and cluster validation: the two largest sections together account for 64% of the exam.
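To get a feel for what these weights mean in practice, here is a minimal Python sketch that converts them into a rough number of questions per section. It assumes that questions are distributed in proportion to the published weights, which the blueprint does not promise; the `estimate_questions` helper is my own illustration, and 71 is simply the number of questions I got.

```python
# Section weights from the exam blueprint, in percent.
WEIGHTS = {
    "System and Server Bring-up": 31,
    "Physical Layer Management": 5,
    "Control Plane Installation and Configuration": 19,
    "Cluster Test and Verification": 33,
    "Troubleshoot and Optimize": 12,
}


def estimate_questions(total_questions: int) -> dict[str, float]:
    """Rough questions per section, ASSUMING a split proportional to the weights."""
    return {name: total_questions * pct / 100 for name, pct in WEIGHTS.items()}


if __name__ == "__main__":
    total = 71  # my exam; yours may differ (about 70-75)
    for name, count in estimate_questions(total).items():
        print(f"{name:<46} ~{count:.0f} questions")
    # Average time available per question with the 120-minute limit.
    print(f"Minutes per question: {120 / total:.1f}")
```

With a 120-minute limit this works out to well over a minute and a half per question, so there is time to think through the longer scenario questions.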
Recommended training
If you have the opportunity, I highly recommend taking the AI Infrastructure Professional Public Training or even the AI Infrastructure and Operations Professional Public Training.
Both trainings are packed with content and labs that will help you prepare for the exam. If you check the study guide, you will see that each section of the exam is mapped to the course topics.
Unfortunately, I did not have this option, but I used NVIDIA Academy with an active subscription and took a few courses, which were useful in preparation:
I had no experience with BlueField and BCM, but the two basic courses were enough to fill the gaps for both products.
In addition, if you have an active subscription, I can recommend taking the InfiniBand Essentials and InfiniBand Network Administration courses.
Recommended reading
There are many materials with thousands of pages, and reading everything can take forever 🙂 For example, the BCM manual includes about 1000 pages. You do not need to read each page! Read the pages related to the exam objectives.
First, check the Preparation Guide; it contains links to the recommended reading for each exam topic, usually pointing to the specific page in the documentation.
Below is the documentation I used in preparation.
High-level design:
- NVIDIA Enterprise AI Factory – Design Guide;
- Choosing the Right Storage for Enterprise AI Workloads;
- Setting the InfiniBand Cluster Topology.
Base Command Manager:
- NVIDIA Base Command Manager;
- NVIDIA Base Command Manager 11 Administrator Manual – again, you do not need to read the whole document. Focus on the exam objectives.
DGX Systems:
- Best Practices for DGX;
- NVIDIA DGX H100/H200 User Guide;
- DGX H100/H200 Firmware Update Guide;
- NVIDIA DGX H100/H200 Service Manual.
DPU and Cabling:
- NVIDIA BlueField Platform;
- NVIDIA BlueField-3 DPU Controller User Manual;
- NVIDIA DGX SuperPOD: Cabling Data Centers Design Guide;
- NVIDIA LinkX Cables and Transceivers.
Monitoring and managing tools:
- NVIDIA System Management (NVSM) User Guide;
- NVIDIA DCGM;
- NVIDIA DCGM User Guide;
- nvidia-smi man page.
Validating and Testing:
- NCCL;
- NCCL-Tests;
- NVIDIA HPL Benchmark.
Running applications:
- NVIDIA Container Toolkit;
- NGC CLI.
This is all the documentation I used while preparing, but I am not sure that reading these docs alone would be enough to pass the exam. As I wrote before, hands-on experience and strong hardware and server knowledge are the keys, but the documentation is great support for learning what you do not know. Focus on the exam objectives.
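As one example of the kind of detail worth picking up from the docs, the NCCL-Tests performance notes explain how the reported bus bandwidth relates to the algorithm bandwidth. For all-reduce, bus bandwidth is the algorithm bandwidth multiplied by 2(n-1)/n, where n is the number of ranks. Below is a minimal Python sketch of that formula. The function name and the sample numbers are made up for illustration and are not measurements from a real cluster.

```python
def allreduce_bandwidth(size_bytes: int, time_s: float, n_ranks: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for one all-reduce operation.

    algbw = size / time
    busbw = algbw * 2 * (n - 1) / n   (all-reduce correction factor)
    GB here means 1e9 bytes.
    """
    if time_s <= 0:
        raise ValueError("time_s must be positive")
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    algbw = size_bytes / time_s / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


if __name__ == "__main__":
    # Hypothetical numbers: 1 GB reduced across 8 GPUs in 10 ms.
    algbw, busbw = allreduce_bandwidth(1_000_000_000, 0.01, 8)
    print(f"algbw = {algbw:.1f} GB/s, busbw = {busbw:.1f} GB/s")
```

The point of the correction factor is that bus bandwidth can be compared against the hardware link speed regardless of how many ranks take part, which is what you want when validating a cluster.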
In conclusion
This is all I want to share regarding this exam. In my experience, this exam is fair; the questions are well structured, and even if you sometimes do not know the answer, you can work it out using simple logic.
This exam will not be that hard if you have experience with GPU hardware, especially with DGX/HGX servers, and if you know how to bring the cluster “UP” or, at a minimum, know which tools you would use to do that.
The last piece of advice is to be familiar with the CLI of utilities mentioned in the exam objectives.
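If you have access to a lab system, a quick way to see which of these utilities you can actually practice with is to check what is on your `PATH`. The Python sketch below does only that. The list of executable names is my own selection based on the tools in the reading list (`nvidia-smi`, `dcgmi` for DCGM, `nvsm`, `nvidia-ctk` for the Container Toolkit, `ngc`, and `cmsh` for Base Command Manager); what is installed, and where, depends on your system, and the `find_tools` helper is hypothetical.

```python
import shutil

# Executable names for the utilities in the reading list (my selection).
TOOLS = ["nvidia-smi", "dcgmi", "nvsm", "nvidia-ctk", "ngc", "cmsh"]


def find_tools(names=TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:<12} {path or 'not found on PATH'}")
```

For every tool that shows up, spend some time with its built-in help and the subcommands covered in the documentation above.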
I wish everyone taking this exam good luck!