NVIDIA NCP-AII Practice Test 3
Question 1 of 60
1. Question
An AI infrastructure specialist is installing a third-party storage solution for an AI cluster. During the initial configuration, the specialist needs to optimize the storage for high-throughput GPUDirect Storage (GDS) access. Which set of parameters must be verified on the server side to ensure that the physical GPUs can communicate directly with the storage subsystem without bypassing the CPU unnecessarily?
Correct: B. Enable IOMMU and VT-d in the BIOS, verify that PCIe peer-to-peer (P2P) traffic is supported by the PCIe switch topology, and install the nvidia-fs driver.
The Technical Reason:
- IOMMU/VT-d: While older GDS documentation occasionally suggested disabling the IOMMU to simplify P2P, modern AI infrastructure (especially virtualized or multi-tenant environments) requires the IOMMU (AMD) or VT-d (Intel) to be enabled and correctly configured to manage memory address translation for I/O devices.
- PCIe P2P topology: GDS requires that the storage controller (NVMe) and the GPU reside on a compatible PCIe path. PCIe Access Control Services (ACS) must often be disabled at the switch level to allow P2P traffic to move "east-west" between devices rather than being forced "north" to the CPU root complex.
- nvidia-fs driver: This is the critical kernel-mode driver (nvidia-fs.ko) that enables the cuFile API. It coordinates the Linux virtual file system (VFS) with GPU memory, allowing the direct DMA path to be established.
The NCP-AII Context: The exam validates your ability to use the gdscheck tool. A successful validation requires the platform to support P2P and the nvidia-fs driver to be active.
Incorrect: A. Disable all PCIe slots except the primary storage controller
Disabling PCIe slots would prevent the system from seeing the GPUs or the high-speed network cards (ConnectX) required for the cluster. While IRQ (interrupt request) management is a legacy tuning concept, it is handled automatically by modern MSI-X (Message Signaled Interrupts) and does not require disabling hardware.
C. Set the GPU clock speed to the maximum possible offset
GPU clock offsets (overclocking) affect core compute speed (FLOPS) but have no impact on the I/O path or the memory controller's ability to handle storage interrupts. In fact, GDS is designed to reduce the number of interrupts the CPU has to handle, not to make the GPU "handle" them.
D. Configure the storage array to use RAID 5 with a small stripe size
RAID 5 is generally discouraged for high-performance AI write workloads (like checkpointing) due to the "parity write penalty." Furthermore, a small stripe size can lead to fragmented I/O, which is inefficient for GDS. GDS performs best with large, aligned I/O transfers (typically 4 KB or larger) that match the GPU's memory page alignment.
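The server-side checks described above can be sketched as a short command sequence. This is illustrative only: the gdscheck path assumes a standard CUDA/GDS install, and meaningful output requires GDS-capable hardware.

```shell
# Confirm the IOMMU / VT-d is active (look for DMAR or AMD-Vi lines)
sudo dmesg | grep -iE 'iommu|dmar|amd-vi'

# Confirm the nvidia-fs kernel module is loaded
lsmod | grep nvidia_fs

# Inspect the PCIe topology: the GPU and NVMe device should share a switch path
nvidia-smi topo -m

# Run the GDS platform check; -p prints the platform support summary
/usr/local/cuda/gds/tools/gdscheck -p
```

If gdscheck reports P2P as unsupported, revisit the BIOS ACS settings and the PCIe placement of the NVMe controller before retesting.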
Question 2 of 60
2. Question
A cluster administrator is setting up the software environment and needs to install the NVIDIA Container Toolkit. Which command-line sequence correctly demonstrates how to configure the repository and install the toolkit on an Ubuntu-based system to enable Docker to utilize GPUs?
Correct: B. The administrator must add the GPG key and repository using curl and gpg, update the package list, and then run apt-get install nvidia-container-toolkit.
This is correct because the NVIDIA Container Toolkit installation on Ubuntu requires configuring the NVIDIA repository before installation can proceed.
The official installation sequence follows this exact pattern:
First, download and add the GPG key using curl and gpg --dearmor to establish package signing trust.
Second, configure the repository by creating a .list file in /etc/apt/sources.list.d/.
Third, run sudo apt-get update to refresh the package index.
Finally, install the toolkit with sudo apt-get install -y nvidia-container-toolkit.
This sequence is documented in NVIDIA's official installation guide and is required because the nvidia-container-toolkit package is not available in the default Ubuntu repositories.
The Tencent Cloud documentation likewise confirms that after adding the repository and GPG key, updating the package list, and installing, the toolkit is successfully installed.
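Putting those steps together, the sequence looks roughly like the following. The repository URLs shown match NVIDIA's current published instructions, but verify them against the official install guide before running, and note that the final two commands (wiring the toolkit into Docker) are an assumption about the Docker runtime being in use.

```shell
# 1. Add the GPG key for package signing trust
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

# 2. Add the repository .list file, referencing the keyring
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# 3. Refresh the package index and 4. install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```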
Incorrect: A. The toolkit is part of the standard Linux kernel-headers package and does not require a separate installation or repository configuration.
This is incorrect because the NVIDIA Container Toolkit is a separate software package that must be explicitly installed after adding NVIDIA's repository. The kernel-headers package is unrelated to container GPU support, and multiple sources confirm the toolkit requires dedicated repository configuration.
C. The administrator needs to download a Windows .exe installer from the NGC website and run it using the wine compatibility layer on the Linux compute nodes.
This is incorrect because Linux systems use native package management (apt, yum, dnf) for software installation, not Windows executables. The NVIDIA Container Toolkit is distributed through Linux repositories as .deb or .rpm packages. Using Wine to run Windows installers on server nodes would be impractical and unsupported for infrastructure software.
D. The administrator should use apt-get install nvidia-gpu-magic and then restart the server three times to calibrate the container drivers.
This is incorrect because there is no package named nvidia-gpu-magic in any NVIDIA repository. The actual package name is nvidia-container-toolkit. The reference to restarting the server three times for calibration is fabricated and has no basis in actual installation procedures.
Question 3 of 60
3. Question
A team is preparing to run a large-scale NeMo burn-in test to validate the stability of an AI factory. What is the primary objective of this specific test in the context of cluster verification?
Correct: C. To subject the cluster to a sustained, high-load workload that simulates real-world AI training to ensure thermal and power stability over time.
The NCP-AII certification blueprint explicitly lists "Perform NeMo™ burn-in" as a core task within the Cluster Test and Verification domain, which comprises 33% of the examination.
The official NVIDIA certification documentation defines the Cluster Test and Verification domain as encompassing activities such as single-node stress tests, HPL burn-in, NCCL burn-in, and NeMo burn-in, all aimed at ensuring the cluster can handle production workloads.
The GitHub benchmarking repository confirms that NeMo workloads are used to evaluate performance for large-scale AI use cases, including pre-training, with recipes designed to run at various cluster scales.
The NVIDIA developer blog emphasizes that evaluating real-world AI workload performance requires assessing the entire platform, including infrastructure, software frameworks, and application-level enhancements, not just raw GPU metrics.
The NeMo burn-in test validates the cluster under sustained, high-load conditions that simulate actual AI training patterns, confirming that thermal management and power delivery remain stable over extended periods.
Incorrect: A. To benchmark the sequential read speed of the third-party storage array using the NeMo dataset as a test file.
This is incorrect because storage performance testing is a separate task explicitly listed in the exam blueprint under "Test storage". NeMo burn-in focuses on end-to-end AI training workload validation, not isolated storage benchmarking.
B. To train a production-ready chatbot for the customer using the entire cluster's compute capacity for several weeks.
This is incorrect because NeMo burn-in is a verification test, not a production training run. The purpose is to validate cluster stability before production workloads begin, not to deliver a trained model. Production training occurs after successful cluster verification.
D. To verify that the NGC CLI can download the NeMo container onto every node in less than ten seconds.
This is incorrect because NGC CLI installation and container downloads are part of the Control Plane Installation and Configuration domain. NeMo burn-in validates the cluster under sustained AI workload conditions, not container download speeds.
Question 4 of 60
4. Question
When updating NVIDIA GPU drivers on a production cluster managed by Base Command Manager, what is the recommended procedure to ensure the new drivers are correctly applied to all compute nodes without causing job failures or system inconsistency across the factory?
Correct: C. Update the software image or category in BCM, then use the node update command to synchronize the nodes after draining them of jobs.
The Technical Reason: BCM uses a "single source of truth" model via software images.
- The image path: Updates are performed within the software image (usually located in /cm/images/) using a chroot environment. This ensures the head node's own drivers remain untouched and stable.
- Orchestration (drain): In a production environment, you must first "drain" the nodes in the workload manager (Slurm). This allows running jobs to complete while preventing new ones from starting, avoiding the XID 43 (GPU driver/firmware mismatch) errors that occur if a driver is swapped mid-calculation.
- Synchronization: Once the image is updated and the node is idle, BCM's synchronization tools (such as imageupdate, or a full reboot) push the new kernel modules and libraries to the compute nodes.
The NCP-AII Context: The exam validates your ability to maintain cluster consistency. Option C follows the professional lifecycle: modify image → drain nodes → sync/reboot.
Incorrect: A. Uninstall using apt-get purge under 100% load
NVIDIA GPU drivers are not hot-swappable while the GPU is actively executing kernels. Attempting to remove a driver module while the GPU is at 100% load will cause a kernel panic or a "stuck" driver state, leading to a system crash and potential data corruption for the running job.
B. Push the driver as a container image
While AI applications (like PyTorch) run in containers, the NVIDIA GPU driver must be installed on the host OS kernel. Containers rely on the host's driver to communicate with the hardware. You cannot "run a driver in the background" of a user job as a container image to update the system's hardware interface.
D. Run the .run installer on the head node and propagate via SSH
Using the .run (runfile) installer is generally discouraged in managed clusters because it bypasses the package manager (RPM/DEB) and BCM's tracking. Furthermore, the head node should never automatically "propagate" a driver update to active nodes via SSH during a workday, as this would cause immediate failure of any GPU-accelerated jobs currently running on those nodes.
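A drain-update-sync cycle following that lifecycle might look roughly like this. The node name (node001), image name (default-image), and driver package version are placeholders, and exact cmsh syntax varies by BCM release, so treat this as a sketch rather than a recipe.

```shell
# Drain the node in Slurm so running jobs finish and no new ones start
scontrol update NodeName=node001 State=DRAIN Reason="GPU driver update"

# Update the driver inside the software image on the head node,
# leaving the head node's own driver untouched
chroot /cm/images/default-image apt-get update
chroot /cm/images/default-image apt-get install -y nvidia-driver-535

# Once the node is idle, push the updated image to it via cmsh
cmsh -c "device use node001; imageupdate -w"

# Reboot so the new kernel modules load, then return the node to service
cmsh -c "device use node001; reboot"
scontrol update NodeName=node001 State=RESUME
```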
Question 5 of 60
5. Question
During the initial configuration of third-party storage for an AI factory, the administrator must ensure that the storage fabric is properly integrated with the compute nodes. What is a critical requirement for configuring the initial parameters of the storage system to support high-throughput NVIDIA GPUDirect Storage (GDS) operations?
Correct: B Enable RDMA support on the storage controllers and ensure the storage network is on the same subnet as the GPU-to-GPU compute fabric for direct data paths.
The Technical Reason: GDS relies on Remote Direct Memory Access (RDMA) to move data directly from the storage controller/NIC to the GPU memory.
RDMA Requirement: Without RDMA (specifically RoCE or InfiniBand), the data must be buffered through the CPU, which creates a bottleneck and increases latency.
Subnet/Fabric Alignment: For the most efficient "direct" path, the storage traffic should ideally exist on the same high-speed fabric (InfiniBand or high-speed Ethernet) as the compute nodes. This ensures that the ConnectX-7 adapters can facilitate peer-to-peer transfers without complex routing or protocol translations that would disable GDS offloads.
The NCP-AII Context: The certification emphasizes the "Data Plane" vs. the "Management Plane." GDS is the ultimate data plane optimization. You are expected to know that the storage solution must support the mlnx_ofed drivers and RDMA to participate in the GDS ecosystem.
Incorrect Options: A. Configure the storage to use standard NFS v3 Standard NFS v3 is a legacy protocol that does not natively support RDMA or the specialized IOCTL calls required for GPUDirect Storage. While highly compatible, it would force all data through the CPU's "bounce buffers," negating the performance benefits of an AI factory. NVIDIA-certified storage typically requires NFS over RDMA or specialized parallel filesystems (like Lustre, Weka, or BeeGFS).
C. Disable PCIe peer-to-peer communication in BIOS This is the opposite of what is required. PCIe Peer-to-Peer (P2P) communication is the foundational hardware mechanism that allows a NIC to write directly to a GPU's memory. If P2P is disabled in the BIOS, GDS will fail, and the system will fall back to slow CPU-based copying. This setting is often found as "ACS" (Access Control Services) or "P2P" in the BIOS.
D. Set storage LUNs to be managed by the TPM The TPM (Trusted Platform Module) is used for hardware-level security, such as storing encryption keys for disk-at-rest or verifying boot integrity. It is not designed to manage high-throughput storage LUNs or encrypt "data-in-flight" at the speeds required for AI (hundreds of gigabits per second). Encryption of data-in-flight for GDS is typically handled by the ConnectX-7 hardware engines or specialized DOCA services, not the TPM.
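The three server-side prerequisites above (RDMA, PCIe P2P, a loaded nvidia-fs driver) can be combined into one readiness verdict. The probe commands in the comments (`lsmod`, `gdscheck`) are the real-world sources of these flags; here they are passed in as yes/no arguments so the logic is self-contained (the `gdscheck` path is the common CUDA install location and may vary):

```shell
#!/bin/sh
# Sketch: combine the three GDS prerequisites into a single verdict.
# On a real node the inputs would come from, e.g.:
#   lsmod | grep nvidia_fs                   # nvidia-fs driver loaded?
#   /usr/local/cuda/gds/tools/gdscheck -p    # platform/P2P support report

gds_ready() {
    # $1=rdma_enabled  $2=p2p_supported  $3=nvidia_fs_loaded  (yes/no each)
    if [ "$1" = yes ] && [ "$2" = yes ] && [ "$3" = yes ]; then
        echo "GDS ready: direct DMA path available"
    else
        echo "GDS not ready: rdma=$1 p2p=$2 nvidia-fs=$3 (falls back to CPU bounce buffers)"
    fi
}

gds_ready yes yes yes
gds_ready yes no yes
```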
Question 6 of 60
6. Question
A user wants to run a multi-node AI training job using the srun command with the Pyxis and Enroot stack. The command fails with an error indicating that the container image cannot be found. Which component of the control plane is responsible for pulling the image from a registry and converting it into a runtime format on the compute nodes?
Correct
Correct: A The Enroot runtime, which is triggered by the Pyxis plugin to download, unpack, and create the container filesystem on the local compute node.
The Technical Reason: In an NVIDIA-optimized cluster, the process follows a specific hand-off:
Pyxis (The Trigger): As a Slurm plugin, Pyxis intercepts the --container-image flag from the srun command. It communicates the image requirement to the runtime.
Enroot (The Worker): Enroot is the component that actually performs the "heavy lifting." It interfaces with the container registry (like NVIDIA NGC), downloads the layers, and converts them into a SquashFS file. This format is highly optimized for high-performance computing (HPC) environments because it allows for fast, unprivileged mounting of the filesystem on the compute node.
The Error: If the command fails with an image-not-found error, it means Enroot was unable to locate the specified tag in the registry or the local cache.
The NCP-AII Context: The certification emphasizes that Enroot is the runtime replacement for Docker in HPC environments, while Pyxis is the integration layer for Slurm.
Incorrect Options: B. The DOCA driver and DPU hardware acceleration DOCA (Data Center Infrastructure-on-a-Chip Architecture) and BlueField DPUs are used to offload networking, security, and storage tasks (like encryption or telemetry). While they manage the data path, they are not responsible for the high-level logic of pulling and unpacking container images, which is a filesystem and application-layer task managed by Enroot.
C. The NGC CLI manually run on every node The NGC CLI is a tool used by administrators or users to manually interact with the NVIDIA GPU Cloud. While it can be used to pull images, the "Pyxis and Enroot stack" is designed to automate this process. A user does not need to manually pull images to every node; the stack handles it dynamically at job submission time.
D. The Slurm database (slurmdbd) The Slurm database is responsible for accounting, job history, and user associations. It stores metadata about jobs (who ran what and when), but it does not store binary data like container images. Storing multi-gigabyte images in a SQL database would be inefficient and is not supported by the Slurm architecture.
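The Pyxis/Enroot hand-off above typically looks like the command sketch below. The image tag, node counts, and script name are illustrative, not from the source; the `registry#path:tag` form (with `#` separating registry and image path) is the convention Pyxis uses for the --container-image flag:

```shell
#!/bin/sh
# Sketch: how a Pyxis-enabled srun submission is typically formed.
# Image reference, node counts, and train.py are illustrative placeholders.

IMAGE='nvcr.io#nvidia/pytorch:24.05-py3'   # Pyxis: '#' between registry and path

# Pyxis intercepts --container-image and hands the pull/unpack to Enroot
# on each allocated compute node at job start.
CMD="srun --nodes=2 --ntasks-per-node=8 --container-image=${IMAGE} python train.py"
echo "$CMD"

# Enroot can also pre-stage an image manually to warm the local cache:
#   enroot import docker://nvcr.io/nvidia/pytorch:24.05-py3
```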
Question 7 of 60
7. Question
When designing the network topology for a large-scale AI factory utilizing NVIDIA Quantum-2 InfiniBand switches and HGX nodes, what is the primary consideration for selecting cable types and transceivers to ensure minimal latency and maximum throughput for East-West traffic?
Correct
Correct: A Validate that the transceivers match the port speed of the ConnectX-7 adapters and ensure that cable lengths do not exceed the signal attenuation limits for NDR 400Gb/s.
The Technical Reason:
Transceiver Matching: NVIDIA Quantum-2 switches use OSFP connectors, while ConnectX-7 adapters may use OSFP or QSFP112. Ensuring the form factors and speeds (NDR vs. NDR200) match is critical for link initialization.
Attenuation Limits: NDR InfiniBand is extremely sensitive to distance. For example, Passive Direct Attach Copper (DAC) is limited to 3 meters, and Linear Active Copper (LAC/ACC) to 5 meters. Exceeding these lengths without moving to optical solutions leads to high Bit Error Rates (BER) or total link failure.
The NCP-AII Context: The exam validates your ability to perform "Physical Layer Management." You are expected to know that for 400G, even a minor deviation in cable length or a mismatched transceiver form factor ("finned" vs. "flat" top) can prevent a node from reaching its theoretical peak bandwidth.
Incorrect Options: B. Use AOCs for all intra-rack and DACs for all inter-rack This is logically reversed. DACs (Direct Attach Copper) are the preferred choice for intra-rack (short distance) connections because they offer the lowest possible latency (no optical conversion) and lowest power consumption. AOCs or transceivers with fiber are used for inter-rack (longer distance) connections where copper's signal attenuation would be too high.
C. Standardize on Cat6e and use Twinax for power reduction Cat6e is an Ethernet standard used for 1GbE or 10GbE management networks; it is physically and electrically incompatible with 400G InfiniBand. While Twinax (DAC) does reduce power, standardizing the entire management network on Cat6e does not address the primary throughput or latency needs of the high-speed AI data fabric (East-West traffic).
D. Prioritize Single-Mode Fiber for short-range intra-rack Single-Mode Fiber (SMF) is designed for long-range communication (up to 2km or more). Using it for short-range intra-rack connections is unnecessarily expensive and complex. Multi-Mode Fiber (MMF) or copper is the standard for short-range connections in an AI factory due to lower transceiver costs and sufficient performance for distances under 50 meters.
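The length limits cited above (passive DAC to roughly 3 m, active copper to roughly 5 m, optics beyond) can be encoded as a small selection helper. This is a sketch of the decision logic only; real deployments must check the specific cable's datasheet:

```shell
#!/bin/sh
# Sketch: pick an NDR 400Gb/s cable type from the run length, using the
# attenuation limits cited above (passive DAC <= 3 m, ACC <= 5 m).

ndr_cable_for_length() {
    # $1 = cable run in whole meters
    if [ "$1" -le 3 ]; then
        echo "passive DAC (lowest latency and power, intra-rack)"
    elif [ "$1" -le 5 ]; then
        echo "linear active copper (ACC)"
    else
        echo "optical (AOC or transceiver + fiber)"
    fi
}

ndr_cable_for_length 2
ndr_cable_for_length 10
```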
Question 8 of 60
8. Question
During the operation of an AI factory, a server reports a GPU fault. The administrator suspects a hardware failure. Which sequence of troubleshooting steps is most appropriate for identifying if the GPU is indeed faulty and requires replacement according to NVIDIA best practices?
Correct
Correct: C Check the DCGM (Data Center GPU Manager) logs for XID errors, run the nvidia-smi -q command to check for retired pages, and execute a field diagnostic test.
The Technical Reason: This follows the standardized NVIDIA diagnostic hierarchy:
XID Errors: These are error reports printed to the system logs (dmesg or journalctl) by the NVIDIA driver. Specific codes (like XID 61 or XID 63) provide immediate insight into whether the issue is a thermal trip, a memory controller error, or a physical link failure.
Retired Pages: High-end GPUs use ECC (Error Correction Code) memory. If specific memory cells fail, the driver "retires" those pages to prevent further errors. A high or increasing count of retired pages (viewable via nvidia-smi -q) is a primary indicator of a hardware memory defect.
Field Diagnostics: Tools like NVVS (NVIDIA Validation Suite), part of DCGM, perform stress tests (such as the sm_stress and diagnostic plugins) to confirm whether the hardware can still operate under load.
The NCP-AII Context: The exam validates your ability to use DCGM for health monitoring. You are expected to know that a hardware replacement (RMA) typically requires the output of these specific diagnostic logs as proof of failure.
Incorrect Options: A. Delete all data and reinstall drivers This is a "scorched earth" approach that does not diagnose the hardware. While corrupted datasets can cause application crashes, they do not cause hardware faults reported by the GPU firmware. Reinstalling drivers might fix a software conflict, but deleting the entire storage array is irrelevant to the GPU's hardware status.
B. Replace the CPU first Modern AI servers use highly integrated PCIe topologies. While a CPU failure could theoretically impact the PCIe bus, it is extremely rare for a CPU issue to manifest as a specific GPU hardware fault. Replacing a multi-thousand-dollar CPU as a first step is inefficient and not aligned with NVIDIA's component-level troubleshooting best practices.
D. Restart and increase GPU clock speed Restarting may temporarily clear a hung state, but it does not identify the root cause. Furthermore, increasing the clock speed (overclocking) of a suspected faulty GPU is dangerous; it increases heat and electrical stress, which is likely to worsen a hardware defect rather than "overcome resistance."
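The XID triage step above can be sketched as a small log filter. The sample log lines are synthetic but follow the NVRM `Xid (...)` format the driver emits; on a real node the input would come from `dmesg` or `journalctl -k`, followed by `nvidia-smi -q -d PAGE_RETIREMENT` and a `dcgmi diag` run:

```shell
#!/bin/sh
# Sketch: extract unique NVIDIA Xid error codes from kernel log lines.
# Real-node pipeline would be:  dmesg -T | grep -i xid | ...
# Follow-ups: nvidia-smi -q -d PAGE_RETIREMENT ; dcgmi diag -r 3

extract_xids() {
    # Match "Xid (<bus id>): <code>" and keep only the numeric code.
    grep -o 'Xid ([^)]*): [0-9]*' | awk '{print $NF}' | sort -u
}

# Synthetic sample input (format mirrors real NVRM driver messages):
SAMPLE='[Tue] NVRM: Xid (PCI:0000:17:00): 63, Row remapper activated
[Tue] NVRM: Xid (PCI:0000:65:00): 79, GPU has fallen off the bus'

echo "$SAMPLE" | extract_xids
```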
Question 9 of 60
9. Question
A system engineer is performing the initial bring-up of an NVIDIA HGX H100 system. After verifying the physical installation and power delivery, the engineer needs to perform a firmware synchronization across the baseboard management controller (BMC) and the complex GPU baseboard. Which sequence represents the most reliable method for ensuring the hardware is validated and firmware is consistent across all components before OS deployment?
Correct
Correct: A Access the BMC via the OOB network to verify the power and cooling health, update the BMC and BIOS firmware, then use the specialized HGX firmware update tools to synchronize the GPU baseboard and NVSwitch components.
The Technical Reason: HGX systems require a structured "bottom-up" update approach:
Foundation (OOB/BMC): Before any logic is applied, you must ensure the management controller (BMC) is stable and that the system's power/thermal environment is healthy. An unstable BMC during a GPU firmware flash can lead to a "bricked" baseboard.
Motherboard (BIOS): The BIOS must be updated next to ensure the PCIe enumeration and power delivery to the GPU tray are correctly handled by the host CPU.
GPU Tray (HGX/NVSwitch): Finally, specialized tools like nvfwupd or vendor-specific HMC (Host Management Controller) utilities are used. These tools handle the simultaneous flashing of all 8 GPUs and the interconnected NVSwitch chips to ensure they are on a matched "recipe" version.
The NCP-AII Context: The exam validates your knowledge of the sequence of events for deployment. Option A represents the only professional workflow that prevents version mismatch (which causes XID errors) and ensures system-wide stability.
Incorrect Options: B. Cold boot and immediately run HPL Running a high-intensity stress test like High-Performance Linpack (HPL) on unvalidated or mismatched firmware is dangerous. Mismatched firmware can cause incorrect thermal throttling or power-distribution failures under load, potentially damaging the hardware or producing "silent data corruption" that makes the test results meaningless.
C. Boot into Linux and use nvidia-smi to flash VBIOS nvidia-smi is a monitoring and management utility; it does not flash VBIOS at all (firmware flashing requires dedicated tools such as nvflash or nvfwupd). More importantly, HGX systems require the NVSwitch firmware and the GPU VBIOS to be updated as a synchronized "bundle." Updating just the VBIOS via the OS without verifying the BMC/HMC state often leads to fabric initialization failures.
D. Install Container Toolkit and run Docker The NVIDIA Container Toolkit is a high-level software component. Attempting to use it to "check" firmware compatibility is putting the cart before the horse. If the underlying firmware is inconsistent, the GPU drivers may fail to load entirely, preventing the Container Toolkit from even seeing the GPUs.
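The bottom-up ordering above can be captured as an explicit checklist. The tool names in the step text (ipmitool, nvfwupd) are commonly used for these tasks, but exact invocations vary by server vendor and are assumptions here:

```shell
#!/bin/sh
# Sketch: the bottom-up HGX bring-up order as an explicit checklist.
# Tool names in the step text are illustrative; consult vendor docs
# for the exact update commands on a given platform.

hgx_bringup_steps() {
    printf '%s\n' \
        "1: verify BMC health over the OOB network (e.g. ipmitool sdr list)" \
        "2: update BMC firmware, then the host BIOS" \
        "3: flash GPU baseboard and NVSwitch as one matched bundle (e.g. nvfwupd)" \
        "4: only then deploy the OS, driver, and CUDA stack"
}

hgx_bringup_steps
```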
Question 10 of 60
10. Question
An administrator is setting up the NVIDIA Container Toolkit on a compute node. Which configuration step is required to ensure that the Docker runtime can access the NVIDIA GPUs for hardware acceleration when a container is started?
Correct: A Configure the Docker daemon's 'daemon.json' file to include 'nvidia-container-runtime' as a registered runtime and set it as the default.
The Technical Reason: Docker, by default, uses the runc runtime, which has no inherent knowledge of NVIDIA GPUs.
The Shim: The nvidia-container-runtime acts as a specialized shim. When Docker starts a container, it hands off the process to this runtime.
The Hook: The runtime uses a "prestart hook" to query the host's NVIDIA drivers and mount the GPU device nodes (e.g., /dev/nvidia0) and driver libraries into the container's isolated filesystem.
The Entrypoint: Modifying /etc/docker/daemon.json to register the nvidia runtime is the standard way to enable the --gpus flag in Docker versions 19.03 and later.
The NCP-AII Context: The certification expects you to know the specific manual configuration steps. After editing the JSON file, you must execute sudo systemctl restart docker to apply the changes.
Incorrect Options: B. Install GPU drivers inside every Docker image This is a major architectural "anti-pattern."
Kernel Mismatch: GPU drivers consist of kernel modules that must match the host's running Linux kernel. If you bake drivers into an image, that image will only work on a host with that exact kernel and driver version.
The Correct Way: The NVIDIA Container Toolkit is designed so that the drivers live only on the host. The toolkit then "injects" the necessary user-space libraries into the container at launch, keeping images portable across different driver versions.
C. Use NGC CLI to download 'gpu-access-key' to the TPM This is factually incorrect. Access to GPUs in a standard AI cluster is managed by Linux device permissions and the container runtime, not by a "GPU access key" stored in the TPM (Trusted Platform Module). The TPM is used for hardware identity and disk encryption (Secure Boot), not for Docker-to-GPU authorization.
D. Enable 'GPU-Forwarding' in the BlueField-3 DPU While the BlueField-3 DPU can manage network and storage traffic, it is not responsible for "bridging the PCIe bus" for the Docker daemon on the host CPU. GPU access for containers is a software orchestration task handled by the NVIDIA Container Runtime on the host OS, regardless of whether a DPU is present in the system.
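As a concrete illustration of the registration step described above, the sketch below builds the standard daemon.json structure that registers nvidia-container-runtime and makes it the default. In practice this content lives in /etc/docker/daemon.json and must be merged with any existing settings there:

```python
import json

# Standard Docker daemon.json structure for registering the NVIDIA runtime.
# "runtimes.nvidia.path" points at the nvidia-container-runtime shim binary;
# "default-runtime" makes Docker hand every container off to it.
daemon_config = {
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

# On a real host you would write this (merged with existing keys) to
# /etc/docker/daemon.json and then run: sudo systemctl restart docker
print(json.dumps(daemon_config, indent=2))
```

After the restart, `docker run --gpus all …` (Docker 19.03+) will route through the NVIDIA runtime and see the host's GPUs.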
Question 11 of 60
11. Question
When optimizing performance for a cluster using AMD-based servers with NVIDIA GPUs, an administrator notices that the training jobs are running slower than expected. Upon investigation, they find that the GPU is connected to a PCIe root complex on a different NUMA node than the network card. Why does this impact performance?
Correct: D Data must travel across the inter-processor link (such as Infinity Fabric), which adds latency and reduces the available bandwidth for GPU-to-NIC communication.
The Technical Reason: In a modern AI server, high-speed communication (like GPUDirect RDMA) relies on a direct, low-latency path between the Network Interface Card (NIC) and the GPU.
The "Hop" Problem: If the NIC is connected to the PCIe lanes of CPU 0 and the GPU is connected to CPU 1, any data moving between them must cross the AMD Infinity Fabric (the inter-socket interconnect).
Performance Hit: This cross-socket jump introduces additional latency and consumes bandwidth on the CPU interconnect that should be reserved for CPU-to-memory or CPU-to-CPU tasks. For collective operations (like all_reduce in NCCL), this bottleneck can lead to significant scaling inefficiency.
The NCP-AII Context: The exam expects you to be able to use tools like nvidia-smi topo -m to identify these affinities. A professional administrator ensures that "local" affinity is maintained, meaning the GPU and the NIC used for its data transfers reside on the same NUMA node/PCIe root complex.
Incorrect Options: A. CPUs are not capable of communicating across NUMA nodes This is false. Modern operating systems and hardware are fully capable of communicating across NUMA nodes; it is simply less efficient. The system will not crash; it will just perform sub-optimally as data takes a longer path through the hardware.
B. GPU will automatically lower its voltage GPUs do not lower their voltage or clock speed based on the "distance" of a network card. While a GPU might throttle due to thermal issues or power limits, it has no internal logic to downclock itself simply because a NIC is on a different NUMA node. The performance drop is caused by external bus contention and latency, not internal GPU downclocking.
C. NUMA nodes are a software construct only for VMs NUMA is a physical hardware architecture fundamental to multi-socket and multi-die servers (like AMD EPYC and Intel Xeon). While hypervisors can present "virtual NUMA" to VMs, the underlying performance constraints are rooted in the physical traces on the motherboard and the silicon design of the CPUs.
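The affinity check described above can be automated. The sketch below parses a simplified, illustrative matrix in the style of nvidia-smi topo -m (the sample text is made up for this example, not real tool output) and flags any GPU-NIC path marked SYS, which in the real tool's legend means the path traverses the inter-socket (SMP) interconnect:

```python
# Illustrative, simplified "nvidia-smi topo -m"-style matrix (fabricated for
# this sketch). "SYS" marks a path crossing the inter-socket link (e.g.
# Infinity Fabric); "PIX" stays within a single PCIe root complex.
SAMPLE_TOPO = """\
        GPU0  GPU1  NIC0  NIC1
GPU0    X     NV18  PIX   SYS
GPU1    NV18  X     SYS   PIX
"""

def cross_numa_pairs(topo_text):
    """Return (gpu, nic) pairs whose interconnect path is 'SYS'."""
    lines = topo_text.strip().splitlines()
    headers = lines[0].split()          # column device names
    bad = []
    for row in lines[1:]:
        cells = row.split()
        dev, paths = cells[0], cells[1:]
        for name, path in zip(headers, paths):
            if name.startswith("NIC") and path == "SYS":
                bad.append((dev, name))
    return bad

# GPU0 should use NIC0 and GPU1 should use NIC1; the other pairings cross
# the socket boundary and would bottleneck GPUDirect RDMA traffic.
print(cross_numa_pairs(SAMPLE_TOPO))
```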
Question 12 of 60
12. Question
A storage optimization task is underway for an AI cluster where data loading is identified as the primary bottleneck. The administrator decides to implement NVIDIA GPUDirect Storage (GDS). What is the primary requirement for the network cards (NICs) to support this feature effectively for the cluster?
Correct: B The NICs must support RDMA (Remote Direct Memory Access) and be positioned on the same PCIe root complex as the GPUs.
The Technical Reason:
RDMA Requirement: GDS functions by using GPUDirect RDMA technology. This allows the network adapter (like a ConnectX-7) to write data directly into the GPU's High Bandwidth Memory (HBM) without involving the CPU or bouncing the data through system RAM. Standard TCP/IP requires CPU processing and multiple memory copies, which defeats the purpose of GDS.
PCIe Affinity: For the most efficient direct path, the NIC and the GPU should be connected to the same PCIe root complex (or the same PCIe switch in an HGX system). If data has to cross between CPU sockets (over UPI or Infinity Fabric), latency increases and available bandwidth drops sharply, significantly reducing the effectiveness of GDS.
The NCP-AII Context: The exam validates your understanding of the "Data Plane" vs. the "Control Plane." GDS is a data plane optimization that requires hardware-level alignment (PCIe peer-to-peer) and specific protocol support (RDMA/RoCE/InfiniBand).
Incorrect Options: A. NICs connected directly to the BMC management port The BMC (Baseboard Management Controller) port is intended for Out-of-Band (OOB) management, such as remote console access, power control, and thermal monitoring. It typically operates at 1GbE speeds and is completely separate from the high-speed data fabric. It lacks the bandwidth and the architectural wiring to handle AI training data or GDS transfers.
C. NICs must have integrated RGB lighting While some consumer or enthusiast hardware features RGB lighting, it has zero impact on functional performance, data integrity, or GDS support. In a professional data center or "AI Factory" environment, status is monitored via the BMC, NVIDIA Unified Fabric Manager (UFM), or NVIDIA Bright Manager, not by physical light colors on the NIC.
D. NICs must use the TCP/IP stack exclusively This is the opposite of what GDS requires. The standard TCP/IP stack is the primary reason for the bottlenecks GDS seeks to solve, as it requires the CPU to manage packet headers and data copying. To use GDS, the system must use RDMA-capable protocols (InfiniBand or RoCE).
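One practical consequence of the affinity requirement is NIC selection: given each device's NUMA node (on Linux this is exposed via /sys/bus/pci/devices/&lt;bdf&gt;/numa_node), each GPU can be paired with a NIC on its own node so GDS traffic never crosses the socket boundary. A minimal sketch with made-up device names and node assignments:

```python
# Hypothetical device -> NUMA-node mapping; on a real host these values
# could be read from /sys/bus/pci/devices/<bdf>/numa_node.
gpu_numa = {"gpu0": 0, "gpu1": 0, "gpu2": 1, "gpu3": 1}
nic_numa = {"mlx5_0": 0, "mlx5_1": 1}

def pick_local_nic(gpu, gpu_numa, nic_numa):
    """Prefer a NIC on the same NUMA node as the GPU; fall back to any NIC."""
    node = gpu_numa[gpu]
    for nic, nic_node in nic_numa.items():
        if nic_node == node:
            return nic
    # Remote fallback: still works, but traffic crosses the inter-socket link.
    return next(iter(nic_numa))

pairs = {g: pick_local_nic(g, gpu_numa, nic_numa) for g in gpu_numa}
print(pairs)
```

This mirrors what NCCL and GDS-aware data loaders do automatically when topology information is available.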
Question 13 of 60
13. Question
During the cluster test and verification phase, an administrator is using NVIDIA ClusterKit to perform a multifaceted node assessment. One of the tests fails with an error indicating ‘Inconsistent Signal Quality‘ on the NDR InfiniBand links. Which corrective action should be taken according to the NCP-AII guidelines?
Correct: C Inspect and clean the optical transceivers and fiber optic cables, and ensure all cables are properly seated and have no tight bends.
The Technical Reason: NDR InfiniBand uses PAM4 modulation, which encodes two bits per symbol to pack more data into the same signaling bandwidth but has a much lower tolerance for signal noise.
Contamination: A single microscopic speck of dust on a transceiver face or fiber tip can scatter the light signal, causing high Bit Error Rates (BER) or signal inconsistency.
Bend Radius: Fiber optic cables have a specific minimum bend radius. If a cable is bent too tightly (e.g., during poor cable management in the rack), it causes "micro-bending" losses where light leaks out of the core, degrading the signal.
Seating: A transceiver that isn't fully latched can result in an air gap between the optical interfaces, leading to signal reflection and instability.
The NCP-AII Context: The exam validates your ability to follow the Physical-to-Software troubleshooting hierarchy. Before assuming a logic error or a hardware failure, an administrator must always verify the integrity of the "L0" (physical) layer.
Incorrect Options: A. Disable the TPM 2.0 module The TPM (Trusted Platform Module) is a security chip used for platform integrity and encryption keys. It operates at a very low frequency and is electrically isolated from the high-speed differential pairs of the PCIe bus. Disabling it has no impact on the signal quality of an external InfiniBand network link.
B. Update the Slurm scheduler Slurm is a job scheduler (Layer 7/Application). It manages when and where jobs run but has no control over the physical network hardware or bit-level error correction. Bit errors are handled at the hardware/firmware level by Forward Error Correction (FEC), not by workload management software.
D. Reduce the GPU power limit While high temperatures can eventually affect transceivers if the overall ambient rack temperature rises, modern SXM and PCIe GPUs are physically separated from the network adapters. Reducing GPU power is an optimization for thermal management or power capping, but it is not a corrective action for a specific signal quality error on a network cable.
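To see why high-rate links are so sensitive to optics quality, some illustrative arithmetic on raw (pre-FEC) bit error rate; the figures are round example numbers, not a specification:

```python
# At a 400 Gb/s line rate, even a tiny raw (pre-FEC) bit error rate
# translates into a steady stream of errors that FEC must absorb; once
# the error rate exceeds what FEC can correct, the link degrades or flaps.
line_rate_bps = 400e9   # NDR-class link: 400 gigabits per second
raw_ber = 1e-9          # example degraded pre-FEC BER (e.g. dirty optics)

errors_per_second = line_rate_bps * raw_ber
print(errors_per_second)
```

Here a BER of 1e-9 means 400 raw bit errors arriving every second, which is why cleaning connectors and respecting bend radius matters at these speeds.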
Incorrect
Correct: C Inspect and clean the optical transceivers and fiber optic cables, and ensure all cables are properly seated and have no tight bends.
The Technical Reason: NDR InfiniBand uses PAM4 modulation, which packs more data into the same frequency but has a much lower tolerance for signal noise.
Contamination: A single microscopic speck of dust on a transceiver face or fiber tip can scatter the light signal, causing high Bit Error Rates (BER) or signal inconsistency.
Bend Radius: Fiber optic cables have a specific minimum bend radius. If a cable is bent too tightly (e.g., during poor cable management in the rack), it causes “micro-bending“ losses where light leaks out of the core, degrading the signal.
Seating: A transceiver that isn‘t fully latched can result in an air gap between the optical interfaces, leading to signal reflection and instability.
The NCP-AII Context: The exam validates your ability to follow the Physical-to-Software troubleshooting hierarchy. Before assuming a logic error or a hardware failure, an administrator must always verify the integrity of the “L0“ (Physical) layer.
Incorrect Options: A. Disable the TPM 2.0 module The TPM (Trusted Platform Module) is a security chip used for platform integrity and encryption keys. It operates at a very low frequency and is electrically isolated from the high-speed differential pairs of the PCIe bus. Disabling it has no impact on the signal quality of an external InfiniBand network link.
B. Update the Slurm scheduler Slurm is a job scheduler (Layer 7/Application). It manages when and where jobs run but has no control over the physical network hardware or bit-level error correction. Bit errors are handled at the hardware/firmware level by Forward Error Correction (FEC), not by workload management software.
D. Reduce the GPU power limit While high temperatures can eventually affect transceivers if the overall ambient rack temperature rises, modern SXM and PCIe GPUs are physically separated from the network adapters. Reducing GPU power is an optimization for thermal management or power capping, but it is not a corrective action for a specific signal quality error on a network cable.
Unattempted
Correct: C Inspect and clean the optical transceivers and fiber optic cables, and ensure all cables are properly seated and have no tight bends.
The Technical Reason: NDR InfiniBand uses PAM4 modulation, which packs two bits into every symbol but has a much lower tolerance for signal noise than NRZ.
Contamination: A single microscopic speck of dust on a transceiver face or fiber tip can scatter the light signal, causing high Bit Error Rates (BER) or signal inconsistency.
Bend Radius: Fiber optic cables have a specific minimum bend radius. If a cable is bent too tightly (e.g., during poor cable management in the rack), it causes "micro-bending" losses where light leaks out of the core, degrading the signal.
Seating: A transceiver that isn't fully latched can leave an air gap between the optical interfaces, leading to signal reflection and instability.
The NCP-AII Context: The exam validates your ability to follow the physical-to-software troubleshooting hierarchy. Before assuming a logic error or a hardware failure, an administrator must always verify the integrity of Layer 1 (Physical).
Incorrect Options: A. Disable the TPM 2.0 module The TPM (Trusted Platform Module) is a security chip used for platform integrity and encryption keys. It operates at a very low frequency and is electrically isolated from the high-speed differential pairs of the PCIe bus. Disabling it has no impact on the signal quality of an external InfiniBand network link.
B. Update the Slurm scheduler Slurm is a job scheduler (Layer 7/Application). It manages when and where jobs run but has no control over the physical network hardware or bit-level error correction. Bit errors are handled at the hardware/firmware level by Forward Error Correction (FEC), not by workload management software.
D. Reduce the GPU power limit While high temperatures can eventually affect transceivers if the overall ambient rack temperature rises, modern SXM and PCIe GPUs are physically separated from the network adapters. Reducing GPU power is an optimization for thermal management or power capping, but it is not a corrective action for a specific signal quality error on a network cable.
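The physical checks above can be triaged remotely before a visit to the rack. Below is a minimal Python sketch that flags weak optical receive power in transceiver digital-diagnostics text (shaped like `ethtool -m <iface>` output); the sample text, field name, and -10 dBm floor are illustrative assumptions, not vendor specifications.

```python
# Sketch: flag weak optical receive power from transceiver DDM output.
# A dirty ferrule, tight bend, or unseated module typically shows up as
# low rx power. Sample text and threshold are hypothetical.

SAMPLE_DDM = """\
Laser output power                        : 1.1000 mW / 0.41 dBm
Receiver signal average optical power     : 0.0457 mW / -13.40 dBm
Module temperature                        : 41.2 C
"""

RX_POWER_FLOOR_DBM = -10.0  # assumed alarm threshold for this sketch

def weak_rx_power(ddm_text: str, floor_dbm: float = RX_POWER_FLOOR_DBM):
    """Return the rx power in dBm if it is below the floor, else None."""
    for line in ddm_text.splitlines():
        if "Receiver signal average optical power" in line:
            # Take the trailing "<value> dBm" field after the slash
            dbm = float(line.rsplit("/", 1)[1].split()[0])
            if dbm < floor_dbm:
                return dbm
    return None

print(weak_rx_power(SAMPLE_DDM))  # → -13.4 (investigate cleaning/seating)
```

A reading well below the module's typical rx range is a cue to clean and reseat before suspecting firmware or hardware failure.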
Question 14 of 60
14. Question
A system engineer is performing the initial bring-up of a new NVIDIA HGX H100 system. After connecting the power cables and initializing the Baseboard Management Controller (BMC), the engineer notices that the system fails to complete the power-on sequence. Upon checking the Out-of-Band (OOB) management logs, there is an indication of a power-delivery mismatch. Which specific validation step should the engineer prioritize to ensure the server meets the high-density power requirements for full GPU operation?
Correct: B Verify that the Power Distribution Unit (PDU) provides sufficient voltage and that the BMC is configured for the correct power redundancy mode (N+N or N+1).
The Technical Reason: HGX H100 systems (like the DGX H100) typically utilize six 3.3 kW power supply units (PSUs).
Voltage Requirements: These PSUs require high-line AC power (200-240V). If the PDU is mistakenly providing low-line power (110-120V), the PSUs cannot reach their rated wattage, and the system will report a mismatch.
Redundancy Configuration: The BMC manages how these PSUs are grouped. In a standard 4+2 (N+2) configuration, the system requires four active PSUs to support the full 700W TDP of each of the eight H100 GPUs. If the BMC detects fewer active power paths than the redundancy policy requires, it will halt the power-on sequence or cap the GPU power so low that the system becomes non-functional.
The NCP-AII Context: The exam validates your ability to "Validate power and cooling parameters" during bring-up. Option B addresses the physical electrical source (PDU) and the logical management layer (BMC policy), which are the two most common failure points in high-density AI deployments.
Incorrect Options: A. Reinstall the NVIDIA Container Toolkit The NVIDIA Container Toolkit is a user-space software utility used to mount GPUs into Docker containers. It has no role in the server's hardware power-on sequence or the electrical handshake between the PSUs and the BMC. Software cannot fix a physical power-delivery mismatch that occurs before the OS has even loaded.
C. Replace OSFP transceivers with SFP+ OSFP (Octal Small Form-factor Pluggable) transceivers are required for the 400Gb/s NDR InfiniBand fabric. SFP+ is a legacy 10Gb/s standard that is physically and electrically incompatible with the ConnectX-7 slots in an H100 system. Furthermore, network transceivers draw negligible power compared to the 5.6kW+ required by the GPU tray; replacing them would not resolve a system-level power mismatch.
D. Perform a firmware downgrade using NVIDIA SMI nvidia-smi is a tool that runs inside the Operating System. If the system fails to complete its power-on sequence, the OS cannot boot, rendering nvidia-smi inaccessible. Additionally, a firmware mismatch usually requires an upgrade to a validated "Golden Recipe" rather than a blind downgrade to factory defaults.
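The low-line vs. high-line check can be scripted against BMC sensor readings. The sketch below parses text shaped like `ipmitool sensor` output and flags PSU input voltages below the 200 V high-line floor; the sensor names and sample readings are hypothetical.

```python
# Sketch: scan BMC sensor readings for PSU input voltages below the
# high-line AC range that 3.3 kW PSUs need to reach rated wattage.
# Sensor naming convention (*_VIN) and values are illustrative.

SAMPLE_SENSORS = """\
PSU1_VIN         | 207.000    | Volts      | ok
PSU2_VIN         | 118.000    | Volts      | ok
PSU3_VIN         | 208.000    | Volts      | ok
"""

HIGH_LINE_MIN_V = 200.0  # PSUs cannot reach rated wattage below this

def low_line_psus(sensor_text: str, floor_v: float = HIGH_LINE_MIN_V):
    """Return names of PSU input-voltage sensors reading below floor_v."""
    flagged = []
    for line in sensor_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 3 and fields[0].endswith("_VIN") and fields[2] == "Volts":
            if float(fields[1]) < floor_v:
                flagged.append(fields[0])
    return flagged

print(low_line_psus(SAMPLE_SENSORS))  # → ['PSU2_VIN']
```

A PSU showing ~110-120 V input points at the PDU circuit feeding it, not at the server, which matches the power-delivery-mismatch symptom described above.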
Question 15 of 60
15. Question
During a performance optimization phase, an administrator notices that the storage throughput for an AI training job is bottlenecked, even though the backend storage array is capable of higher speeds. The cluster uses an InfiniBand fabric. Which troubleshooting step would help identify if the issue is related to the network configuration for storage?
Correct: A Verify that 'PFC' (Priority Flow Control) or 'ECN' (Explicit Congestion Notification) is correctly configured on the switches to prevent packet loss during high-load periods.
The Technical Reason: AI workloads and high-speed storage (like NVMe-over-Fabrics) generate "incast" traffic patterns where multiple nodes send data to a single destination simultaneously.
PFC (Priority Flow Control): This mechanism prevents buffer overflow by sending "PAUSE" frames to the sender when a switch port's buffer reaches a certain threshold. This ensures a lossless fabric, which is mandatory for RDMA (Remote Direct Memory Access) to function without timing out.
ECN (Explicit Congestion Notification): ECN allows switches to mark packets when congestion is detected. The end-nodes see these marks and proactively throttle their transmission rates before the switch is forced to drop packets.
The NCP-AII Context: The exam validates your ability to optimize the "Data Plane." In an InfiniBand or RoCE (RDMA over Converged Ethernet) environment, packet loss is the most common cause of massive performance degradation. Verifying these Quality of Service (QoS) parameters is the standard first step in network-layer optimization.
Incorrect Options: B. Disable 'nvidia-peermem' The nvidia-peermem module is actually a requirement for GPUDirect RDMA. It allows the InfiniBand/Ethernet driver to "peer" directly with the NVIDIA GPU driver to facilitate direct memory transfers. Disabling it would force all data to be copied through the CPU and system RAM (the "bounce buffer"), which would significantly increase the bottleneck rather than fix it.
C. Change protocol to standard TCP/IP Standard TCP/IP is much slower than RDMA for AI workloads because it relies on the CPU to handle packet headers, checksums, and data copying. Switching to TCP/IP adds massive latency and CPU overhead. The NCP-AII path emphasizes moving away from TCP/IP toward RDMA-based protocols to maximize the "AI Factory" throughput.
D. Decrease the number of 'storage targets' Decreasing the number of storage targets (or parallel data streams) generally reduces aggregate bandwidth. Modern parallel filesystems (like Lustre, Weka, or BeeGFS) and high-performance storage arrays achieve their speed by "striping" data across many targets. Reducing them would limit the client's ability to pull data in parallel, worsening the bottleneck.
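As a concrete starting point for the PFC check, the sketch below parses a PFC table shaped like `mlnx_qos -i <iface>` output and confirms the priority carrying RDMA storage traffic is lossless. The sample text and the "priority 3 for RoCE" convention are assumptions for illustration.

```python
# Sketch: verify PFC is enabled on the traffic class used for RDMA
# storage traffic, given a QoS dump in mlnx_qos-like tabular form.
# Sample table and the chosen priority are hypothetical.

SAMPLE_QOS = """\
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   1   0   0   0   0
"""

def pfc_enabled_priorities(qos_text: str):
    """Return the set of priorities with PFC enabled (flag == '1')."""
    for line in qos_text.splitlines():
        if line.strip().startswith("enabled"):
            bits = line.split()[1:]  # eight 0/1 flags, one per priority
            return {p for p, bit in enumerate(bits) if bit == "1"}
    return set()

ROCE_PRIORITY = 3  # assumed traffic class for RDMA/storage traffic
print(ROCE_PRIORITY in pfc_enabled_priorities(SAMPLE_QOS))  # → True
```

If the storage priority is not in the enabled set, packet drops under incast load (and the resulting RDMA retransmits) are the likely cause of the throughput bottleneck.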
Question 16 of 60
16. Question
During the physical installation of GPU-based servers, the technician must validate that the cooling parameters meet the requirements for NVIDIA H100 GPUs. If the BMC reports that the GPU inlet temperature is nearing the thermal throttle limit despite low ambient room temperatures, what is the most likely physical configuration error?
Correct: B. The server is missing blanking panels in the rack, causing hot air recirculation into the cold aisle.
The NCP-AII certification blueprint explicitly includes "Validate power and cooling parameters" as a core task within the System and Server Bring-up domain, which comprises 31% of the examination.
Missing blanking panels in server racks is a well-documented physical configuration error that directly causes hot air recirculation from the hot aisle back into the cold aisle, raising GPU inlet temperatures even when ambient room temperatures are normal.
The Schneider Electric blog specifically states that "cabling congestion, missing blanking panels, and open rack gaps create recirculation paths that let hot air reenter the inlet. Sealing these paths solves the problem."
The comprehensive environmental monitoring guide confirms that "blanking panel management prevents recirculation degrading cooling effectiveness" and notes that "missing panels cause 5-10°C temperature increase."
This scenario directly matches the certification's troubleshooting domain, where identifying physical infrastructure issues like improper rack configuration is prioritized before component-level failures.
Incorrect: A. The GPU-based servers are configured with the wrong IP addresses in the OOB management network.
This is incorrect because IP address configuration in the Out-of-Band (OOB) management network has no impact on GPU inlet temperatures or cooling performance. The OOB network is used for remote management access to the Baseboard Management Controller (BMC), not for thermal management. Temperature readings from the BMC would still be accurate regardless of network configuration.
C. The storage array is connected via SAS instead of NVMe, increasing the heat density of the server chassis.
This is incorrect because storage interface type (SAS vs NVMe) does not significantly affect GPU inlet temperatures. While NVMe drives may have different thermal characteristics than SAS drives, the connection type is not a primary factor in rack-level cooling issues. The symptom of GPU inlet temperatures nearing throttle limits despite low ambient temperatures points to airflow recirculation problems, not storage-related heat generation.
D. The TPM is not initialized, preventing the fans from reaching their maximum RPM setpoints.
This is incorrect because the Trusted Platform Module (TPM) is a security chip for cryptographic operations and platform integrity and has no role in fan speed control or thermal management. Fan speed control is handled by the BMC and system thermal management firmware, independent of TPM initialization status.
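The recirculation signature described above (inlet temperature high while the room is cool) can be detected by comparing each GPU inlet sensor against the cold-aisle supply temperature. The sketch below is a minimal illustration; the sensor names, readings, and the 5 °C alert delta are assumptions drawn from the 5-10 °C figure quoted earlier.

```python
# Sketch: flag GPUs whose inlet temperature exceeds the cold-aisle
# supply temperature by more than a recirculation threshold.
# Values are hypothetical; real readings would come from the BMC SDR.

COLD_AISLE_SUPPLY_C = 22.0
RECIRCULATION_DELTA_C = 5.0  # assumed alert threshold (see 5-10 °C note)

inlet_temps_c = {"GPU0": 23.1, "GPU3": 31.5, "GPU7": 30.8}

def recirculation_suspects(inlets, supply_c, delta_c=RECIRCULATION_DELTA_C):
    """Return sensor names whose inlet temp exceeds supply by > delta_c."""
    return sorted(name for name, t in inlets.items() if t - supply_c > delta_c)

print(recirculation_suspects(inlet_temps_c, COLD_AISLE_SUPPLY_C))
# → ['GPU3', 'GPU7'] — check blanking panels and rack gaps near these nodes
```

A large, localized inlet-to-supply delta points at airflow paths (missing panels, open U spaces) rather than at the room's CRAC setpoint.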
Question 17 of 60
17. Question
A ClusterKit assessment is performed on a newly deployed AI cluster. The report indicates a failure in the ‘node-to-node‘ communication check. Which of the following is the most logical next step to narrow down the cause of this failure in a multi-node AI factory environment?
Correct: C Verifying the signal quality and firmware versions of the transceivers.
The Technical Reason: In a 400G (NDR) or 800G (XDR) environment, node-to-node communication failures are most frequently caused by physical layer (L1) issues or firmware mismatches.
Signal Quality: High-speed links use PAM4 modulation, which is highly sensitive to attenuation. A "dirty" fiber or a poorly seated transceiver can cause intermittent drops that fail ClusterKit's bandwidth and latency checks.
Firmware Consistency: NVIDIA requires a "Validated Recipe" where the ConnectX-7 firmware, the switch firmware, and the transceiver firmware are all in sync. A mismatch can prevent the link from training to its full width (e.g., running at x1 instead of x4).
The NCP-AII Context: The exam validates your ability to use the mlxlink tool. You are expected to check the Eye Diagram (mlxlink -e) and BER (Bit Error Rate) to confirm the physical health of the link before moving to software-level troubleshooting.
Incorrect: A. Replacing all Category 6 cables for the management network The Management Network (1GbE/10GbE) is used for Out-of-Band (OOB) tasks like IPMI and BMC access. ClusterKit's "node-to-node" check focuses on the Compute/Data Fabric (InfiniBand/Ethernet). Replacing management cables will not fix a failure in the high-speed data path used for AI training.
B. Reinstalling the operating system on the primary head node Node-to-node communication is a distributed check between compute nodes. The OS status of the head node (which primarily handles scheduling and management) would not typically cause a specific bandwidth or connectivity failure between two active compute nodes. This is an extreme "scorched earth" approach that ignores the most likely hardware/firmware culprits.
D. Reducing the GPU clock speed to decrease power consumption While power management is important, reducing GPU clock speeds affects compute performance (TFLOPS), not network connectivity. A communication failure in ClusterKit indicates that the "pipes" between the nodes are broken or restricted, which is independent of how fast the GPUs are processing data.
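The BER check mentioned above can be automated across many links. The sketch below extracts the raw physical BER from text shaped like an mlxlink counters report and compares it to a pass threshold; the field layout and the 1e-12 limit are illustrative assumptions, not NVIDIA-specified values.

```python
# Sketch: parse the raw physical BER out of an mlxlink-style report and
# decide whether the link warrants physical-layer investigation.
# Sample report text and the acceptance limit are hypothetical.

SAMPLE_MLXLINK = """\
Physical Counters and BER Info
------------------------------
Raw Physical BER           : 3E-09
Effective Physical BER     : 1E-15
"""

RAW_BER_LIMIT = 1e-12  # assumed acceptance threshold for this sketch

def raw_ber(report: str) -> float:
    """Return the raw physical BER as a float, e.g. 3e-09."""
    for line in report.splitlines():
        if line.startswith("Raw Physical BER"):
            return float(line.split(":", 1)[1].strip())
    raise ValueError("Raw Physical BER not found")

ber = raw_ber(SAMPLE_MLXLINK)
print(ber > RAW_BER_LIMIT)  # → True: inspect transceiver/cable first
```

Running a loop like this over every ClusterKit-flagged node pair quickly separates a single bad cable from a fabric-wide firmware mismatch.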
Question 18 of 60
18. Question
An AI training job is failing with ‘GPU fell off the bus‘ errors. After checking the logs, the administrator sees numerous PCIe correctable errors before the failure. What is the most appropriate troubleshooting step for this hardware fault according to NVIDIA best practices?
Correct: C Inspect the physical GPU seating and clean the PCIe gold fingers.
The Technical Reason: The presence of “PCIe correctable errors“ prior to a total failure is a classic symptom of physical signal degradation.
Physical Seating: In high-density AI servers, thermal expansion and contraction (thermal cycling) or vibration during shipping can cause a GPU to “creep“ out of its PCIe slot.
Contamination: Microscopic dust or oils on the “gold fingers“ (the PCB contacts) increase electrical resistance and cause signal noise.
The Solution: NVIDIA best practices for field service involve re-seating the card and using isopropyl alcohol to clean the contacts. This restores the electrical integrity of the high-speed differential pairs required for Gen4 or Gen5 PCIe speeds.
The NCP-AII Context: The exam validates your ability to differentiate between Layer 1 (Physical) hardware faults and Layer 7 (Application) software bugs. “Correctable errors“ are hardware-level warnings handled by the PCIe AER (Advanced Error Reporting) mechanism.
Incorrect Options: A. Increase the Slurm job timeout value. Slurm timeouts manage how long a scheduler waits for a task to respond. Increasing the timeout will not fix a hardware disconnect. If the GPU has "fallen off the bus," the OS can no longer see the device, and the training process will crash regardless of how long the scheduler waits.
B. Reinstall the Pyxis plugin for Slurm. Pyxis is a software plugin that allows Slurm to interact with the Enroot container runtime. While it is essential for running jobs, it operates entirely in the user-space software layer. It has no capability to resolve physical PCIe bus errors or hardware-level "lost device" states.
D. Update the NGC CLI. The NGC CLI is a command-line tool used to pull container images and manage datasets from the NVIDIA GPU Cloud. It is an administrative utility that does not interact with the GPU hardware or the Linux kernel's PCIe bus management. Updating it has no impact on physical hardware stability.
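When triaging this class of fault, the correctable-error counts that AER writes to the kernel ring buffer are the first thing to inventory. A minimal sketch, assuming dmesg-style message formats; the sample log lines, device addresses, and the `count_corrected_aer` helper are illustrative, not from a real system:

```python
import re

# Illustrative dmesg-style lines; real AER messages vary by kernel version.
SAMPLE_DMESG = """\
pcieport 0000:40:01.1: AER: Corrected error received: 0000:41:00.0
nvidia 0000:41:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer
nvidia 0000:41:00.0: AER: device [10de:2330] error status/mask=00000001/0000e000
pcieport 0000:40:01.1: AER: Corrected error received: 0000:41:00.0
NVRM: Xid (PCI:0000:41:00): 79, GPU has fallen off the bus.
"""

def count_corrected_aer(log: str) -> dict:
    """Count corrected AER events per PCI endpoint address."""
    counts = {}
    for line in log.splitlines():
        if "AER" in line and "Corrected" in line:
            addrs = re.findall(r"\d{4}:\d{2}:\d{2}\.\d", line)
            if addrs:
                # The last address on the line is the reporting endpoint.
                counts[addrs[-1]] = counts.get(addrs[-1], 0) + 1
    return counts

print(count_corrected_aer(SAMPLE_DMESG))  # {'0000:41:00.0': 3}
```

A rising per-device count like this, followed by an Xid 79, is the signature that points at the physical layer rather than the software stack.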
Question 19 of 60
19. Question
When configuring Multi-Instance GPU (MIG) for a High-Performance Computing (HPC) workload that requires high memory bandwidth, the administrator must choose between different slice sizes. If an H100 GPU is being partitioned, what is the maximum number of GPU instances that can be created, and what is the primary benefit of this isolation?
Correct
Correct: A 7 instances; providing dedicated memory and compute to each user.
The Technical Reason: The NVIDIA H100 Tensor Core GPU architecture supports partitioning into up to 7 hardware-isolated GPU instances.
Hardware Isolation: Unlike traditional time-slicing, MIG provides each instance with its own dedicated slice of the GPU's hardware resources, including the SMs (Streaming Multiprocessors), L2 cache, and HBM (High Bandwidth Memory).
Predictable Performance: Because the paths to memory and compute are physically partitioned, a "noisy neighbor" (a heavy job on one instance) cannot impact the latency or throughput of another instance. This is essential for HPC workloads requiring consistent memory bandwidth.
The NCP-AII Context: The exam tests your knowledge of the "7-slice" limit for the A100 and H100. It also emphasizes that MIG is the solution for Fault Isolation: if a process crashes on one MIG instance, it does not affect the others.
Incorrect Options: B. 2 instances; ensuring at least 40GB of memory. While you can create two large instances (e.g., two 3g.40gb slices), this is not the maximum number of instances possible. The H100 allows for much finer granularity. Furthermore, the 80GB H100 memory capacity would naturally be split, but "2" is not the architectural limit defined in the NCP-AII curriculum.
C. 16 instances; massive parallelization of small scripts. 16 instances exceeds the physical hardware partitioning limit of the H100. While users can run many processes on a single GPU using MPS (Multi-Process Service), MPS does not offer the same hardware-level memory protection and fault isolation as MIG. In MIG, the hardware crossbar only supports up to 7 separate profiles.
D. 32 instances; primarily used for VDI. 32 instances is far beyond the MIG specification. Additionally, while MIG can be used in virtualization, the primary focus of the NCP-AII certification is AI Infrastructure and HPC, not Virtual Desktop Infrastructure (VDI). VDI typically utilizes vGPU (Virtual GPU) software profiles, which function differently than the hardware-level partitioning of MIG.
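The 7-instance ceiling can be illustrated with the H100 80GB MIG profile table. The profile names and counts below are an assumption taken from NVIDIA's public MIG documentation; verify them on real hardware with `nvidia-smi mig -lgip`:

```python
# H100 80GB GPU instance profiles, mapped as
# name -> (compute slices, memory in GB, max instances of that profile).
# Values assumed from NVIDIA's public MIG user guide, not queried live.
H100_80GB_PROFILES = {
    "1g.10gb": (1, 10, 7),
    "1g.20gb": (1, 20, 4),
    "2g.20gb": (2, 20, 3),
    "3g.40gb": (3, 40, 2),
    "4g.40gb": (4, 40, 1),
    "7g.80gb": (7, 80, 1),
}

def max_instances() -> int:
    """The architectural ceiling is the largest per-profile instance count,
    reached with the smallest (1g.10gb) profile."""
    return max(count for _, _, count in H100_80GB_PROFILES.values())

print(max_instances())  # 7
```

Note how the option-B scenario from the question (two 3g.40gb slices) is a valid configuration in this table, but not the maximum.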
Question 20 of 60
20. Question
A site reliability engineer is performing a "burn-in" test on a new cluster using the NeMo training framework. Why is a framework-specific burn-in test like NeMo preferred over simple synthetic benchmarks during the final stage of cluster verification?
Correct
Correct: A It validates that the entire software stack and fabric can handle real-world AI model training patterns.
The Technical Reason: NVIDIA NeMo is a cloud-native framework for building, customizing, and deploying generative AI models.
The Full Stack: A NeMo burn-in test exercises the entire "NVIDIA Golden Stack": from the H100 GPU hardware and NDR InfiniBand fabric up through the drivers, the Enroot/Pyxis container stack, and the NCCL (NVIDIA Collective Communications Library) collectives.
Communication Patterns: Unlike a simple ping-pong latency test, NeMo performs complex All-Reduce, All-to-All, and Reduce-Scatter operations typical of 3D parallelism (Data, Pipeline, and Tensor parallelism). This is the ultimate stress test for identifying subtle packet drops or GPU-to-GPU synchronization issues that only appear during massive, distributed AI training.
The NCP-AII Context: The exam emphasizes that "Ready for Production" means more than just passing a POST check. It means the cluster can sustain a multi-day training run without an XID error or an NCCL timeout.
Incorrect: B. Only way to check if power cables are plugged in. This is a basic physical check. If power cables were not plugged in, the server would not boot, and the BMC (Baseboard Management Controller) would report a critical power alert long before a user could even attempt to load the NeMo framework.
C. It reduces the power consumption of the GPUs. Actually, the opposite is true. NeMo training jobs are designed to maximize GPU utilization (TFLOPS) and memory bandwidth. A framework-specific burn-in test is intended to maximize power draw to ensure the data center's cooling and PDU (Power Distribution Unit) capacity can handle the peak thermal load of a real AI model.
D. It automatically updates the firmware of the Mellanox switches. Frameworks like NeMo operate at the Application Layer. Firmware updates for NVIDIA Quantum-2 (Mellanox) switches are handled at the Infrastructure Layer using tools like mlxfwmanager or NVIDIA Base Command Manager. An AI training framework does not have the permissions or the function to modify network hardware microcode.
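The All-Reduce pattern such a burn-in hammers can be sketched in plain Python. This models only the logical ring schedule (reduce-scatter followed by all-gather) with one number per chunk for brevity; it is not NCCL's pipelined, topology-aware implementation:

```python
def ring_allreduce(chunks_per_rank):
    """Toy ring All-Reduce: N ranks each hold N chunks (a single float per
    chunk here). Afterwards every rank holds the sum of each chunk across
    all ranks. Each rank moves 2*(N-1) chunks around the ring, the same
    logical schedule NCCL pipelines over NVLink and InfiniBand."""
    n = len(chunks_per_rank)
    data = [list(c) for c in chunks_per_rank]

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n
    # to its ring neighbor, which accumulates it. After n-1 steps, rank r
    # owns the fully reduced chunk (r + 1) mod n.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r - step) % n, data[r][(r - step) % n])
                 for r in range(n)]
        for dst, idx, val in sends:
            data[dst][idx] += val

    # Phase 2, all-gather: circulate the reduced chunks until every rank
    # holds all of them.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - step) % n, data[r][(r + 1 - step) % n])
                 for r in range(n)]
        for dst, idx, val in sends:
            data[dst][idx] = val
    return data

# Four ranks, each contributing four chunks; every rank ends with the sums.
print(ring_allreduce([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]]))
```

A single dropped or delayed transfer anywhere in either phase stalls every rank, which is exactly why these collectives expose marginal links that a point-to-point benchmark misses.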
Question 21 of 60
21. Question
When optimizing performance for an AI cluster consisting of nodes with AMD EPYC processors and NVIDIA H100 GPUs, which BIOS or OS tuning parameter is most critical for ensuring low-latency communication between the CPUs and GPUs for data feeding?
Correct
Correct: D Ensuring the IOMMU is configured correctly and setting the Determinism Slider to Performance to maintain consistent CPU clock frequencies.
Determinism Slider (Performance): AMD EPYC processors feature a "Determinism Slider" in the BIOS. In "Power" (or "Efficiency") mode, the CPU varies its clock speed to save energy. For AI workloads, this creates latency jitter. Setting it to Performance (or Deterministic) ensures the CPU maintains a consistent, predictable high frequency, which is critical for the synchronized timing required in GPU data transfers.
IOMMU (Input-Output Memory Management Unit): For NVIDIA GPUs to communicate efficiently with the CPU (and for features like GPUDirect RDMA), the IOMMU must be configured correctly. While it can sometimes add overhead, in modern NVIDIA-Certified Systems it is essential for the secure and mapped translation of memory addresses between the CPU and the high-speed PCIe devices.
Incorrect Options: A. Enabling Eco-Mode in the BIOS… Eco-Mode is a power-saving feature that caps the CPU's TDP (Thermal Design Power). In an AI infrastructure context, this is counterproductive: reducing the CPU power limit leads to throttling, where the CPU cannot process data fast enough to keep the H100 GPUs busy. Furthermore, CPU thermal headroom has no direct physical impact on the independent cooling of NVLink switches.
B. Setting the BlueField-3 DPU to Bridge Mode to allow AMD Infinity Fabric to manage InfiniBand… This describes a technical impossibility. AMD Infinity Fabric is an internal CPU-to-CPU or CPU-to-die interconnect; it cannot manage external InfiniBand traffic or take over the functions of an NVIDIA DPU. BlueField-3 DPUs manage network traffic using the NVIDIA DOCA stack, not the CPU's internal fabric.
C. Disabling all PCIe Gen5 lanes and forcing Gen3… This is the opposite of optimization. The NVIDIA H100 is designed to utilize full PCIe Gen5 x16 bandwidth (128 GB/s bidirectional). Forcing the system to Gen3 (32 GB/s bidirectional) would cut available bandwidth by 75%, severely bottlenecking the AI cluster. The NVIDIA Container Toolkit does not require a lower PCIe version to reduce "retries"; it is designed to run on the latest hardware.
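One practical server-side check for the IOMMU half of this answer is inspecting the kernel command line. A minimal sketch, assuming the standard upstream flag names (`amd_iommu=`, `intel_iommu=`, `iommu=`); the sample command line and the `iommu_settings` helper are illustrative:

```python
def iommu_settings(cmdline: str) -> dict:
    """Extract IOMMU-related flags from a kernel command line string
    (the format of /proc/cmdline). Pass-through mode (iommu=pt) is the
    commonly recommended setting for GPU servers: it keeps the IOMMU
    enabled while avoiding translation overhead on the DMA fast path."""
    flags = {}
    for token in cmdline.split():
        if token.startswith(("iommu=", "amd_iommu=", "intel_iommu=")):
            key, _, value = token.partition("=")
            flags[key] = value
    return flags

# Illustrative EPYC GPU-node command line:
sample = "BOOT_IMAGE=/vmlinuz root=/dev/md0 amd_iommu=on iommu=pt"
print(iommu_settings(sample))  # {'amd_iommu': 'on', 'iommu': 'pt'}
```

On a live node you would read the real string with `open("/proc/cmdline").read()` instead of the sample.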
Question 22 of 60
22. Question
A developer needs to run a specialized AI model using a specific version of the NVIDIA Container Toolkit and a custom Docker image. Which command sequence correctly demonstrates how to utilize a GPU within a Docker container on a properly configured NVIDIA-certified host?
Correct
Correct: B docker run --gpus all --rm nvidia/cuda:12.0-base nvidia-smi
--gpus all: This is the modern, standard flag introduced in Docker 19.03+. It instructs the NVIDIA Container Runtime to expose all available GPUs on the host to the container. This is a primary focus of the NCP-AII exam.
--rm: A best practice in AI infrastructure to automatically remove the container after the process exits, preventing "container sprawl" on the host.
nvidia/cuda:12.0-base: Uses an official NVIDIA-certified base image that includes the necessary libraries to communicate with the GPU kernel driver.
nvidia-smi: The standard utility used to verify that the GPU is visible and functional within the container environment.
Incorrect: A. docker exec -u gpu-user my_container ./start_training.sh --with-gpu Reason: The docker exec command is used to run a command in an already running container. It does not handle the initial resource allocation or driver mapping required to access a GPU. Furthermore, --with-gpu is a script-specific flag, not a Docker or NVIDIA runtime parameter.
C. nvidia-docker start --container-id=auto --memory-limit=unlimited Reason: This uses deprecated syntax. While nvidia-docker (version 1) was common in the past, it has been superseded by the nvidia-container-toolkit. Modern NVIDIA-certified infrastructure uses the standard docker command with the --gpus flag. Additionally, --container-id=auto is not a valid Docker or NVIDIA-specific flag.
D. docker run --use-cuda-cores=max -it ubuntu:latest run-ai-model Reason: This is technically incorrect because --use-cuda-cores=max is a made-up flag; Docker does not have a native flag to limit or maximize CUDA cores. To pass GPU resources, the --gpus flag must be used. Furthermore, a plain ubuntu:latest image does not contain the NVIDIA driver libraries or CUDA toolkit necessary to interface with the hardware unless they are manually installed (which contradicts the "standard" sequence taught in the NCP-AII).
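The verification command can be composed programmatically, which also shows where a GPU subset would be selected. This is a sketch that only builds and prints the argv list (the `BASE` list and `with_gpus` helper are hypothetical names); actually executing it assumes a host with the NVIDIA driver and NVIDIA Container Toolkit installed:

```python
# Argv form of the correct answer; running it requires Docker 19.03+ plus
# the NVIDIA Container Toolkit on the host, so here we only compose it.
BASE = ["docker", "run", "--rm", "nvidia/cuda:12.0-base", "nvidia-smi"]

def with_gpus(selector="all"):
    """Insert the --gpus flag after 'docker run'. 'all' exposes every GPU;
    for a subset, Docker's documented CSV form wraps the value in literal
    double quotes, i.e. the shell argument '"device=0,1"'."""
    return BASE[:2] + ["--gpus", selector] + BASE[2:]

print(" ".join(with_gpus("all")))
# docker run --gpus all --rm nvidia/cuda:12.0-base nvidia-smi
```

Passing the command as an argv list (e.g. to `subprocess.run`) avoids a layer of shell quoting, which matters for the comma-containing `device=` selector.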
Question 23 of 60
23. Question
An IT professional is installing NVIDIA Base Command Manager (BCM) to manage a new AI cluster. During the setup of the control plane, it is necessary to configure High Availability (HA) for the head nodes. What is a fundamental requirement for ensuring that the BCM head nodes can failover correctly without losing the cluster's state?
Correct: C Establishing a dedicated heartbeat network between the head nodes and using a shared or synchronized database for the cluster configuration.
The Technical Reason: BCM (formerly Bright Cluster Manager) uses a specific “Failover Object” architecture.
Heartbeat/Failover Network: A reliable, low-latency link (often a dedicated RJ45 cable or a specific VLAN) is required for the head nodes to monitor each other’s “liveness.” If the secondary node stops receiving heartbeats from the primary, it triggers the failover process.
Database Sync: The cmdaemon database (and the workload manager database, e.g., MySQL for Slurm) must be synchronized in real-time between the two nodes. During the cmha-setup process, BCM establishes a replication stream so that the secondary head node always has an up-to-date copy of the cluster’s state.
Shared Storage: For a complete HA solution, shared storage (like NFS) is used for home directories and software images, ensuring the environment is identical regardless of which head node is active.
The NCP-AII Context: The exam expects you to know the sequence of cmha-setup. A vital part of this is selecting the “internal” or a “dedicated” network for the heartbeat and ensuring the secondary node is cloned correctly.
Incorrect Options: A. BlueField-3 DPUs as primary masters While BlueField-3 DPUs can offload networking and security tasks, they do not serve as the “Master” or “Head Node” for the BCM control plane software. BCM runs on standard x86 or Arm-based servers acting as central management units. DPUs are managed by the BCM head nodes, not the other way around.
B. NGC CLI replication to a public bucket The NGC CLI is used for pulling AI containers and models, not for backing up the system-level state of a BCM head node. Furthermore, high availability requires local, near-instantaneous failover; relying on a public cloud bucket for emergency recovery would introduce unacceptable downtime and would not constitute a “High Availability” cluster state.
D. Installing GPU drivers on the BMC The BMC (Baseboard Management Controller) is a small, independent processor used for hardware-level monitoring (power, fans, temperature). It does not have the compute power or the architecture to run NVIDIA GPU drivers or a Slurm database. Failover is a software logic handled by the OS on the head nodes, not by the BMC firmware.
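The heartbeat principle described above can be illustrated with a minimal sketch (conceptual only: BCM’s cmdaemon implements this natively, and the peer address and threshold here are assumptions, not cmha-setup parameters):

```shell
PEER="10.141.255.254"     # assumption: primary head node on the internal failover network
FAILOVER_THRESHOLD=3      # consecutive missed heartbeats before takeover
MISSES=0

# One probe over the dedicated heartbeat link; a miss increments the counter,
# a success resets it, so a single transient drop does not trigger failover.
probe_peer() {
    if ping -c 1 -W 1 "$PEER" >/dev/null 2>&1; then
        MISSES=0
    else
        MISSES=$((MISSES + 1))
    fi
}

# Takeover is justified only after sustained silence from the primary.
should_failover() {
    [ "$MISSES" -ge "$FAILOVER_THRESHOLD" ]
}
```

The key design point mirrors BCM’s behavior: failover is a decision based on repeated misses over a dedicated link, which is why that link must be reliable and low-latency.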
Question 24 of 60
24. Question
A cluster node is reporting frequent GPU Fan Speed Out of Range errors in the system logs. Although the GPU temperatures appear normal at the moment, the administrator wants to prevent a potential hardware failure during an upcoming month-long training run. What is the most appropriate action to take according to the NCP-AII troubleshooting guidelines?
Correct: C Identify the specific faulty fan using the BMC web interface, then schedule a maintenance window to replace the fan or the entire GPU assembly as required.
The Technical Reason: Modern AI servers use a sophisticated Baseboard Management Controller (BMC) to monitor the health of every component.
Predictive Failure: A “Fan Speed Out of Range” error is a leading indicator of mechanical bearing failure or an obstruction. Even if temperatures are currently normal, a fan operating outside its defined RPM (Revolutions Per Minute) range cannot guarantee cooling during the 100% duty cycle of a month-long training run.
Replacement Protocol: Depending on the server architecture (e.g., a PCIe-based server vs. an integrated HGX baseboard), you may replace a hot-swappable system fan or, in some cases, require an RMA (Return Merchandise Authorization) for the entire GPU if the integrated cooling shroud has failed.
The NCP-AII Context: The exam validates your ability to use Out-of-Band (OOB) management tools. The BMC provides the most accurate sensor data (IPMI/Redfish) independent of the Operating System.
Incorrect Options: A. Swap the VBIOS with a version from a different manufacturer Flashing a VBIOS (Video BIOS) from a different manufacturer is a violation of NVIDIA support policies and will likely “brick” the GPU. VBIOS versions are specifically tuned for the electrical and thermal characteristics of a particular board design. Lowering the threshold does not fix the mechanical failure; it simply hides the symptom while increasing the risk of a fire or hardware meltdown.
B. Use an industrial floor fan While this might provide temporary relief in an emergency, it is not a professional infrastructure solution. Open server panels disrupt the engineered Airflow Pressure (shrouding) inside the chassis, which can actually cause other components (like the CPU or NVSwitches) to overheat because the air is no longer being pulled through the intended heat sinks.
D. Manually override internal thermal protection NVIDIA GPU thermal protections (Hardware Slowdown and Thermal Shutdown) are hardcoded into the silicon and firmware to prevent catastrophic hardware failure. Attempting to bypass these safety limits via software is not supported and would lead to a “dead” GPU as soon as the fan fails completely and the temperature spikes.
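As a sketch of the out-of-band check described above, the fan sensors can also be read from the BMC on the command line with ipmitool (the address and credentials below are placeholders; the same data is visible in the BMC web interface):

```shell
BMC_HOST="10.0.0.50"      # assumption: BMC address of the affected node
BMC_USER="admin"          # assumption: BMC credentials; password read from $BMC_PASS

# Query only the fan sensor records over IPMI-over-LAN. Each row shows the
# current RPM alongside the thresholds that define the "out of range" event,
# which identifies exactly which fan to replace during the maintenance window.
check_fans() {
    ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" sdr type Fan
}
```

Because this path goes through the BMC rather than the OS, it keeps working even if the node is hung or mid-reinstall, which is the point of Out-of-Band management.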
Question 25 of 60
25. Question
As part of the multifaceted node assessment, an administrator runs ClusterKit. What is the primary purpose of ClusterKit in an NVIDIA-certified environment, and how does it differ from a standard HPL or NCCL test run during the verification process?
Correct: C ClusterKit performs a comprehensive suite of hardware and software checks, including driver versions, GPU health, and peer-to-peer connectivity.
The Technical Reason: Unlike a single-purpose benchmark, ClusterKit is a multifaceted validation tool. It orchestrates several tests to ensure “Cluster Readiness”:
Software Audit: It verifies that every node has a consistent “NVIDIA Golden Stack” (matching Driver, CUDA, and Fabric Manager versions).
Hardware Health: It leverages DCGM to check for retired pages, XID errors, and thermal throttling.
Fabric Validation: It runs NCCL tests to confirm that point-to-point and collective communications (All-Reduce) achieve the expected bandwidth across the InfiniBand/RoCE fabric.
The Difference: While a standard HPL test only stresses the floating-point units (FPUs) and NCCL only tests the network, ClusterKit combines these with configuration checks to ensure no “silent” configuration drift exists across the fleet.
Incorrect: A. Automatically install Windows OS NVIDIA AI Infrastructure and BCM (Base Command Manager) almost exclusively utilize Linux distributions (Ubuntu, RHEL, or Rocky Linux) because the NVIDIA container stack (Enroot/Pyxis) and InfiniBand drivers are optimized for the Linux kernel. ClusterKit is a validation tool, not an OS deployment or imaging utility.
B. A game engine for VR visualization While NVIDIA is a leader in gaming and Omniverse (virtualization), ClusterKit is a strictly technical command-line utility for system administrators. It does not provide VR simulations; it provides log files, JSON reports, and pass/fail metrics for hardware integrity.
D. A physical tool kit (screwdrivers and wrenches) This is a literal interpretation of the word “kit.” In the NCP-AII certification, “Kit” refers to a Software Development Kit (SDK) or a Validation Suite. Physical assembly tools are part of data center facility management, whereas ClusterKit is part of the digital Control Plane and Verification workflow.
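The “configuration drift” audit described above can be sketched by hand (node names and passwordless SSH are assumptions here; ClusterKit automates and greatly extends this kind of check):

```shell
NODES="node01 node02 node03"   # assumption: illustrative node list

# Collect the reported GPU driver version from each node; the query flags
# are standard nvidia-smi options.
collect_versions() {
    for n in $NODES; do
        ssh "$n" "nvidia-smi --query-gpu=driver_version --format=csv,noheader" | head -n 1
    done
}

# More than one unique version across the cluster means silent drift:
# the nodes are no longer running an identical "Golden Stack."
count_unique_versions() {
    collect_versions | sort -u | grep -c .
}
```

A fleet-wide validation tool repeats this pattern for CUDA, Fabric Manager, firmware, and health counters, then reports pass/fail per node instead of leaving the comparison to the operator.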
Question 26 of 60
26. Question
When configuring a BlueField-3 DPU to support high-performance AI workloads, which feature must be correctly implemented to allow for efficient communication between the GPU memory and the network without involving the host CPU’s system memory?
Correct: C GPUDirect RDMA, which requires the peer-to-peer (P2P) capability to be supported and enabled between the DPU and the GPU over the PCIe bus.
The Technical Reason: GPUDirect RDMA is a key technology that allows a network interface (like the one on the BlueField-3 DPU) to read from or write to GPU memory directly.
Bypassing the CPU: Traditionally, data must be copied from the GPU to the system RAM (CPU memory) before it can be sent over the network. GPUDirect RDMA removes this “bounce buffer,” significantly reducing latency and CPU overhead.
P2P Requirement: For this to work, the PCIe fabric (including any PCIe switches) must support Peer-to-Peer (P2P) transactions. This allows the DPU to access the GPU’s memory space directly over the PCIe bus.
The NCP-AII Context: The exam validates your ability to “Confirm FW/SW on BlueField-3” and ensure the system is “workload ready.” Correct implementation involves ensuring the nvidia-peermem kernel module is loaded and that the BIOS/PCIe topology doesn’t block P2P traffic.
Incorrect Options: A. NVIDIA SMI migration While nvidia-smi is a powerful management tool, there is no such feature as “SMI migration” that moves GPU memory pages to the DPU’s internal DDR5 memory. The DPU’s onboard memory is primarily used for its own Arm-based OS, DOCA applications, and packet buffering, not as a secondary swap space for GPU VRAM.
B. Slurm scheduler’s DPU-plugin for SSH Slurm is used for job scheduling, and while it can manage DPU resources, it does not “partition Arm cores into MIG-like instances” to handle SSH connections. MIG (Multi-Instance GPU) is a specific hardware partitioning feature for NVIDIA GPUs (like the A100 or H100), not for the Arm CPU cores on a DPU.
D. Encapsulated Remote Port Mirroring (ERSPAN) ERSPAN is a network monitoring protocol used to mirror traffic from one port to another across a Layer 3 network for analysis or sniffing. It is a troubleshooting and security tool, not a data-path technology for high-performance GPU-to-network communication. It would be inefficient and inappropriate for real-time backup of GPU memory.
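A quick host-side sketch of verifying the prerequisites above (run on the host OS, not in a container; `nvidia-smi topo -m` is the standard way to inspect the PCIe path between GPU and NIC):

```shell
# Is the GPUDirect RDMA kernel module loaded? lsmod lists loaded modules,
# so this works on any Linux host with the NVIDIA driver stack installed.
check_peermem() {
    if lsmod 2>/dev/null | grep -q nvidia_peermem; then
        echo "nvidia-peermem: loaded"
    else
        echo "nvidia-peermem: not loaded"
    fi
}
check_peermem

# On a real node, also inspect the GPU/NIC PCIe topology:
#   nvidia-smi topo -m
# PIX/PXB (shared PCIe switch) favors P2P; PHB/NODE/SYS means traffic must
# cross the root complex, which can disqualify or slow GPUDirect RDMA.
```

Together these two checks cover the software (peermem module) and hardware (P2P-capable path) requirements the explanation lists.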
Question 27 of 60
27. Question
To facilitate large-scale deep learning training, a cluster administrator is setting up Slurm with the Enroot and Pyxis plugins. What is the primary advantage of using this specific combination (Enroot + Pyxis) over traditional Docker containers for running multi-node training jobs on an NVIDIA-certified AI cluster?
Correct: D This combination provides a more “HPC-native” experience by allowing users to run containerized workloads as unprivileged users while seamlessly integrating with the Slurm resource manager.
The Technical Reason:
Enroot: Unlike Docker, which relies on a root-level daemon, Enroot is a “chroot-based” container runtime that turns container images into simple unprivileged sandboxes. This aligns with HPC security, where users should not have root access to the compute nodes.
Pyxis: This is a Slurm plugin that acts as the “bridge.” It allows users to use standard Slurm commands (like srun --container-image=…) to launch containers across hundreds of nodes without manually managing Docker pull commands or storage drivers on every node.
Performance: Enroot is designed for high-performance I/O and provides native-like performance for InfiniBand and GPU access, which is critical for multi-node training using NCCL.
The NCP-AII Context: The exam validates your ability to deploy a scalable software stack. The Enroot + Pyxis combination is the recommended “NVIDIA-Certified” path for multi-node clusters because it eliminates the security risks and overhead associated with the Docker daemon in a multi-tenant environment.
Incorrect Options: A. Enroot replaces the need for the GPU driver This is a fundamental misunderstanding of containerization. No container runtime (Enroot, Docker, or Singularity) replaces the host’s GPU driver. The driver contains kernel-space modules that must reside on the host OS. Enroot simply maps these host drivers into the container so the application can communicate with the hardware.
B. Pyxis converts Python code to C++ Pyxis is a resource management plugin, not a compiler or a code translation tool. It manages container lifecycle and integration with Slurm. Speed increases in AI models on HGX systems come from hardware acceleration (Tensor Cores) and optimized libraries (cuDNN, NCCL), not from automatic language conversion by the scheduler.
C. Containers run with root privileges for NCCL access Actually, one of the primary reasons for using Enroot is to avoid running as root. NCCL (NVIDIA Collective Communications Library) does not require root privileges to access InfiniBand hardware; it requires the correct user permissions for the InfiniBand device nodes (e.g., /dev/infiniband/uverbs0) and memory pinning capabilities, which Enroot handles securely for unprivileged users.
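As an illustration of the workflow above, a typical Pyxis launch looks like the following (the image tag, node counts, mount path, and script name are placeholders; --container-image and --container-mounts are the Pyxis-provided srun flags):

```shell
# Assemble the launch line; with Pyxis, srun itself pulls and starts the
# container via Enroot on every allocated node, with no Docker daemon and
# no root privileges involved.
SRUN_CMD="srun --nodes=4 --ntasks-per-node=8 --gpus-per-node=8 --container-image=nvcr.io/nvidia/pytorch:24.05-py3 --container-mounts=/data:/data python train.py"
echo "$SRUN_CMD"
```

Note how the container becomes just another srun argument: the scheduler handles image distribution and placement, which is exactly the “HPC-native” integration the correct answer describes.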
Question 28 of 60
28. Question
During the physical validation phase of an AI factory deployment involving multiple NVIDIA DGX nodes, an administrator observes that several links are failing to negotiate at the expected 400Gbps speed despite using Twinax copper cables. The design utilizes a Fat-Tree topology. Which physical layer check should be prioritized to validate that the cable types and transceivers are sufficient for the required East-West traffic bandwidth?
Correct: C Checking the cable length against DAC maximum reach specifications.
The Technical Reason: In the latest generation of NVIDIA networking (Quantum-2 / NDR 400G), Direct Attach Copper (DAC) cables have very strict physical limitations due to the high-frequency 100G PAM4 signaling.
Passive DAC Reach: Passive copper cables (DACs) are typically limited to a maximum length of 3 meters for 400Gbps operation.
Signal Degradation: If an administrator attempts to use a cable longer than the certified length (e.g., trying to span too many racks in a Fat-Tree topology with a 5m passive cable), the signal-to-noise ratio drops below the threshold required for 400Gbps. This results in the link either failing to “train“ entirely or “downshifting“ to a lower speed (like 200Gbps or 100Gbps) to maintain stability.
The NCP-AII Context: The exam validates your ability to “Describe and validate cable types and transceivers.“ For distances beyond 3 meters, NVIDIA best practices require switching to Active Copper Cables (ACC) (up to 5m) or Active Optical Cables (AOC) / Transceivers (up to 50m+) to maintain the 400Gbps line rate.
Incorrect Options: A. Verifying OSFP to QSFP adapter compatibility. While adapters exist (e.g., to connect an older HDR switch to an NDR node), they are usually intended for downward compatibility (200G/100G). If the design is intended to be a native 400Gbps NDR fabric, adapters are generally avoided in the “East-West“ compute path because they introduce additional insertion loss and complexity that could prevent 400G negotiation.
B. Confirming TPM is enabled in UEFI. The Trusted Platform Module (TPM) is a security component used for hardware-level encryption keys and “Measured Boot.“ It has no relationship with the network interface card‘s (NIC) physical layer link training or the InfiniBand/Ethernet cable‘s electrical negotiation.
D. Validating the BMC firmware version on the storage array. The Baseboard Management Controller (BMC) manages the server‘s health (fans, power, sensors). While a BMC might report a network error, its firmware version does not control the physical signaling speed of the high-speed data fabric. Furthermore, the issue described is between DGX nodes and the fabric, not specifically isolated to the storage array.
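The reach limits above can be folded into a quick pre-cabling sanity check. A sketch, using the 3m passive DAC / 5m ACC thresholds cited in the explanation (always confirm against the datasheet for your exact part numbers):

```shell
# Sketch: pick the cable class that can sustain 400G over a given run.
# Thresholds (3m passive DAC, 5m ACC) follow the guidance above; verify
# them against the vendor datasheet before ordering.
cable_for_length() {
  local meters=$1
  if [ "$meters" -le 3 ]; then
    echo "passive DAC"
  elif [ "$meters" -le 5 ]; then
    echo "active copper (ACC)"
  else
    echo "active optical (AOC) or transceiver"
  fi
}
cable_for_length 2   # typical intra-rack hop
cable_for_length 7   # cross-row hop in a Fat-Tree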
Question 29 of 60
29. Question
A cluster experiences intermittent network performance drops during large-scale NeMo burn-in tests. Troubleshooting reveals that several fan modules in the leaf switches have failed. How does fan failure in a network switch impact the performance of the AI cluster‘s compute fabric?
Correct: C Thermal throttling of the switch ASIC can lead to dropped packets and increased latency, which severely degrades the performance of NCCL collective operations.
The Technical Reason: The ASIC (Application-Specific Integrated Circuit) in an NVIDIA Quantum-2 switch is designed to operate within a tight temperature range to maintain nanosecond-level latency.
Throttling: If fan modules fail, the BMC (Baseboard Management Controller) inside the switch detects the rising temperature. To prevent permanent silicon damage, the ASIC will reduce its clock frequency (thermal throttling).
The “Jitter“ Effect: In a synchronous AI training environment, if one switch in the Fat-Tree topology throttles, it introduces jitter and latency. Because NCCL (NVIDIA Collective Communications Library) operations like All-Reduce are only as fast as the slowest path, a single hot switch can stall the entire cluster of thousands of GPUs.
Packet Loss: If the temperature continues to rise, the ASIC may fail to process buffers in time, leading to silent packet drops and expensive re-transmissions that collapse the RDMA throughput.
The NCP-AII Context: The exam validates your understanding of the “Data Plane“ integrity. For an AI Factory, “lossless“ means zero packet drops. Thermal instability is a primary cause of intermittent, hard-to-diagnose performance dips during high-stress workloads like NeMo.
Incorrect Options: A. Automatically increase the packet size. Switch hardware does not change the MTU (Maximum Transmission Unit) or packet size based on temperature. Packet sizes are defined by the software and NIC configuration (typically 4096 bytes for InfiniBand). Increasing packet size would actually increase the buffer pressure on a struggling, overheated ASIC.
B. Switch from InfiniBand to Ethernet mode. While some NVIDIA switches are “VPI“ (Virtual Protocol Interconnect) capable, the protocol mode is a static configuration set by the administrator in the firmware or OS (MLNX-OS/SONiC). A switch cannot dynamically flip its entire physical and link layer protocol mid-operation to save power; doing so would immediately crash every active connection in the fabric.
D. Fans are only for noise reduction. This is fundamentally incorrect for enterprise-grade hardware. The fans in an NDR switch are industrial-strength components required to dissipate hundreds of watts of heat. Without active cooling, an NDR switch ASIC would reach its thermal shutdown limit (Tjunction) and power off within minutes of being under load.
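Because thermal throttling is intermittent and hard to catch live, a common practice is to scrape switch sensor output on a schedule and alert before the throttle point. A minimal sketch that parses a captured sensor line; the field name `asic-temp` and the 95°C threshold are illustrative, not a Quantum-2 specification (real values come from the switch NOS or its Redfish/SNMP sensor tables):

```shell
# Sketch: flag switch ASIC overheating from a captured sensor line.
# Field name and threshold are hypothetical; substitute the sensor keys
# exposed by your switch OS.
check_asic_temp() {
  local line=$1 limit=$2 temp
  temp=$(printf '%s\n' "$line" | awk -F': ' '{print $2}')
  if [ "$temp" -gt "$limit" ]; then
    echo "ALERT: ASIC at ${temp}C (limit ${limit}C), expect throttling"
  else
    echo "OK: ASIC at ${temp}C"
  fi
}
check_asic_temp "asic-temp: 97" 95
check_asic_temp "asic-temp: 68" 95
```

Alerting a few degrees below the throttle threshold gives time to swap a failed fan module before NCCL collectives start stalling.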
Question 30 of 60
30. Question
When managing the physical layer of an AI cluster, why is it essential to correctly configure the BlueField-3 DPU for a rail-optimized network fabric? Choose the answer that best describes the relationship between the DPU and the compute performance during large-scale AI training tasks like All-Reduce operations.
Correct: C The DPU provides the high-speed interface for GPUDirect RDMA, allowing the GPUs to communicate across the network without involving the host CPU.
The Technical Reason: In large-scale AI training, All-Reduce operations involve massive data exchanges between GPUs across different nodes.
GPUDirect RDMA: The BlueField-3 DPU supports this technology, which allows the network hardware to access GPU memory (HBM) directly over the PCIe bus.
CPU Bypass: By bypassing the host CPU and system RAM, the DPU reduces latency by several microseconds and eliminates CPU overhead. This is critical because, at the scale of thousands of GPUs, even minor CPU-induced jitter can cause the entire training job to stall.
Rail-Optimization: In a rail-optimized design, each “rail“ (a specific GPU position across all nodes) is mapped to a specific DPU/NIC. This ensures that GPU 0 on Node A communicates with GPU 0 on Node B through a non-blocking, dedicated leaf-switch path.
The NCP-AII Context: The exam validates your understanding of the Data Plane. The DPU is the engine of the data plane, facilitating the high-speed movement of tensors during distributed training.
Incorrect Options: A. The DPU acts as a backup GPU. This is a common misconception. While the BlueField-3 DPU contains powerful Arm CPU cores and acceleration engines for networking, security, and storage, it does not contain Tensor Cores capable of performing high-performance floating-point AI training calculations (FP8/BF16). It cannot “take over“ for an HGX baseboard.
B. The DPU manages NVLink Switch signals and encryption. NVLink is the internal high-speed fabric inside the server (managed by NVSwitch chips). The DPU manages the External Fabric (InfiniBand or Ethernet). While DPUs can handle IPsec or TLS encryption for standard network traffic, the ultra-low-latency GPU-to-GPU traffic within a server via NVLink is handled by the NVSwitch hardware and is typically not encrypted via a TPM module, as the latency hit would be prohibitive for AI training.
D. The DPU flashes the BIOS for every Slurm job. Flashing a BIOS is a high-risk operation that takes several minutes and requires a system reboot. Performing this for every Slurm job would result in 0% cluster utilization. The DPU can assist in provisioning or booting a node (via PXE or SNAP), but it does not re-flash the hardware BIOS as part of a job lifecycle.
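The rail mapping described above is often made explicit to NCCL by listing the per-rail HCAs. A sketch of generating that list for an 8-GPU, 8-rail node; the `mlx5_N` device names are illustrative (verify yours with `ibstat` before relying on them):

```shell
# Sketch: build an NCCL_IB_HCA list so each GPU's traffic can use its
# adjacent DPU/NIC rail. Device names are illustrative; confirm with ibstat.
hca_list() {
  local n=$1 i=0 out=""
  while [ "$i" -lt "$n" ]; do
    out="${out:+${out},}mlx5_${i}"
    i=$((i + 1))
  done
  echo "$out"
}
NCCL_IB_HCA=$(hca_list 8)
export NCCL_IB_HCA
echo "$NCCL_IB_HCA"
```

`NCCL_IB_HCA` is a real NCCL environment variable that restricts which HCAs NCCL may use; NCCL then picks the PCIe-closest device per GPU, which is exactly the rail-optimized behavior the question describes.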
Question 31 of 60
31. Question
An administrator identifies a faulty BlueField DPU that is causing intermittent network drops. When replacing the card, which of the following is a critical post-replacement step to ensure the new DPU is correctly integrated into the AI cluster automated management framework?
Correct: C. Updating the DPU firmware and re-provisioning the DOCA runtime image.
This is correct because the NCP-AII certification blueprint explicitly requires candidates to know how to replace faulty cards as part of the Troubleshoot and Optimize domain, which includes identifying faulty cards and performing replacements. The certification documentation specifies “Confirm FW/SW on BlueField-3“ as a critical verification task. After physically replacing a BlueField DPU, it is essential to update its firmware and reinstall the DOCA runtime image (BFB bundle) to ensure it matches the cluster‘s software baseline and integrates correctly with the management framework. The official NVIDIA documentation emphasizes that before a BlueField DPU can function properly, the appropriate BFB image must be installed, and the process includes upgrading the NIC firmware as part of the installation. Furthermore, the documentation confirms that “NIC firmware update done“ is a standard part of the BFB installation process. Without this critical post-replacement step, the DPU would run outdated or mismatched firmware and software, preventing it from being properly discovered and managed by automated tools like Base Command Manager.
Incorrect: A. Painting the card bracket to match the color of the server chassis.
This is incorrect because physical appearance modifications like painting hardware components have no functional impact on DPU integration or performance. This action is purely cosmetic and unrelated to any verification or configuration step required for automated cluster management frameworks.
B. Manually assigning a public IPv4 address to the DPU internal port.
This is incorrect for two reasons. First, management networks in AI factories typically use private IP addressing for out-of-band management, not public IPv4 addresses. Second, while IP configuration is eventually needed for network access, it is not the critical post-replacement step that ensures the DPU is correctly integrated. The priority is updating firmware and provisioning the DOCA runtime so the DPU operates with the correct software baseline before network configuration occurs.
D. Disabling the NVSwitch fabric to prevent the DPU from seeing the GPUs.
This is incorrect because the NVSwitch fabric is a high-speed interconnect for GPU-to-GPU communication within HGX systems. Disabling it would severely impact AI workload performance and is unrelated to DPU replacement procedures. The DPU needs to properly interface with the system, not be isolated from GPUs. The certification documentation includes “verify NVLink™ Switch“ as a verification step, not disabling it.
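The re-provisioning step usually amounts to pushing the BFB bundle through the host's rshim interface. A dry-run sketch that only composes the command; the bundle path is hypothetical, `rshim0` assumes a single DPU in the host, and flag spelling should be checked against `bfb-install --help` for your DOCA release:

```shell
# Sketch: compose (do not execute here) the post-swap re-provisioning step.
# The .bfb path is a placeholder; rshim0 assumes one DPU in the chassis.
bfb_cmd() {
  echo "bfb-install --bfb $1 --rshim rshim0"
}
bfb_cmd /opt/nvidia/dpu-image.bfb
```

After the BFB push completes (and reports the NIC firmware update), a firmware query with the MFT tools (`flint -d <mst-device> q`; the device path varies per system) confirms the new card matches the cluster baseline before returning the node to the scheduler.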
Question 32 of 60
32. Question
A storage optimization task requires reducing the latency for small-file I/O during the data preprocessing phase of an AI pipeline. The cluster uses an NVIDIA Magnum IO GPUDirect Storage (GDS) capable environment. What is the primary benefit of enabling GDS in this scenario?
Correct: C. It enables a direct DMA path between the storage and the GPU memory, bypassing the CPU bounce buffer.
The Technical Reason: In a standard I/O path, data must first be copied from storage into a temporary "bounce buffer" in system (CPU) memory before being copied a second time into the GPU's VRAM.
Eliminating Bottlenecks: GDS uses direct memory access (DMA) and RDMA principles to create a direct path between the storage (local NVMe or remote NVMe-oF) and GPU memory.
Latency & CPU Relief: This removes the CPU bounce buffer from the data path. For small-file I/O or high-frequency data preprocessing, it drastically reduces latency by cutting out the "middleman" (the CPU) and lowers CPU utilization, which otherwise becomes a bottleneck as the GPU waits for the CPU to move data.
The NCP-AII Context: The exam validates your ability to optimize I/O. GDS is the primary solution for "starved GPUs," where compute is faster than data delivery. You are expected to know that GDS requires the nvidia-fs kernel driver and is verified using the gdscheck utility.
Incorrect Options: A. Increases maximum capacity of hard drives: GDS is a data-transfer technology, not a storage-capacity technology. It does not change the physical or logical size of the disks; it only changes how quickly and efficiently data can be moved from those disks into the GPU.
B. Allows the GPU to act as a primary network switch: This is a fundamental misunderstanding of GPU hardware. While GPUs have high-speed interconnects like NVLink, they do not function as network switches. Networking is handled by the InfiniBand/Ethernet switches and BlueField-3 DPUs.
D. Compresses data before it reaches the GPU: GDS does not perform data compression. While NVIDIA offers other technologies for acceleration (such as nvCOMP for GPU-accelerated compression), GDS is strictly focused on the physical pathing of data to minimize latency and CPU overhead.
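Since the explanation leans on verifying GDS readiness with gdscheck (typically run as `/usr/local/cuda/gds/tools/gdscheck -p`), here is a minimal sketch of a readiness check that parses its report. The sample output lines are illustrative, not an exact gdscheck transcript, and the helper is hypothetical.

```python
# Hedged sketch: decide whether a node looks GDS-ready from gdscheck-style
# output. Real gdscheck output is richer; we only look for an NVMe path
# reported as Supported and a loaded nvidia-fs driver.

def gds_ready(output: str) -> bool:
    lines = output.lower().splitlines()
    driver_ok = any("nvidia-fs" in l and "loaded" in l for l in lines)
    nvme_ok = any(
        l.strip().startswith("nvme") and ": supported" in l for l in lines
    )
    return driver_ok and nvme_ok

# Illustrative sample, shaped like (but not copied from) gdscheck output:
sample = """\
 NVMe               : Supported
 nvidia-fs driver   : loaded
"""
```

A node where the NVMe line reads "Unsupported" would fail the check, which is exactly the case where the bounce-buffer path described above remains in use.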
Question 33 of 60
33. Question
An administrator is deploying a cluster of NVIDIA OVX servers and needs to ensure that the hardware Root of Trust is established before installing the operating system. Which set of tasks correctly describes the process for initializing the Trusted Platform Module (TPM) and validating the firmware integrity across the HGX baseboard and the system BIOS during the OOB configuration phase?
Correct: A. Flash the firmware using the NVIDIA Flash Tool, enable the TPM in the BIOS, and then clear the TPM ownership via the BMC web interface or CLI.
The Technical Reason: The professional bring-up workflow for NVIDIA-certified systems follows a strict security sequence:
Firmware Baseline: Before initialization, all components (HGX baseboard, BIOS, and NICs) must be flashed to a known-good, secure version using the NVIDIA Flash Tool or through the BMC (Baseboard Management Controller).
Hardware Enablement: The Trusted Platform Module (TPM) is physically present but often disabled by default. It must be enabled in the system BIOS so the OS can use it for secure key storage and attestation.
Initialization: To "own" the security of the server, an administrator must clear the TPM ownership. This removes any factory-default or previous keys, allowing the current organization to initialize the hardware Root of Trust. This is typically done via the BMC so it can be managed remotely and securely.
The NCP-AII Context: The exam blueprint explicitly includes "Perform initial configuration of BMC, OOB, and TPM" and "Perform firmware upgrades (including on HGX™)". Option A reflects the standard administrative procedure for securing a new node before the OS installation begins.
Incorrect Options: B. Bypass TPM checks to accelerate bring-up: Bypassing security checks contradicts the fundamental requirement to "ensure that the hardware Root of Trust is established." While bypassing might save a few minutes during installation, it leaves the cluster vulnerable to firmware-level attacks and prevents the use of modern security features such as NVIDIA Magnum IO security or encrypted data-at-rest.
C. Enable Secure Boot only after OS installation: Secure Boot should be configured before or during the OS installation to ensure that only signed bootloaders and kernels are permitted to execute. If unsigned drivers are required for hardware initialization, they must be signed with the organization's key, or the infrastructure should be configured with a Machine Owner Key (MOK). Enabling it "after" installation is a reactive measure that does not secure the boot chain from the start.
D. Physically remove the GPU baseboard and use serial connection: NVIDIA OVX and HGX systems are designed for high-availability data centers; they do not require physical disassembly or jumper manipulation for standard security initialization. Modern OOB management (BMC) replaces the need for serial-only connections or physical board access for routine tasks such as resetting firmware or managing the TPM.
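Once the TPM has been enabled in the BIOS as described above, a quick OS-level sanity check is whether the Linux kernel has enumerated a TPM device. The sketch below uses the standard Linux device paths (`/dev/tpm0`, `/sys/class/tpm/tpm0`); the injectable `exists` parameter is a testing convenience of this sketch, not part of any NVIDIA tool.

```python
import os

# Minimal sketch: report whether the kernel exposes a TPM device node.
# Passing a custom exists() lets the logic be exercised without hardware.

def tpm_visible(exists=os.path.exists):
    """Return True if the kernel has enumerated a TPM device."""
    return exists("/dev/tpm0") or exists("/sys/class/tpm/tpm0")
```

If this returns False after the BIOS change, the usual suspects are the TPM still being disabled in firmware or the kernel lacking TPM 2.0 support, both of which must be resolved before ownership can be cleared and attestation used.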
Question 34 of 60
34. Question
A system administrator is configuring a cluster where specific nodes require high-throughput storage access. They decide to use BlueField DPUs to implement NVMe-over-Fabrics (NVMe-oF) storage acceleration. Which step is essential for configuring the BlueField network platform to support this offload capability?
Correct: B. Configure the DPU in 'Separated' or 'Embedded' mode and use DOCA drivers to expose virtual NVMe controllers to the host OS.
The Technical Reason: To implement storage acceleration like NVMe-over-Fabrics (NVMe-oF) using BlueField, the DPU must be properly partitioned and initialized:
Operating Modes: In Embedded mode (sometimes called Embedded Function mode, with Separated host mode as a related configuration), the DPU acts as an independent subsystem with its own Linux OS running on the ARM cores. This allows it to manage the storage stack independently of the host.
DOCA & SNAP: Using the NVIDIA DOCA framework (specifically the SNAP service, Software-defined Network Accelerated Processing), the DPU can virtualize physical storage located elsewhere on the network.
Host Presentation: The BlueField DPU presents these remote storage volumes to the host OS as local, hardware-level NVMe controllers via the PCIe bus. The host OS "sees" a local NVMe drive, while the DPU handles the network encapsulation and RDMA processing.
The NCP-AII Context: The certification validates your knowledge of the NVIDIA DOCA stack and the hardware-software integration required to move storage logic from the host CPU to the DPU hardware.
Incorrect Options: A. Install CUDA Toolkit on BlueField ARM cores: While BlueField DPUs have capable ARM cores, they do not contain Tensor Cores or high-performance GPUs, and CUDA is designed for NVIDIA GPUs. Storage offloading on the DPU is handled by the DOCA storage libraries and dedicated hardware acceleration engines (such as the NVMe-oF offload engine), not by CUDA-based GPU processing.
C. Connect DPU to BMC via serial cable: The DPU communicates with the host and management plane primarily over the PCIe bus and NC-SI (Network Controller Sideband Interface). There is no requirement for a physical serial cable between the DPU and the BMC to manage the NVMe Flash Translation Layer (FTL); the FTL is typically managed by the storage controller or the DPU's internal firmware.
D. Disable the SNAP service on the DPU: This is the opposite of what is required. NVIDIA SNAP is the DOCA service that enables the DPU to emulate an NVMe device to the host. Disabling SNAP would prevent the DPU from presenting virtualized storage, forcing the host CPU to handle the network storage stack itself, which defeats the purpose of the DPU offload.
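Because the host "sees" a SNAP-emulated volume as an ordinary PCIe NVMe controller, one simple verification from the host side is counting NVMe controller entries in `lspci` output. The sketch and sample text below are illustrative (the device IDs are hypothetical); the PCIe class string "Non-Volatile memory controller" is the real lspci label for NVMe devices.

```python
# Hedged sketch: count NVMe controllers visible to the host, as one would
# after enabling SNAP emulation on the DPU. Input is lspci-style text.

def count_nvme_controllers(lspci_output: str) -> int:
    return sum(
        1 for line in lspci_output.splitlines()
        if "non-volatile memory controller" in line.lower()
    )

# Illustrative sample with made-up bus addresses and device names:
sample = """\
21:00.0 Ethernet controller: Mellanox Technologies BlueField-3 integrated ConnectX-7
22:00.0 Non-Volatile memory controller: Mellanox Technologies Device a2dc
"""
```

A count that rises after SNAP provisioning (with no physical drive added) is the expected signature of the DPU presenting remote storage as local NVMe.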
Question 35 of 60
35. Question
An administrator identifies a faulty ConnectX-7 network card in a production node. What is the correct procedure for replacing the card and returning the node to service in an NVIDIA-managed environment?
Correct: C. Drain the node in Slurm, power down the system, replace the card, verify the firmware version matches the cluster standard, and then resume the node.
The Technical Reason: This follows the standard operational lifecycle for an NVIDIA-certified cluster:
Orchestration (Drain): Before physical maintenance, you must "drain" the node in the workload manager (Slurm). This prevents new jobs from starting on the node while allowing current jobs to finish, ensuring no user work is lost.
Safety (Power Down): ConnectX-7 cards in HGX/OVX systems are typically not designed for live hot-swapping; the system must be powered down to prevent electrical damage to the PCIe bus.
Consistency (Firmware): This is a critical NCP-AII concept. Every node in a cluster should run the same "golden" firmware stack. If the replacement card has a different firmware version than the rest of the cluster, it can cause NCCL errors or performance jitter.
Restoration (Resume): Once verified, the node is "resumed" in Slurm, making it available for workloads again.
The NCP-AII Context: The certification emphasizes using Base Command Manager (BCM) and Slurm together. You are expected to know how to move a node through different states (Down, Drain, Resume) to maintain cluster health.
Incorrect Options: A. Delete from BCM database and reinstall OS: Deleting the node from the database and reinstalling the OS is a "nuclear option" that is unnecessary for a simple NIC replacement. Base Command Manager (BCM) is designed to handle hardware changes (such as MAC address updates) dynamically; reinstalling the entire OS would waste hours of deployment time and is not a standard maintenance procedure.
B. Hot-swap the card and 'git commit': As noted, ConnectX-7 cards in high-density AI servers are generally not hot-swappable. Furthermore, while some infrastructure-as-code workflows use Git, the standard tool for managing the hardware manifest in an NVIDIA environment is the BCM CMDaemon, not a manual git commit by the administrator to a manifest file.
D. Manually edit CUDA headers: CUDA headers (the .h files used for compiling code) have no relationship to physical hardware identifiers such as MAC addresses. Hardware identification is handled by the OS kernel and the NVIDIA drivers; editing source headers would not affect how the system recognizes a new network card.
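The drain → replace → verify → resume lifecycle above can be sketched as an ordered command plan. The `scontrol update ... State=DRAIN/RESUME` syntax is real Slurm usage and `mlxfwmanager --query` is a real firmware query; the node name, reason string, and the helper itself are illustrative.

```python
# Hypothetical sketch of the NIC-replacement lifecycle as a command plan.
# Returning the commands (rather than running them) keeps the invariant
# sequence explicit: drain, swap, verify firmware, resume.

def nic_replacement_plan(node: str):
    return [
        # 1. Stop new jobs landing on the node; running jobs finish.
        f'scontrol update NodeName={node} State=DRAIN Reason="NIC replacement"',
        # 2. (Power down, swap the ConnectX-7 card, power back up.)
        # 3. Verify the replacement card's firmware matches the baseline.
        "mlxfwmanager --query",
        # 4. Return the node to service.
        f"scontrol update NodeName={node} State=RESUME",
    ]

plan = nic_replacement_plan("dgx-07")
```

Note that the drain step must come first: resuming a node whose firmware was never verified is exactly the mismatch scenario that causes NCCL errors and performance jitter.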
Question 36 of 60
36. Question
When troubleshooting storage performance for an AI factory, an administrator notices that the GPU utilization is low during training, and the ‘iowait‘ metric on the compute nodes is high. What is the most effective optimization to resolve this storage bottleneck?
Correct: D. Implement NVIDIA GPUDirect Storage (GDS) to enable a direct data path between the storage and the GPU memory, bypassing the CPU.
This is the most effective optimization because low GPU utilization combined with high iowait indicates a storage bottleneck in which the CPU is saturated with data-movement work. GPUDirect Storage (GDS) addresses this directly by enabling direct memory access (DMA) transfers between GPU memory and storage, avoiding the bounce buffer in CPU memory. By taking the CPU out of the data path, GDS increases system bandwidth and decreases latency and CPU utilization, allowing the GPUs to receive data faster and stay busy with computation rather than waiting on I/O operations.
Incorrect: A. Reduce the resolution of the training images so that the storage system has less data to read from the disks during each epoch.
This is incorrect because reducing image resolution is a data preprocessing change that could negatively impact model accuracy. It does not address the underlying architectural inefficiency of data moving through CPU bounce buffers. The goal should be to optimize the data path while preserving data quality.
B. Add more GPUs to each node to increase the total amount of compute power available to process the slow-moving data.
This is incorrect because adding more GPUs would not resolve the storage bottleneck—it would likely worsen the problem by creating more GPU compute capacity that remains underutilized while waiting for data. The issue is I/O limited, not compute limited.
C. Change the training algorithm from a parallel approach to a sequential approach to reduce the number of simultaneous read requests.
This is incorrect because changing to sequential processing would reduce overall training throughput and increase training time. While it might reduce the instantaneous I/O load, it would make the training process significantly less efficient and does not solve the root cause of CPU-mediated data movement overhead.
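The diagnosis above can be sketched in code. The following hedged Python helper is not part of any NVIDIA tool; it simply computes the fraction of CPU time spent in iowait between two samples of the aggregate "cpu" line from /proc/stat (field order per the Linux proc(5) man page), which is how an administrator might confirm the bottleneck before reaching for GDS:

```python
# Hedged sketch: quantify the iowait symptom from two aggregate "cpu"
# lines of /proc/stat. Field order (user nice system idle iowait ...)
# follows proc(5); index 4 after the "cpu" label is iowait.

def iowait_share(sample_before, sample_after):
    """Fraction of elapsed CPU time spent in iowait between two samples."""
    def fields(line):
        return [int(x) for x in line.split()[1:]]
    before, after = fields(sample_before), fields(sample_after)
    deltas = [a - b for b, a in zip(before, after)]
    total = sum(deltas) or 1          # avoid division by zero
    return deltas[4] / total

# Example with made-up counter values: iowait dominates the interval.
s1 = "cpu  100 0 50 800 50 0 0 0 0 0"
s2 = "cpu  150 0 70 900 330 0 0 0 0 0"
```

A large iowait share while the GPUs sit idle is the classic signature that a direct storage-to-GPU path like GDS is meant to remove.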
Question 37 of 60
37. Question
A site reliability engineer is performing a burn-in test on a new cluster using the NeMo training framework. Why is a framework-specific burn-in test like NeMo preferred over simple synthetic benchmarks during the final stage of cluster verification?
Correct
Correct: D. It validates that the entire software stack and fabric can handle real-world AI model training patterns.
This is correct because framework-specific burn-in tests like NeMo validate the entire integrated system under realistic conditions. The NCP-AII exam blueprint explicitly includes “Perform NeMo™ burn-in“ as a key task within the Cluster Test and Verification domain, which comprises 33% of the examination. This domain focuses on comprehensive validation including fabric bandwidth verification, storage testing, and multi-faceted node assessment. Unlike synthetic benchmarks that only test specific components, a NeMo burn-in exercises the complete software stack, storage fabric, and network infrastructure using patterns that mirror actual AI model training workloads, ensuring the cluster is production-ready.
Incorrect: A. It reduces the power consumption of the GPUs during the test
This is incorrect because burn-in tests are designed to stress the system under load, not reduce power consumption. The purpose of burn-in testing is to verify stability and performance under maximum expected workload conditions.
B. It automatically updates the firmware of the Mellanox switches
This is incorrect because firmware updates are separate maintenance activities performed during system bring-up, not functions of workload-based burn-in tests. The exam blueprint distinguishes between “Confirm FW/SW on switches“ as a verification task and running application-level burn-ins.
C. It is the only way to check if the power cables are plugged in
This is incorrect because physical cable verification is performed during initial system bring-up through visual inspection and signal quality validation. Power cable connectivity is a basic prerequisite that must be verified before any burn-in testing can occur.
Question 38 of 60
38. Question
A network engineer is configuring a BlueField-3 Data Processing Unit (DPU) to act as a secure offload engine for the AI cluster management plane. To ensure the DPU is correctly integrated into the fabric, which action must be taken to manage the DPU independently of the host CPU while providing networking services to the host?
Correct
Correct: D. Configure the DPU in Separated Mode where the DPU OS runs independently and manages its own network interfaces and security policies.
This is correct because Separated Mode (also known as Separated Host Mode or symmetric model) is specifically designed for scenarios where the DPU acts as an independent co-processor. In this mode, “a network function is assigned to both the Arm cores and the host cores,“ and “the ports/functions are symmetric“ with “no dependency between the two functions“. The DPU Arm system can operate “simultaneously or separately“ from the host, with its own MAC and IP addresses, enabling independent management of networking and security policies while still providing networking services to the host. This directly fulfills the requirement of having the DPU manage infrastructure independently while offloading these functions from the host CPU.
Incorrect: A. Disable the internal ARM cores of the BlueField DPU to allow the host operating system to take full control of the network hardware resources.
This is incorrect because disabling the ARM cores would effectively put the DPU into NIC Mode, where “the Arm cores of BlueField are inactive, and the device functions as an NVIDIA® ConnectX® network adapter“. This would defeat the purpose of using the DPU as a secure offload engine, as it would revert to traditional NIC functionality with no independent processing capability.
B. Install a standard Ethernet driver on the host and ignore the BlueField-specific management tools as they are only used for basic troubleshooting.
This is incorrect because BlueField-specific management tools and the DOCA software framework are essential for unlocking the DPU‘s full potential as a programmable infrastructure processor. The DPU requires its own operating system and management interfaces (such as oob_net0 or console access) to function independently from the host.
C. Set the DPU to Bridge Mode so that all traffic passes through the host CPU for inspection before being processed by the BlueField hardware acceleration.
This is incorrect because there is no operational mode called “Bridge Mode“ defined in the NVIDIA BlueField documentation. Additionally, forcing traffic through the host CPU would contradict the goal of offloading networking and security from the host. In DPU Mode, traffic initially flows through the Arm cores but can be offloaded to the embedded switch (fast path) for performance.
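As a rough illustration of how the mode switch is actually performed, the sketch below builds the mlxconfig invocation that selects the ownership model. The INTERNAL_CPU_MODEL values (0 = Separated Host, 1 = embedded DPU mode) are taken from our reading of NVIDIA's BlueField modes-of-operation documentation and should be verified for your firmware; the device path is a placeholder (list real ones with mst status), and a firmware reset is typically required afterward:

```python
# Illustrative only: construct the mlxconfig command that sets a
# BlueField's ownership model. INTERNAL_CPU_MODEL values assumed from
# NVIDIA's BlueField docs (verify for your firmware version); the
# device path passed in is a placeholder, not a real path.

MODES = {
    "separated": 0,   # Separated Host mode: Arm OS independent of host
    "embedded": 1,    # DPU (embedded) mode: Arm owns the NIC resources
}

def mlxconfig_cmd(device, mode):
    """Return the argv for `mlxconfig -d <dev> set INTERNAL_CPU_MODEL=<n>`."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["mlxconfig", "-d", device, "set",
            f"INTERNAL_CPU_MODEL={MODES[mode]}"]
```

The command is only constructed here, not executed; on a real host it would be run as root followed by a firmware reset so the new mode takes effect.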
Question 39 of 60
39. Question
An AI cluster is experiencing inconsistent performance on several compute nodes. Investigation reveals that these nodes are equipped with AMD CPUs and NVIDIA GPUs. Which optimization step should be performed to ensure the best performance for GPU-heavy AI workloads on these specific AMD-based servers?
Correct
Correct: D. Configure the Nodes Per Socket (NPS) setting in the BIOS to NPS1 or NPS4 and ensure IOMMU is correctly configured for GPUDirect RDMA.
The Technical Reason: AMD EPYC processors use a Multi-Chip Module (MCM) design where the CPU is split into multiple quadrants or “nodes.“
NPS (Nodes Per Socket): The NPS setting defines how the memory and PCIe controllers are partitioned.
NPS1 treats the entire socket as a single NUMA domain, which is often simpler for general workloads.
NPS4 partitions the socket into four domains, aligning each quadrant of CPU cores with its local memory and PCIe lanes. For high-density GPU servers (like an 8-GPU HGX system), NPS4 is often preferred to ensure that each GPU has the lowest latency path to its “local“ CPU cores and memory.
IOMMU & GPUDirect RDMA: To use GPUDirect RDMA (which allows a NIC to write directly to GPU memory), the IOMMU (Input-Output Memory Management Unit) must be enabled and correctly configured in the BIOS. Without proper IOMMU/ACS (Access Control Services) settings, the CPU may intercept peer-to-peer traffic, significantly degrading performance.
The NCP-AII Context: The certification specifically tests your ability to “Execute performance optimization for AMD and Intel servers.“ Knowing the specific BIOS requirements for AMD‘s NUMA topology (NPS) is a key differentiator for a professional-level administrator.
Incorrect Options: A. Disable InfiniBand and use 1GbE This is a severe regression. AI training requires the high bandwidth and low latency of InfiniBand or RoCE. Moving to 1GbE would starve the GPUs of data, causing them to sit idle and making LLM training practically impossible. The CPU interrupt load is managed by the NIC‘s hardware offloads, not by lowering the network speed.
B. Automatic Clock Boost and Powersave governor Setting the OS power governor to “Powersave“ intentionally throttles the CPU to save energy, which is the opposite of what is needed for a performance-heavy AI workload. For production AI servers, the governor should be set to “Performance“ to ensure the CPU can keep up with the GPU‘s data demands.
C. Replace drivers with open-source Nouveau drivers The Nouveau drivers are open-source reverse-engineered drivers that do not support modern NVIDIA features like CUDA, NVLink, MIG, or GPUDirect RDMA. Using Nouveau drivers would effectively disable all AI acceleration capabilities of the NVIDIA hardware.
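After choosing an NPS setting, one quick way to verify the resulting topology is to check which NUMA node the kernel reports for each GPU's PCI function. The helper below is hypothetical (not an NVIDIA tool), but the numa_node attribute it reads comes from the standard Linux PCI sysfs layout, so process and memory pinning can then be matched to GPU locality:

```python
# Illustrative check (path layout assumed from the Linux PCI sysfs ABI):
# each PCI device directory exposes a `numa_node` file; after an NPS
# change, each GPU's BDF should report the NUMA node you expect.

from pathlib import Path

def pci_numa_node(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node for a PCI function like '0000:41:00.0'.

    The kernel reports -1 when the platform exposes no locality info.
    """
    node_file = Path(sysfs_root) / bdf / "numa_node"
    return int(node_file.read_text().strip())
```

With NPS4 on an 8-GPU system, the GPUs would typically split across nodes 0-3 (two per node on a single-socket board), which is what affinity-aware launchers like numactl or Slurm's task binding should mirror.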
Question 40 of 60
40. Question
A cluster administrator is using NVIDIA Base Command Manager (BCM) to provision a new set of compute nodes. The administrator needs to define a category that includes the OS, GPU drivers, and the Slurm scheduler. Which BCM tool or interface is primarily used to manage these software images and associate them with specific hardware groups?
Correct
Correct: C. The BCM Cluster Manager GUI (Base View) or the cmsh command-line shell is used to define software images (categories) and assign nodes to them.
The Technical Reason: BCM uses a hierarchical management model:
Software Images: These are directories on the head node (found in /cm/images/) that contain the entire root filesystem for the compute nodes, including the OS (e.g., Ubuntu/Rocky), NVIDIA drivers, CUDA, and the Slurm client.
Node Categories: To manage at scale, administrators group nodes into “Categories.“ Instead of configuring each server individually, you assign a Software Image to a Category. Any node placed in that category will automatically provision that specific image upon boot.
Interfaces: cmsh (the Cluster Management Shell) is the CLI used for these tasks (e.g., softwareimage clone; category use compute; set softwareimage). The Base View GUI provides a visual alternative for the same operations.
The NCP-AII Context: The exam validates your ability to “synchronize software images across cluster nodes using BCM.“ This includes cloning an existing image, using chroot or cm-chroot-sw-image to install additional software (like Slurm), and committing those changes so they are available for node provisioning.
Incorrect: A. nvidia-smi topology and NVLink updates nvidia-smi is a local GPU management utility; it cannot manage OS images or provision network-wide software updates. Furthermore, NVLink is a high-speed data fabric for GPU-to-GPU communication; it is not used as a network transport for pushing operating system updates or PXE booting nodes.
B. Slurm sbatch for PXE booting sbatch is a Slurm command used by end-users to submit batch scripts to the job queue. It has no capability to trigger PXE booting or install operating systems. Provisioning is an “Infrastructure“ task (handled by BCM), while sbatch is a “Workload“ task.
D. Docker Desktop for OS containerization Docker Desktop is a local development tool for Windows/macOS and is not used in data center production environments. While containers are used for applications, BCM provisions the base operating system (the “host“) using disk images, not by deploying the entire OS as a container pod to bare metal.
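As a hedged sketch of the workflow, an administrator might drive cmsh non-interactively from the head node. The command names follow the cmsh examples quoted in the explanation above; the assumption that cmsh -c accepts a semicolon-separated command string, and the image and category names, are illustrative and should be checked against the BCM manual:

```python
# Hedged sketch: assemble (without executing) a non-interactive cmsh
# invocation that clones an image and points a node category at it.
# Image/category names are invented for illustration.

def cmsh_script(commands):
    """Join cmsh mode/command steps into a single `cmsh -c` argv."""
    return ["cmsh", "-c", "; ".join(commands)]

steps = [
    "softwareimage",
    "clone default-image slurm-gpu-image",  # hypothetical image names
    "commit",
    "category use compute",                 # hypothetical category name
    "set softwareimage slurm-gpu-image",
    "commit",
]
argv = cmsh_script(steps)
```

In practice the cloned image would first be customized (e.g., with cm-chroot-sw-image to install the Slurm client) before nodes in the category reprovision from it.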
Incorrect
Correct: C The BCM Cluster Manager GUI (Base View) or the cmsh command-line shell is used to define software images (categories) and assign nodes to them. • The Technical Reason: BCM uses a hierarchical management model: ? Software Images: These are directories on the head node (found in /cm/images/) that contain the entire root filesystem for the compute nodes, including the OS (e.g., Ubuntu/Rocky), NVIDIA drivers, CUDA, and the Slurm client. ? Node Categories: To manage at scale, administrators group nodes into “Categories.“ Instead of configuring each server individually, you assign a Software Image to a Category. Any node placed in that category will automatically provision that specific image upon boot. ? Interfaces: The cmsh (Cluster Management Shell) is the powerful CLI used for these tasks (e.g., softwareimage clone, category use compute; set softwareimage ). The Base View GUI provides a visual alternative for the same operations. • The NCP-AII Context: The exam validates your ability to “synchronize software images across cluster nodes using BCM.“ This includes cloning an existing image, using chroot or cm-chroot-sw-image to install additional software (like Slurm), and committing those changes so they are available for node provisioning.
Incorrect: A. nvidia-smi topology and NVLink updates nvidia-smi is a local GPU management utility; it cannot manage OS images or provision network-wide software updates. Furthermore, NVLink is a high-speed data fabric for GPU-to-GPU communication; it is not used as a network transport for pushing operating system updates or PXE booting nodes.
B. Slurm sbatch for PXE booting sbatch is a Slurm command used by end-users to submit batch scripts to the job queue. It has no capability to trigger PXE booting or install operating systems. Provisioning is an “Infrastructure“ task (handled by BCM), while sbatch is a “Workload“ task.
D. Docker Desktop for OS containerization Docker Desktop is a local development tool for Windows/macOS and is not used in data center production environments. While containers are used for applications, BCM provisions the base operating system (the “host“) using disk images, not by deploying the entire OS as a container pod to bare metal.
Unattempted
Correct: C The BCM Cluster Manager GUI (Base View) or the cmsh command-line shell is used to define software images (categories) and assign nodes to them. • The Technical Reason: BCM uses a hierarchical management model: ? Software Images: These are directories on the head node (found in /cm/images/) that contain the entire root filesystem for the compute nodes, including the OS (e.g., Ubuntu/Rocky), NVIDIA drivers, CUDA, and the Slurm client. ? Node Categories: To manage at scale, administrators group nodes into “Categories.“ Instead of configuring each server individually, you assign a Software Image to a Category. Any node placed in that category will automatically provision that specific image upon boot. ? Interfaces: The cmsh (Cluster Management Shell) is the powerful CLI used for these tasks (e.g., softwareimage clone, category use compute; set softwareimage ). The Base View GUI provides a visual alternative for the same operations. • The NCP-AII Context: The exam validates your ability to “synchronize software images across cluster nodes using BCM.“ This includes cloning an existing image, using chroot or cm-chroot-sw-image to install additional software (like Slurm), and committing those changes so they are available for node provisioning.
Question 41 of 60
41. Question
In a rail-optimized AI factory network topology designed for NVIDIA HGX systems, why is it critical to ensure that each GPU in a server is connected to a specific leaf switch in a manner that mirrors the internal NVLink topology? Choose the option that best describes the importance of this specific physical layer validation step for AI workloads.
Correct: D. To maximize the efficiency of Collective Operations by reducing network hops.
This is correct because in a rail-optimized network topology designed for NVIDIA HGX systems, the physical layer cabling must mirror the internal NVLink topology to enable efficient collective operations such as all-reduce and all-gather. This design allows each GPU to communicate with another GPU on the same rail through only one hop via a leaf switch. When combined with NVLink's high-speed intra-node fabric (900 GB/s on H100 systems), NCCL can leverage features like PXN (PCI × NVLink), which let a GPU communicate with a NIC on the node via NVLink and then PCIe, bypassing CPU bottlenecks. This rail-optimized architecture significantly improves all-to-all network performance for large messages and reduces latency for small messages, directly impacting the efficiency of the collective operations critical for distributed AI training.
Incorrect: A. To allow the Base Command Manager to monitor the fans more accurately.
This is incorrect because Base Command Manager monitoring of fans and other node health metrics is handled through the Out-of-Band (OOB) management network connected to Baseboard Management Controllers (BMCs), not through the high-speed compute fabric. The OOB network typically operates at low speeds (1 Gbps) and is physically isolated from the GPU compute fabric.
B. To ensure that management traffic does not interfere with the storage traffic.
This is incorrect because the rail-optimized network is part of the compute fabric (East/West traffic) designed for GPU-to-GPU communication. Management traffic and storage traffic are typically separated onto different physical networks (the in-band management fabric and the storage fabric, respectively) to prevent interference. The physical topology of the compute fabric does not primarily address management/storage traffic separation.
C. To increase the cooling efficiency of the high-speed optical transceivers.
This is incorrect because the rail-optimized cabling pattern is designed for communication performance, not thermal management. Cooling efficiency for optical transceivers depends on factors such as airflow design, liquid cooling implementation, and rack-level thermal management, not on whether GPUs are connected to specific leaf switches in a rail-optimized pattern.
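The one-hop property of a rail-optimized fabric can be sketched in Python. The topology model here is a deliberate simplification (one leaf switch per rail, a single spine tier), not an exact HGX wiring diagram:

```python
def hops(node_a, rail_a, node_b, rail_b):
    """Network hop count between two GPUs in a simplified
    rail-optimized fabric.

    Same node: 0 network hops (NVLink carries intra-node traffic).
    Same rail, different nodes: 1 hop through the shared leaf switch.
    Different rails: 3 hops (leaf -> spine -> leaf), unless the traffic
    is first moved onto the right rail over NVLink, which is roughly
    what NCCL's PXN feature does.
    """
    if node_a == node_b:
        return 0
    if rail_a == rail_b:
        return 1
    return 3

# GPU 3 on node 0 talking to GPU 3 on node 7: same rail, one leaf hop.
print(hops(0, 3, 7, 3))  # 1
# Cross-rail traffic without PXN pays the spine traversal.
print(hops(0, 3, 7, 5))  # 3
```

This is why cabling that mirrors the NVLink topology matters: it keeps the common collective-communication patterns on the cheap one-hop path.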
Question 42 of 60
42. Question
An architect is designing a multi-tenant AI environment where resources must be strictly isolated. They decide to use Multi-Instance GPU (MIG) on NVIDIA H100 GPUs. Which of the following statements correctly describes the configuration of MIG for managing workloads and the role of the BlueField-3 DPU in this physical layer management context?
Correct: D. MIG allows a single GPU to be partitioned into multiple hardware-isolated instances; the BlueField-3 DPU can manage the networked storage and security policies for these instances.
This is correct because Multi-Instance GPU (MIG) technology enables partitioning a single NVIDIA GPU into multiple isolated instances, each with dedicated hardware resources, which is essential for multi-tenant environments requiring strict workload isolation. The NCP-AII certification explicitly includes “MIG enablement and management” as a core topic within the Physical Layer Management domain. BlueField-3 DPUs complement this by managing networked storage and security policies, as the certification covers “integrat[ing] DPUs with DOCA for advanced encryption and network isolation.” The combination allows the DPU to handle infrastructure services independently while MIG provides GPU-level partitioning.
Incorrect: A. MIG is only used for cooling management by reducing the clock speed of individual GPU cores, while the BlueField-3 platform performs the HPL stress tests.
This is incorrect because MIG is a GPU partitioning technology for workload isolation, not a cooling management feature. MIG provides hardware-isolated GPU instances with dedicated compute and memory resources, not clock speed reduction. HPL stress tests are cluster verification tools, not primary DPU functions.
B. MIG configuration requires the DOCA drivers to be installed on the BMC to allow the BlueField-3 to partition the GPU memory into virtual LUNs for storage.
This is incorrect because MIG is configured at the GPU level through NVIDIA drivers, not through DOCA drivers on the BMC. MIG partitions GPU compute and memory resources, not storage LUNs. BlueField-3 with DOCA handles storage acceleration and network security, but does not partition GPU memory.
C. MIG enables multiple physical GPUs to be combined into one virtual instance, while the BlueField-3 DPU handles the NVLink switching between these virtual units.
This is incorrect because MIG partitions a single physical GPU into multiple instances; it does not combine multiple physical GPUs. Combining multiple GPUs is achieved through technologies like NVLink and NVSwitch, not MIG. NVLink switching is handled by the dedicated NVSwitch fabric, not BlueField-3 DPUs.
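The slice-budget aspect of MIG partitioning can be sketched in Python. The profile names and slice counts below follow the commonly documented H100 80GB scheme, but treat them as illustrative; `nvidia-smi mig -lgip` on real hardware gives the authoritative list, and real MIG additionally enforces placement constraints this sketch ignores:

```python
# Toy validator: does a requested set of MIG profiles fit within one
# GPU's compute-slice budget? Slice counts are illustrative H100 values.
SLICES = {"1g.10gb": 1, "2g.20gb": 2, "3g.40gb": 3, "4g.40gb": 4, "7g.80gb": 7}
TOTAL_SLICES = 7  # compute slices available on one GPU

def fits(requested):
    """Return True if the requested profiles fit on a single GPU."""
    return sum(SLICES[p] for p in requested) <= TOTAL_SLICES

print(fits(["3g.40gb", "2g.20gb", "2g.20gb"]))  # True  (3 + 2 + 2 = 7)
print(fits(["4g.40gb", "4g.40gb"]))             # False (8 > 7)
```

The point the question tests survives the simplification: MIG subdivides one GPU's fixed resources; it never aggregates resources across GPUs.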
Question 43 of 60
43. Question
An AI cluster is experiencing lower-than-expected performance on an AMD-based server platform. The administrator suspects that the I/O topology and NUMA (Non-Uniform Memory Access) settings are not optimized for the installed NVIDIA GPUs. Which optimization step is most likely to improve the data transfer rates between the host memory and the GPUs?
Correct: C. Ensuring that the BIOS ‘NPS’ (Nodes Per Socket) setting is configured correctly to align GPU PCIe lanes with the closest CPU cores and memory channels.
This is correct because optimizing NUMA (Non-Uniform Memory Access) settings is critical for data transfer rates between host memory and GPUs on AMD-based server platforms. The NCP-AII certification explicitly includes “Execute performance optimization for AMD and Intel servers” as a key task within the Troubleshoot and Optimize domain, which comprises 12% of the examination. The BIOS ‘NPS’ (Nodes Per Socket) setting directly controls how memory is mapped relative to CPU cores and PCIe devices. Configuring NPS correctly ensures that GPU PCIe lanes are aligned with the closest CPU cores and memory channels, minimizing NUMA distance and reducing latency for data transfers between host memory and GPUs. This optimization is essential for AMD-based platforms where improper NUMA configuration can create significant performance bottlenecks.
Incorrect: A. Enabling ‘Auto-NUMA’ in the Linux kernel to let the OS dynamically move memory pages between different physical CPU sockets.
This is incorrect because while Auto-NUMA can help with memory placement optimization, it is a kernel-level feature that attempts to automatically place memory pages near the accessing CPU. However, it does not address the fundamental hardware-level NUMA configuration required for optimal GPU-to-host memory transfers. The priority is configuring BIOS-level NUMA settings like NPS to establish proper topology, after which OS-level optimizations can be applied. Dynamic page migration adds overhead and is not a substitute for correct physical alignment of GPU PCIe lanes with CPU cores.
B. Moving all GPUs to a single PCIe riser to ensure they all share the same NUMA node, regardless of the CPU’s architectural limits.
This is incorrect because physically relocating GPUs to a single PCIe riser would likely worsen performance rather than improve it. Modern AMD server platforms have multiple NUMA nodes designed to distribute PCIe lanes across sockets. Concentrating all GPUs in one NUMA node would create imbalance, saturate that node’s memory bandwidth, and force remote memory access from the other CPU socket. The correct approach is to distribute GPUs across NUMA nodes according to platform architecture, not consolidate them.
D. Disabling all PCIe Gen4 support and forcing the system to run at Gen2 speeds to reduce the electrical complexity of the NUMA fabric.
This is incorrect because reducing PCIe generation speeds would dramatically decrease bandwidth between GPUs and host memory, directly counter to the goal of improving data transfer rates. For example, PCIe Gen4 provides 16 GT/s per lane compared to Gen2’s 5 GT/s; disabling Gen4 support would reduce available bandwidth by roughly 70%. This approach would severely bottleneck GPU data transfers rather than optimize them. The issue is NUMA configuration, not PCIe electrical complexity.
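The alignment check described above can be sketched as a small Python audit. On Linux, a PCIe device's NUMA node is exposed at /sys/bus/pci/devices/&lt;BDF&gt;/numa_node (-1 means unknown); the GPU-to-node mapping below is hypothetical sample data standing in for values read from sysfs:

```python
# Flag NUMA misalignment: each GPU's PCIe device should sit on the same
# NUMA node as the CPU cores/memory of the worker process feeding it.
# The dictionaries below are invented sample data for illustration.

def misaligned(gpu_numa, worker_numa):
    """Return the GPUs whose NUMA node differs from their worker's node."""
    return sorted(g for g, node in gpu_numa.items()
                  if node != worker_numa.get(g))

gpu_numa = {"gpu0": 0, "gpu1": 0, "gpu2": 1, "gpu3": 1}     # from sysfs
worker_numa = {"gpu0": 0, "gpu1": 1, "gpu2": 1, "gpu3": 1}  # gpu1 pinned wrong
print(misaligned(gpu_numa, worker_numa))  # ['gpu1']
```

A flagged GPU in a real audit would point at either a wrong NPS setting or a worker pinned to the far socket with numactl/taskset.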
Question 44 of 60
44. Question
In a scenario where an AI cluster is being scaled to hundreds of nodes, the physical layer management of the BlueField DPUs becomes complex. What is the most efficient method for managing the configuration and deployment of these DPUs at scale to ensure consistency across the entire AI factory fabric?
Correct: B. Utilize an orchestration tool or NVIDIA Base Command Manager (BCM) to push standardized configuration profiles and firmware updates to all DPUs simultaneously.
The Technical Reason: BlueField DPUs are “computers-on-a-card” with their own operating systems and complex firmware stacks.
BCM Integration: NVIDIA Base Command Manager (BCM) provides a centralized control plane that treats DPUs as managed objects. It can automate the deployment of BFB (BlueField Bundle) images, which contain the OS, drivers, and firmware.
Consistency: By using Category profiles in BCM, an administrator ensures that every DPU in a specific rack or cluster is running identical software versions (the “Golden Stack”), preventing performance jitter in collective communications like NCCL.
DOCA Deployment: Modern scaling also leverages the DOCA Platform Framework (DPF) to deploy containerized services (such as storage or security offloads) across the entire fabric using a single command or API call.
The NCP-AII Context: The exam blueprint for Control Plane Installation and Configuration (19%) specifically highlights the use of BCM to “Install/update/remove NVIDIA GPU and DOCA™ drivers” and ensure firmware alignment across the fabric.
Incorrect Options: A. Manually log into each node via SSH. While SSH is useful for individual troubleshooting, it is practically impossible to maintain consistency across hundreds of nodes manually. Human error (a skipped step, a mistyped command) is inevitable at this scale, leading to “configuration drift,” the primary cause of unpredictable fabric failures in large clusters.
C. Rely on DHCP and default factory settings Default factory settings are almost never optimized for a high-performance AI Factory. High-density workloads require specific MTU settings, Quality of Service (QoS) for RoCE, and specialized firmware versions for NVMe-oF. Furthermore, relying purely on DHCP without an orchestration layer leaves the DPUs unmanaged and difficult to update.
D. Use DOCA Telemetry and manually adjust DOCA Telemetry Service (DTS) is a powerful monitoring tool used for observability and troubleshooting (e.g., identifying a congested link). However, it is not a management tool. Detecting a performance deviation is only half the battle; manual adjustment is still inefficient at scale compared to an automated orchestration tool that can remediate issues across the entire cluster.
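The consistency check an orchestration layer automates can be sketched as drift detection against a “golden” profile. The version strings, MTU value, and node inventory below are invented sample data, not real BFB or firmware releases:

```python
# Drift detection against a "golden" DPU profile: the kind of check an
# orchestration layer runs fleet-wide before remediating mismatches.
# All versions and node names below are invented sample data.
GOLDEN = {"bfb": "2.7.0", "firmware": "32.41.1000", "mtu": 9000}

def drifted(inventory):
    """Return node -> list of keys deviating from the golden profile."""
    report = {}
    for node, cfg in inventory.items():
        bad = [k for k, v in GOLDEN.items() if cfg.get(k) != v]
        if bad:
            report[node] = bad
    return report

inventory = {
    "dpu-001": {"bfb": "2.7.0", "firmware": "32.41.1000", "mtu": 9000},
    "dpu-002": {"bfb": "2.5.1", "firmware": "32.41.1000", "mtu": 1500},
}
print(drifted(inventory))  # {'dpu-002': ['bfb', 'mtu']}
```

The contrast with option D is visible here: telemetry alone would only surface the symptom (jitter on dpu-002); the orchestration layer both detects the drift and pushes the golden profile back.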
Incorrect
Correct: B Utilize an orchestration tool or the NVIDIA Base Command Manager (BCM) to push standardized configuration profiles and firmware updates to all DPUs simultaneously.
The Technical Reason: BlueField DPUs are “computers-on-a-card“ with their own operating systems and complex firmware stacks.
BCM Integration: NVIDIA Base Command Manager (BCM) provides a centralized control plane that treats DPUs as managed objects. It can automate the deployment of BFB (BlueField Bundle) images, which contain the OS, drivers, and firmware.
Consistency: By using Category profiles in BCM, an administrator ensures that every DPU in a specific rack or cluster is running identical software versions (the “Golden Stack“), preventing performance jitter in collective communications like NCCL.
DOCA Deployment: Modern scaling also leverages the DOCA Platform Framework (DPF) to deploy containerized services (like storage or security offloads) across the entire fabric using a single command or API call.
The NCP-AII Context: The exam blueprint for Control Plane Installation and Configuration (19%) specifically highlights the use of BCM to “Install/update/remove NVIDIA GPU and DOCA™ drivers“ and ensure firmware alignment across the fabric.
Incorrect Options: A. Manually log into each node via SSH While SSH is useful for individual troubleshooting, it is physically impossible to maintain consistency across hundreds of nodes manually. Human error (skipping a step, a mistyped command) is inevitable at this scale, which leads to “configuration drift“—the primary cause of unpredictable fabric failures in large clusters.
C. Rely on DHCP and default factory settings Default factory settings are almost never optimized for a high-performance AI Factory. High-density workloads require specific MTU settings, Quality of Service (QoS) for RoCE, and specialized firmware versions for NVMe-oF. Furthermore, relying purely on DHCP without an orchestration layer leaves the DPUs unmanaged and difficult to update.
D. Use DOCA Telemetry and manually adjust DOCA Telemetry Service (DTS) is a powerful monitoring tool used for observability and troubleshooting (e.g., identifying a congested link). However, it is not a management tool. Detecting a performance deviation is only half the battle; manual adjustment is still inefficient at scale compared to an automated orchestration tool that can remediate issues across the entire cluster.
Unattempted
Correct: B. Utilize an orchestration tool or the NVIDIA Base Command Manager (BCM) to push standardized configuration profiles and firmware updates to all DPUs simultaneously.
The Technical Reason: BlueField DPUs are "computers-on-a-card" with their own operating systems and complex firmware stacks.
BCM Integration: NVIDIA Base Command Manager (BCM) provides a centralized control plane that treats DPUs as managed objects. It can automate the deployment of BFB (BlueField Bundle) images, which contain the OS, drivers, and firmware.
Consistency: By using category profiles in BCM, an administrator ensures that every DPU in a specific rack or cluster runs identical software versions (the "Golden Stack"), preventing performance jitter in collective communications such as NCCL.
DOCA Deployment: Modern scaling also leverages the DOCA Platform Framework (DPF) to deploy containerized services (such as storage or security offloads) across the entire fabric with a single command or API call.
The NCP-AII Context: The exam blueprint for Control Plane Installation and Configuration (19%) specifically highlights the use of BCM to "Install/update/remove NVIDIA GPU and DOCA™ drivers" and to ensure firmware alignment across the fabric.
Incorrect Options: A. Manually log into each node via SSH. While SSH is useful for individual troubleshooting, it is practically impossible to maintain consistency across hundreds of nodes manually. Human error (a skipped step, a mistyped command) is inevitable at this scale and leads to "configuration drift", a primary cause of unpredictable fabric failures in large clusters.
C. Rely on DHCP and default factory settings. Default factory settings are almost never optimized for a high-performance AI Factory. High-density workloads require specific MTU settings, Quality of Service (QoS) for RoCE, and specialized firmware versions for NVMe-oF. Furthermore, relying purely on DHCP without an orchestration layer leaves the DPUs unmanaged and difficult to update.
D. Use DOCA Telemetry and manually adjust. DOCA Telemetry Service (DTS) is a powerful monitoring tool used for observability and troubleshooting (e.g., identifying a congested link). However, it is not a management tool. Detecting a performance deviation is only half the battle; manual adjustment is still inefficient at scale compared to an automated orchestration tool that can remediate issues across the entire cluster.
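The "configuration drift" problem described above is easiest to catch programmatically. As a minimal illustrative sketch (this is not BCM's actual API; the node names and version strings below are hypothetical sample data), a short script can compare each DPU's reported BFB/firmware version against the intended golden version and flag outliers for remediation:

```python
# Illustrative drift check: compare reported DPU software versions against
# a "golden" target version. The inventory data is hypothetical; in practice
# it would come from an orchestration tool or per-node management queries.

GOLDEN_BFB = "2.7.0"  # hypothetical golden-stack version

inventory = {
    "dpu-rack1-01": "2.7.0",
    "dpu-rack1-02": "2.7.0",
    "dpu-rack2-01": "2.5.1",  # drifted node
}

def find_drift(inventory, golden):
    """Return the sorted list of nodes whose version differs from golden."""
    return sorted(node for node, ver in inventory.items() if ver != golden)

drifted = find_drift(inventory, GOLDEN_BFB)
print(drifted)  # nodes that need a standardized image re-pushed
```

The point of the sketch is the workflow, not the code: an orchestration layer answers "which nodes deviate from the golden stack?" automatically, where per-node SSH cannot.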
Question 45 of 60
45. Question
In a large-scale AI Factory deployment using a Leaf-Spine architecture, an architect must ensure that the network topology supports non-blocking communication for collective operations. When performing the initial configuration of the OOB and BMC network, which management protocol is primarily used to ensure that the infrastructure can be discovered and provisioned by NVIDIA Base Command Manager?
Correct: D. IPMI or Redfish for hardware management.
This is correct because IPMI (Intelligent Platform Management Interface) and Redfish are the standard protocols used for out-of-band (OOB) management of hardware components such as the Baseboard Management Controller (BMC). The NCP-AII certification explicitly lists "Perform initial configuration of BMC, OOB, and TPM" as a key task within the System and Server Bring-up domain, which comprises 31% of the examination. NVIDIA documentation confirms that "All devices connect to OOB with 1GbE for IPMI/Redfish". Base Command Manager relies on these protocols to discover and provision infrastructure nodes because they provide independent management access to servers regardless of host operating system state. The BlueField BMC documentation further emphasizes that IPMI and Redfish are the primary management interfaces for hardware lifecycle management, monitoring, and configuration.
Incorrect: A. BGP with EVPN for overlay networking.
This is incorrect because BGP (Border Gateway Protocol) with EVPN (Ethernet VPN) is a network protocol used for fabric routing and overlay networking in the data plane, not for hardware management and discovery. These protocols operate on the high-speed compute fabric (East/West traffic), not on the out-of-band management network.
B. GPUDirect Storage (GDS) for data transfer.
This is incorrect because GPUDirect Storage is a technology for enabling direct data paths between storage and GPU memory to accelerate I/O performance. It is a storage optimization feature, not a management protocol for hardware discovery and provisioning by cluster managers like Base Command Manager.
C. RDMA over Converged Ethernet (RoCE) v2.
This is incorrect because RoCE v2 is a network protocol for high-performance remote direct memory access (RDMA) over Ethernet networks, used for data plane communication in the compute or storage fabric. It is not used for out-of-band hardware management and does not facilitate server discovery by management software.
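The Redfish side of this discovery workflow can be sketched with a small response parser. The JSON payload below is an illustrative, heavily trimmed sample of the shape a BMC might return from a `/redfish/v1/Systems/<id>` endpoint; real responses contain many more fields and vendor-specific OEM sections, so treat the field set here as an assumption for demonstration only:

```python
import json

# Illustrative sample of a trimmed Redfish ComputerSystem payload
# (hypothetical values; real BMC responses vary by vendor).
sample_response = json.loads("""
{
  "Id": "System.Embedded.1",
  "PowerState": "On",
  "Status": {"Health": "OK", "State": "Enabled"},
  "SerialNumber": "ABC1234"
}
""")

def summarize_system(payload):
    """Extract the fields a provisioning tool typically checks first."""
    return {
        "power": payload.get("PowerState"),
        "health": payload.get("Status", {}).get("Health"),
        "serial": payload.get("SerialNumber"),
    }

print(summarize_system(sample_response))
```

Because Redfish exposes this data over the OOB network, a cluster manager can inventory and power-control a node even when no host OS is installed yet, which is exactly why option D is correct.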
Question 46 of 60
46. Question
A data center administrator is troubleshooting a connectivity issue between a BlueField-3 DPU and the InfiniBand fabric. The link is up, but performance is significantly lower than the expected 400Gbps. Which physical layer management step is most appropriate to identify the cause of the bandwidth degradation?
Correct: C. Use the 'mlxlink' tool to check for high bit-error rates (BER) and inspect the transceiver diagnostic data for low optical power levels.
This is the most appropriate physical layer management step because mlxlink is specifically designed to read and display detailed information about optical modules and diagnose physical layer issues. When performance is significantly lower than the expected 400Gbps despite the link being up, the cause is likely physical layer degradation such as a high bit-error rate (BER) or low optical power. The mlxlink -d <device> -m command reads and displays optical module information, including transceiver diagnostic data.
This approach directly aligns with the NCP-AII certification's Physical Layer Management and Troubleshoot and Optimize domains, which include validating cable types and transceivers, as well as identifying faulty components. The exam blueprint specifically covers "validate cables by verifying signal quality" and "confirm FW on transceivers" as critical verification tasks.
Incorrect: A. Execute an HPL stress test on the DPU's ARM cores to see if the thermal output of the DPU is affecting the signal quality of the OSFP ports.
This is incorrect. HPL (High-Performance Linpack) is a benchmark for measuring system performance on solving linear equations, not a diagnostic tool for signal quality. Thermal output affecting signal quality would not be diagnosed through an HPL stress test on ARM cores, as HPL is typically used for compute node validation and burn-in testing.
B. Configure the BMC to disable the Out-of-Band (OOB) network, forcing the DPU to use the internal NVLink fabric for its management heartbeats.
This is incorrect. The Out-of-Band network and NVLink fabric serve completely different purposes. OOB management is for administrative access to the BMC, while NVLink is a high-speed interconnect for GPU-to-GPU communication. Disabling OOB would not help diagnose bandwidth degradation on the InfiniBand fabric and could instead hinder management access to the system.
D. Reinstall the NVIDIA Container Toolkit to ensure that the Docker daemon is correctly prioritizing the DPU's management traffic over the data fabric.
This is incorrect. The NVIDIA Container Toolkit is used for enabling GPU access within containers and has no functionality related to InfiniBand fabric performance, DPU management traffic prioritization, or physical layer diagnostics. Reinstalling it would not address a physical layer bandwidth degradation issue on an InfiniBand link.
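To make the diagnosis concrete, the two symptoms above (high BER, low receive power) can be screened automatically. The sample text below only mimics the general shape of transceiver diagnostics, with simplified field names, and the thresholds are illustrative placeholders, not verbatim mlxlink output or NVIDIA-specified limits:

```python
# Screen transceiver diagnostics for high effective BER and low Rx optical
# power. Sample text and thresholds are illustrative assumptions, not real
# mlxlink output or official specifications.

sample_diag = """
Effective BER : 1e-7
Raw BER : 5e-6
Rx Power Lane 0 [dBm] : -9.8
Rx Power Lane 1 [dBm] : -2.1
"""

BER_LIMIT = 1e-12        # example healthy-link threshold
RX_POWER_MIN_DBM = -8.0  # example low-power alarm level

def parse_diag(text):
    """Parse 'key : value' lines into a dict of floats."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = float(value)
    return fields

def flag_issues(fields):
    issues = []
    if fields.get("Effective BER", 0.0) > BER_LIMIT:
        issues.append("high effective BER")
    for key, value in fields.items():
        if key.startswith("Rx Power") and value < RX_POWER_MIN_DBM:
            issues.append(f"low optical power on {key}")
    return issues

print(flag_issues(parse_diag(sample_diag)))
```

In this sample, the effective BER and lane 0's receive power would both be flagged, pointing the technician at a degraded transceiver or dirty fiber rather than at software.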
Question 47 of 60
47. Question
After the physical installation of several GPUs into a server, the administrator runs the command nvidia-smi and notices that one GPU is not appearing in the list, while others show P0 power states. According to the System and Server Bring-up domain, what is the most logical sequence of diagnostic steps to resolve this hardware detection issue?
Correct: D. Check physical power cable connections, inspect the PCIe slot for debris, and verify that the GPU is detected in the system BIOS or BMC hardware inventory.
The Technical Reason: If a GPU is missing from nvidia-smi while others are present, it implies a failure in hardware initialization or in the communication path between the GPU and the CPU.
Physical Layer: High-density HGX and OVX systems draw massive amounts of current. A loose power cable or poor seating in a PCIe slot (often caused by debris or physical misalignment during assembly) can prevent the GPU from completing its Power-On Self-Test (POST).
Firmware Layer: Before the OS even loads, the BIOS and BMC (Baseboard Management Controller) must successfully discover the device on the PCIe bus. If the BIOS doesn't see the card, the OS and the nvidia-smi driver never will.
The NCP-AII Context: The exam validates your ability to follow the NVIDIA-Certified bring-up sequence: Physical Verification → Firmware/BIOS Discovery → Driver Initialization → Functional Testing. Option D correctly addresses the first two steps of this hierarchy.
Incorrect: A. Reinstall Linux and upgrade Slurm. This is a software-heavy "brute force" approach that is highly inefficient. If the GPU is not detected by the hardware (BIOS/BMC), reinstalling the OS or upgrading an orchestration tool like Slurm will have no effect. Slurm only schedules resources that the underlying OS can see.
B. Increase PSU voltage via OOB. Modern data center Power Supply Units (PSUs) are digital and self-regulating. Manually "increasing voltage" is not a standard troubleshooting step and can damage sensitive components. If a GPU is failing to initialize due to power, it is typically a capacity issue (insufficient wattage) or a connection issue (faulty cable), not a voltage level that requires manual user adjustment.
C. Disable TPM and remove transceivers. While troubleshooting sometimes involves reducing the load to identify a conflict, disabling security features like the TPM (Root of Trust) or removing unrelated networking transceivers is irrelevant to a specific GPU detection failure. This approach lacks a logical connection to the PCIe discovery process.
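Once the physical checks pass, the same hierarchy can be confirmed from the OS: if the PCIe bus itself shows fewer NVIDIA devices than were installed, the fault is below the driver layer. A hedged sketch using illustrative sample lspci-style output (the device IDs and addresses are made up, not captured from a real system):

```python
# Compare NVIDIA devices visible on the PCIe bus against the expected GPU
# count. The lspci text below is illustrative sample output, not real data.

sample_lspci = """
17:00.0 3D controller: NVIDIA Corporation Device 2330
65:00.0 3D controller: NVIDIA Corporation Device 2330
b1:00.0 Ethernet controller: Mellanox Technologies MT2910
"""

EXPECTED_GPUS = 3  # e.g., three GPUs were physically installed

def count_nvidia_gpus(lspci_text):
    """Count lines that look like NVIDIA 3D controller entries."""
    return sum(
        1
        for line in lspci_text.strip().splitlines()
        if "NVIDIA" in line and "3D controller" in line
    )

seen = count_nvidia_gpus(sample_lspci)
if seen < EXPECTED_GPUS:
    # Missing at the bus level: recheck power cabling, slot seating, and
    # the BIOS/BMC inventory before touching drivers or the OS.
    print(f"only {seen}/{EXPECTED_GPUS} GPUs visible on the PCIe bus")
```

A device absent from the bus enumeration can never appear in nvidia-smi, which is why the correct sequence starts at the physical and firmware layers rather than with OS reinstalls.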
Question 48 of 60
48. Question
To allow Docker containers to utilize NVIDIA GPUs for accelerated computing, an engineer must install the NVIDIA Container Toolkit. Which of the following best describes the core modification made to the Docker configuration to enable this functionality on a Linux host?
Correct: B. Adding the NVIDIA runtime to the /etc/docker/daemon.json file.
The Technical Reason: To make GPUs visible and usable within a container, the standard Docker engine needs to be informed of a specialized runtime.
The Runtime: The NVIDIA Container Runtime (part of the NVIDIA Container Toolkit) acts as a shim for the standard runc. It handles the mounting of GPU device nodes (e.g., /dev/nvidia0) and driver libraries into the container.
The Configuration: The most common and standardized way to enable this is by modifying the Docker daemon configuration file located at /etc/docker/daemon.json. By adding the nvidia runtime and setting it as the default (or an available option), the docker run --gpus all command becomes functional.
The NCP-AII Context: The certification expects you to know the manual and automated steps for setting up the "NVIDIA Golden Stack." Modifying the daemon.json followed by a systemctl restart docker is the standard professional procedure verified in the exam.
Incorrect Options: A. Replacing the Docker executable with 'nvidia-docker'. While nvidia-docker was a standalone wrapper in earlier versions (v1), it has long been deprecated in favor of the NVIDIA Container Toolkit (v2+). Modern environments use the standard docker binary and simply pass a flag (like --runtime=nvidia or --gpus all) to the existing engine. Replacing the core Docker binary would break system updates and is not the supported method.
C. Disabling the Linux firewall. The Linux firewall (iptables/nftables) manages network traffic. GPU-to-CPU communication happens over the internal PCIe bus or NVLink, which are hardware interconnects that do not operate on the network stack managed by the OS firewall. Disabling the firewall provides no benefit for GPU discovery and creates a significant security risk.
D. Installing a new Linux kernel with built-in GPU support. NVIDIA GPU drivers are distributed as kernel modules (DKMS), not as part of the upstream Linux kernel source. While you must have a compatible kernel, you do not install a new kernel to get GPU support; instead, you install the NVIDIA driver package and the Toolkit on top of your existing supported kernel (e.g., Ubuntu LTS or Rocky Linux).
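As a sketch of the change described above, a daemon.json that registers the NVIDIA runtime and makes it the default typically looks like the following (your existing file may already contain other keys such as logging or storage options; merge this in rather than overwriting the file):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

After saving the file, restart the daemon with systemctl restart docker; GPU access can then be requested per container with docker run --gpus all against a CUDA-enabled image.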
Question 49 of 60
49. Question
Following the physical installation of an 8-node HGX H100 cluster, the team must run the High-Performance Linpack benchmark. What is the primary purpose of executing HPL during the cluster verification phase of an AI infrastructure deployment?
Correct: C. To verify the maximum floating-point performance and thermal stability.
The NCP-AII certification blueprint explicitly lists "Execute HPL (High-Performance Linpack)" and "Perform HPL burn-in" as core tasks within the Cluster Test and Verification domain, which comprises 33% of the examination.
HPL is the industry-standard benchmark for measuring the floating-point compute performance of supercomputers and is the basis for the TOP500 list.
It solves a dense system of linear equations using LU decomposition and reports a floating-point execution rate.
During cluster verification, HPL serves two primary purposes:
First, it verifies the maximum floating-point performance (measured in GFLOPS or TFLOPS) that the system can achieve under ideal conditions.
Second, running HPL as a burn-in test stresses the entire system (GPUs, CPUs, memory, and interconnect) under sustained maximum load, which validates thermal stability and ensures all components operate reliably without throttling or failure.
Incorrect: A. To ensure that the NVIDIA NGC CLI is correctly authenticated.
This is incorrect because NGC CLI authentication is a separate task within the Control Plane Installation and Configuration domain. HPL is a compute performance benchmark, not a tool for validating command-line interface authentication.
B. To test the latency of the management network DHCP server.
This is incorrect because the management network and DHCP services are part of the out-of-band (OOB) infrastructure. HPL tests compute performance across the high-speed fabric, not management network services. Network latency testing would be performed using different tools.
D. To check the read/write speeds of the local SATA boot drives.
This is incorrect because storage performance testing is a separate verification task explicitly listed in the exam blueprint under "Test storage". HPL focuses on floating-point computation and system thermal stability, not storage I/O performance. Local SATA boot drive speeds are irrelevant to HPL's purpose and would be validated through storage-specific benchmarks.
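A common step after an HPL run is to compare the measured result (Rmax) against the theoretical peak (Rpeak) to compute efficiency; a single node with markedly lower sustained efficiency than its peers is a classic sign of thermal throttling during burn-in. A minimal sketch of that per-node check (all figures are illustrative placeholders, not real H100 results, and the 80% cutoff is an example threshold, not an NVIDIA acceptance criterion):

```python
# Per-node HPL efficiency check: Rmax (measured) / Rpeak (theoretical).
# All numbers are illustrative placeholders, not real benchmark data.

def hpl_efficiency(rmax_tflops, rpeak_tflops):
    """Fraction of theoretical peak achieved by the measured HPL run."""
    return rmax_tflops / rpeak_tflops

# Hypothetical measured results from a burn-in across three nodes.
results = {
    "node01": 450.0,
    "node02": 452.0,
    "node03": 310.0,  # outlier: candidate for thermal investigation
}
RPEAK_TFLOPS = 500.0   # illustrative theoretical peak per node
EFFICIENCY_FLOOR = 0.80  # example cutoff, not an official criterion

for node, rmax in results.items():
    eff = hpl_efficiency(rmax, RPEAK_TFLOPS)
    status = "OK" if eff >= EFFICIENCY_FLOOR else "INVESTIGATE"
    print(f"{node}: {eff:.0%} {status}")
```

This is why HPL doubles as a thermal validation: the benchmark score alone verifies performance, while the per-node consistency of that score under sustained load verifies cooling and power delivery.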
Question 50 of 60
50. Question
When configuring a Multi-Instance GPU (MIG) environment, why is it necessary to understand the difference between ‘GPU Instances‘ and ‘Compute Instances‘? Which statement correctly describes the hierarchy and relationship between these two MIG components?
Correct: A A GPU Instance defines the memory and cache allocation, while a Compute Instance is created within it to define the actual compute resources (SMs) available.
The Technical Reason: MIG uses a nested hierarchy to ensure strict Quality of Service (QoS):
GPU Instance (GI): This is the “parent” container. When you create a GI, you are carving out a specific slice of the GPU's physical hardware, including Video RAM (VRAM) and the L2 Cache. This provides memory isolation so that a process in one partition cannot starve the memory bandwidth of another.
Compute Instance (CI): This is created inside a GPU Instance. It defines the number of Streaming Multiprocessors (SMs) and hardware engines (like engines for video decoding or DMA) allocated to that partition.
The Relationship: You cannot have a Compute Instance without a parent GPU Instance. A single GPU Instance can technically host multiple Compute Instances (sharing the memory of that GI), but most standard profiles (like 1g.10gb) create a 1:1 relationship for total isolation.
The NCP-AII Context: The certification requires you to know the specific nvidia-smi commands to build this hierarchy:
nvidia-smi mig -cgi (Create GPU Instance)
nvidia-smi mig -cci (Create Compute Instance within that GI)
Incorrect Options: B. Terms are used interchangeably This is a common misconception, but technically incorrect. In nvidia-smi, they are distinct objects with unique IDs. Confusing the two in a production environment could lead to configuration errors where memory is allocated but no compute resources are assigned, resulting in a “zombie“ partition that cannot execute kernels.
C. GPU Instances for graphics, Compute Instances for LLMs This is factually incorrect. Both components are required for any workload running on a MIG-enabled GPU, whether it is AI training, inference, or high-performance computing (HPC). MIG is a hardware partitioning strategy, not a workload-type filter.
D. Compute Instances are created first This reverses the actual hardware logic. Because compute resources (SMs) require access to memory and cache to function, the memory-backed GPU Instance must be established first to provide the physical “home“ for the Compute Instance.
Question 51 of 60
51. Question
An administrator needs to partition a physical NVIDIA H100 GPU into multiple instances to support diverse workloads ranging from small-scale inference to development tasks. Using Multi-Instance GPU (MIG) technology, what is a critical requirement regarding the memory and compute resources allocated to each instance?
Correct: B Each MIG instance provides hardware-level isolation with dedicated high-bandwidth memory and compute cores, ensuring that one instance does not impact another.
The Technical Reason: Unlike software-based partitioning (like simple Docker resource limits), MIG provides physical hardware isolation.
Dedicated Resources: When a MIG instance is created, a specific portion of the GPU's High Bandwidth Memory (HBM), cache, and Streaming Multiprocessors (SMs) are physically partitioned.
Fault Isolation: Because the paths to memory and compute are dedicated at the hardware level, a memory error or a “runaway” kernel in one MIG instance cannot crash or slow down another instance on the same physical GPU. This provides the Quality of Service (QoS) required for production environments.
The NCP-AII Context: The exam validates your understanding that MIG is a hardware-level feature. You are expected to know that this isolation is what allows “diverse workloads” (like a sensitive inference task and a chaotic development task) to run side-by-side without interference.
Incorrect: A. Total memory can exceed physical memory via swap space MIG does not support oversubscription or “virtual swap“ on host SSDs. Each MIG instance must be backed by the GPU‘s actual physical HBM. The sum of the memory allocated to all MIG instances cannot exceed the total capacity of the physical GPU (e.g., 80GB for an H100). If you attempt to create a partition that exceeds the remaining physical memory, the command will fail.
C. MIG configuration can be changed dynamically during a workload MIG configurations are not dynamic in the sense that you cannot resize or delete a partition while it is actively being used by a compute kernel. To change the MIG profile (e.g., moving from seven 1g.10gb instances to two 3g.40gb instances), all active processes on the GPU must be terminated, and the instances must be destroyed and recreated.
D. MIG instances must all be of the same size One of the primary benefits of MIG is flexibility. A single H100 can be partitioned into a mix of different sizes (e.g., one 3g.40gb instance for a small training job and four 1g.10gb instances for inference). Slurm is fully capable of managing a “heterogeneous“ cluster where different nodes or even different GPUs within a node have different MIG configurations, provided the gres.conf is configured correctly.
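The no-oversubscription rule discussed in option A can be sketched as a simple capacity check. Note that this is an illustration only: the profile sizes assume an 80 GB H100, and real MIG placement is additionally constrained by slice layout rules, not just the memory sum.

```python
# Sketch: MIG partitions must fit within physical HBM -- no swap, no oversubscription.
# Profile memory sizes below assume an 80 GB H100 (illustrative values).
H100_HBM_GB = 80
PROFILES = {"1g.10gb": 10, "2g.20gb": 20, "3g.40gb": 40, "7g.80gb": 80}

def plan_fits(requested: list[str]) -> bool:
    """True if the requested partitions fit in physical GPU memory."""
    return sum(PROFILES[p] for p in requested) <= H100_HBM_GB

print(plan_fits(["3g.40gb", "1g.10gb", "1g.10gb"]))  # True  (60 GB <= 80 GB)
print(plan_fits(["3g.40gb", "3g.40gb", "1g.10gb"]))  # False (90 GB > 80 GB)
```

In practice, `nvidia-smi mig -cgi` performs this validation itself and rejects any profile that exceeds the remaining physical memory.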
Question 52 of 60
52. Question
While optimizing an AI infrastructure that includes both AMD and Intel-based servers, which specific performance optimization technique is most relevant for ensuring the CPU does not become a bottleneck for GPU data transfers?
Correct: D Configure the CPU energy performance bias to ‘Performance‘ and ensure the correct NUMA affinity is set for the GPU processes.
The Technical Reason:
Energy Bias: Modern CPUs (both Intel and AMD) use aggressive power management. If the CPU is set to “Balanced” or “Powersave,” the latency involved in ramping up clock speeds to handle a sudden burst of data ingestion can stall the GPU. Setting the bias to Performance ensures the CPU remains at its highest frequency state.
NUMA Affinity: Modern servers are Non-Uniform Memory Access (NUMA) systems. A GPU is physically wired to a specific CPU socket (or “node”). If a training process runs on CPU cores in Socket 0 but tries to access a GPU wired to Socket 1, the data must travel across the slower inter-socket link (QPI/UPI for Intel, Infinity Fabric for AMD), increasing latency and decreasing bandwidth. Ensuring NUMA affinity means pinning the process to the cores directly attached to that GPU's PCIe root complex.
The NCP-AII Context: The exam validates your ability to optimize the “System Stack.“ You are expected to know how to use tools like numactl or lstopo to verify that GPUs and NICs are balanced across the CPU topology.
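Besides `numactl --cpunodebind`, the pinning itself can be done from inside a launcher with the Linux scheduler API. A minimal sketch follows; the core set is a placeholder, since the real GPU-to-core mapping must come from `nvidia-smi topo -m` or `lstopo` on the actual node (Linux only):

```python
import os

# Hypothetical example: suppose `nvidia-smi topo -m` showed that GPU 0 hangs
# off the PCIe root complex of socket 0. We pin the current process to a
# core set local to that socket, keeping host<->GPU traffic off the
# inter-socket link (UPI / Infinity Fabric).
gpu0_local_cores = {0}  # placeholder; on a real node this might be set(range(0, 16))

os.sched_setaffinity(0, gpu0_local_cores)   # pid 0 = the current process
print(sorted(os.sched_getaffinity(0)))      # prints [0]
```

Data-loader worker processes inherit this affinity on fork, so pinning the parent before spawning workers keeps the whole ingestion pipeline NUMA-local.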
Incorrect Options: A. Install third-party antivirus to scan CUDA kernels Antivirus software operates at the OS/Filesystem level. CUDA kernels are compiled binary code executed directly on the GPU‘s Streaming Multiprocessors (SMs). Scanning every kernel call would introduce massive latency and is not a standard or supported security practice in high-performance AI clusters.
B. Reduce PCIe link speed to Gen2 This is the opposite of optimization. To prevent the CPU/System from becoming a bottleneck, you want the highest possible bandwidth. Reducing an H100 (Gen5) to Gen2 would slash the available bandwidth from ~63GB/s to ~8GB/s per 16x slot, causing a massive performance drop in data-heavy training tasks.
C. Disable all CPU caches CPU caches (L1/L2/L3) are essential for hiding the high latency of system RAM. Disabling them would force the CPU to wait hundreds of cycles for every memory access, making it significantly slower. This would guarantee that the CPU becomes a massive bottleneck for the GPU, rather than solving it.
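The bandwidth gap quoted in option B can be reproduced from the PCIe line rates and encoding overheads (a back-of-the-envelope sketch; real achievable throughput is slightly lower due to protocol overhead):

```python
# Unidirectional PCIe bandwidth in GB/s: line rate (GT/s) x lanes x encoding efficiency / 8.
def pcie_gbps(gt_per_s: float, lanes: int, encoding: float) -> float:
    return gt_per_s * lanes * encoding / 8

gen2 = pcie_gbps(5.0, 16, 8 / 10)      # Gen2 uses 8b/10b encoding
gen5 = pcie_gbps(32.0, 16, 128 / 130)  # Gen5 uses 128b/130b encoding
print(round(gen2), round(gen5))        # 8 63
```

This is where the "~8 GB/s vs ~63 GB/s per x16 slot" figures come from.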
Question 53 of 60
53. Question
To confirm the integrity of the high-speed signal paths in an AI factory, an engineer must validate the cables and transceivers. The engineer notices that some 400G InfiniBand links are showing a high Bit Error Rate (BER) during the NeMo burn-in test. Which verification step is most appropriate to confirm the health of the transceivers and the signal quality?
Correct: B Use the mlxlink tool to check the eye diagram parameters and the pre-FEC (Forward Error Correction) BER values for the problematic ports.
The Technical Reason: At 400G speeds, the physical margin for signal error is extremely slim.
The Tool: mlxlink (part of the NVIDIA Firmware Tools – MFT) is the definitive utility for debugging link-level issues. It interacts directly with the NIC and transceiver firmware.
Eye Diagrams: An “eye diagram“ is a visual representation of signal quality. A “closed eye“ indicates high noise or jitter, usually caused by a faulty cable or transceiver. mlxlink -e allows an administrator to view these scan results.
Pre-FEC BER: Modern high-speed links use Forward Error Correction (FEC) to fix minor bit flips. However, if the Pre-FEC BER (the error rate before the hardware fixes it) is too high, the link will eventually flap or drop packets. Monitoring this value is the “gold standard“ for identifying a degrading cable before it fails completely.
The NCP-AII Context: The exam expects you to differentiate between high-level management tools (like UFM) and low-level diagnostic tools (like MFT/mlxlink). Knowing how to interpret physical layer counters to solve “silent“ performance degradation is a key requirement for this professional-level certification.
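The BER arithmetic itself is straightforward; what matters is comparing the raw (pre-FEC) rate against an alarm threshold. The sketch below uses illustrative numbers and an assumed 1e-5 threshold; the real counters and vendor-recommended limits come from `mlxlink` output on the actual port:

```python
# Sketch: judging link health from raw (pre-FEC) error counters.
# The error count, sample window, and 1e-5 alarm threshold are illustrative
# assumptions, not values from any real device.
def pre_fec_ber(bit_errors: int, bits_transferred: int) -> float:
    return bit_errors / bits_transferred

# e.g. 4,000 raw bit errors observed over 1 second on a 400 Gb/s link
ber = pre_fec_ber(4_000, 400_000_000_000)
print(f"{ber:.1e}")  # 1.0e-08
assert ber < 1e-5    # below the (assumed) alarm threshold, so FEC can keep up
```

A link whose pre-FEC BER trends upward over successive samples is degrading even while FEC still delivers error-free frames, which is exactly the "silent" failure mode mentioned above.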
Incorrect Options: A. Swap InfiniBand with Category 6e Ethernet This is technically impossible. 400G InfiniBand uses specialized OSFP or QSFP-DD transceivers and Twinax/Optical cables. Standard RJ45 Category 6e Ethernet cables are limited to 10GbE and are physically incompatible with InfiniBand switch ports and HCAs.
C. Run the ‘df -h‘ command for disk usage df -h is a filesystem utility used to check disk space. While storage performance can affect an AI pipeline, physical disk capacity has no relationship with electrical bit errors on a network transceiver. Electrical interference is typically caused by poor shielding, cable bends, or power supply noise, not by how much data is stored on a drive.
D. Decrease MTU from 9000 to 1500 Changing the Maximum Transmission Unit (MTU) is a layer-3 networking configuration. While reducing MTU can sometimes help with packet fragmentation issues, it does not fix a Bit Error Rate (BER) issue. BER is a layer-1 physical problem; if the wire is “noisy,“ it will corrupt small packets just as easily as large ones. Furthermore, reducing MTU to 1500 in an AI cluster would severely cripple performance by increasing CPU overhead.
Incorrect
Correct: B Use the mlxlink tool to check the eye diagram parameters and the pre-FEC (Forward Error Correction) BER values for the problematic ports.
The Technical Reason: At 400G speeds, the physical margin for signal error is extremely slim.
The Tool: mlxlink (part of the NVIDIA Firmware Tools – MFT) is the definitive utility for debugging link-level issues. It interacts directly with the NIC and transceiver firmware.
Eye Diagrams: An “eye diagram“ is a visual representation of signal quality. A “closed eye“ indicates high noise or jitter, usually caused by a faulty cable or transceiver. mlxlink -e allows an administrator to view these scan results.
Pre-FEC BER: Modern high-speed links use Forward Error Correction (FEC) to fix minor bit flips. However, if the Pre-FEC BER (the error rate before the hardware fixes it) is too high, the link will eventually flap or drop packets. Monitoring this value is the “gold standard“ for identifying a degrading cable before it fails completely.
The NCP-AII Context: The exam expects you to differentiate between high-level management tools (like UFM) and low-level diagnostic tools (like MFT/mlxlink). Knowing how to interpret physical layer counters to solve “silent“ performance degradation is a key requirement for this professional-level certification.
Incorrect Options: A. Swap InfiniBand with Category 6e Ethernet This is technically impossible. 400G InfiniBand uses specialized OSFP or QSFP-DD transceivers and Twinax/Optical cables. Standard RJ45 Category 6e Ethernet cables are limited to 10GbE and are physically incompatible with InfiniBand switch ports and HCAs.
C. Run the ‘df -h‘ command for disk usage df -h is a filesystem utility used to check disk space. While storage performance can affect an AI pipeline, physical disk capacity has no relationship with electrical bit errors on a network transceiver. Electrical interference is typically caused by poor shielding, cable bends, or power supply noise, not by how much data is stored on a drive.
D. Decrease MTU from 9000 to 1500 Changing the Maximum Transmission Unit (MTU) is a layer-3 networking configuration. While reducing MTU can sometimes help with packet fragmentation issues, it does not fix a Bit Error Rate (BER) issue. BER is a layer-1 physical problem; if the wire is “noisy,“ it will corrupt small packets just as easily as large ones. Furthermore, reducing MTU to 1500 in an AI cluster would severely cripple performance by increasing CPU overhead.
Unattempted
Correct: B Use the mlxlink tool to check the eye diagram parameters and the pre-FEC (Forward Error Correction) BER values for the problematic ports.
The Technical Reason: At 400G speeds, the physical margin for signal error is extremely slim.
The Tool: mlxlink (part of the NVIDIA Firmware Tools – MFT) is the definitive utility for debugging link-level issues. It interacts directly with the NIC and transceiver firmware.
Eye Diagrams: An "eye diagram" is a visual representation of signal quality. A "closed eye" indicates high noise or jitter, usually caused by a faulty cable or transceiver. mlxlink -e allows an administrator to view these scan results.
Pre-FEC BER: Modern high-speed links use Forward Error Correction (FEC) to fix minor bit flips. However, if the Pre-FEC BER (the error rate before the hardware fixes it) is too high, the link will eventually flap or drop packets. Monitoring this value is the "gold standard" for identifying a degrading cable before it fails completely.
The NCP-AII Context: The exam expects you to differentiate between high-level management tools (like UFM) and low-level diagnostic tools (like MFT/mlxlink). Knowing how to interpret physical layer counters to solve "silent" performance degradation is a key requirement for this professional-level certification.
Incorrect Options: A. Swap InfiniBand with Category 6e Ethernet This is technically impossible. 400G InfiniBand uses specialized OSFP or QSFP-DD transceivers and Twinax/Optical cables. Standard RJ45 Category 6e Ethernet cables are limited to 10GbE and are physically incompatible with InfiniBand switch ports and HCAs.
C. Run the 'df -h' command for disk usage df -h is a filesystem utility used to check disk space. While storage performance can affect an AI pipeline, physical disk capacity has no relationship with electrical bit errors on a network transceiver. Electrical interference is typically caused by poor shielding, cable bends, or power supply noise, not by how much data is stored on a drive.
D. Decrease MTU from 9000 to 1500 Changing the Maximum Transmission Unit (MTU) is a higher-layer networking configuration. While reducing MTU can sometimes help with packet fragmentation issues, it does not fix a Bit Error Rate (BER) issue. BER is a layer-1 physical problem; if the wire is "noisy," it will corrupt small packets just as easily as large ones. Furthermore, reducing MTU to 1500 in an AI cluster would severely cripple performance by increasing CPU overhead.
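The arithmetic behind the pre-FEC check is simple enough to sketch. The snippet below derives a BER from a raw error counter and an observation window, then compares it against an alert level; the 400 Gb/s line rate and the 1e-5 threshold are illustrative assumptions, not NVIDIA-specified values, and the counter would in practice come from mlxlink output.

```python
# Illustrative sketch of the pre-FEC BER check described above.
# The 400 Gb/s line rate and 1e-5 alert threshold are assumptions
# for illustration only.

LINE_RATE_BPS = 400e9     # aggregate bit rate of a 400G link (assumed)
ALERT_THRESHOLD = 1e-5    # illustrative pre-FEC BER alert level

def pre_fec_ber(error_bits: int, window_seconds: float,
                line_rate_bps: float = LINE_RATE_BPS) -> float:
    """BER = errored bits / total bits transferred in the window."""
    total_bits = line_rate_bps * window_seconds
    return error_bits / total_bits

def link_is_degrading(error_bits: int, window_seconds: float) -> bool:
    return pre_fec_ber(error_bits, window_seconds) > ALERT_THRESHOLD

# 24,000 errored bits over a 60 s window is a BER of 1e-9, well within
# what FEC can correct, so this link is not flagged.
print(pre_fec_ber(24_000, 60.0))        # 1e-09
print(link_is_degrading(24_000, 60.0))  # False
```

Trending this value over time, rather than reading it once, is what catches a cable that is degrading but not yet failed.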
Question 54 of 60
54. Question
To verify the health and performance of a newly installed AI cluster, an engineer is running the High-Performance Linpack (HPL) benchmark. They notice that the performance on a single node is significantly lower than the theoretical peak for an HGX H100 system. Which validation step should be performed next to identify the bottleneck?
Correct: D Run a single-node NCCL test to verify that all NVLink switches and GPU-to-GPU interconnects are operating at the expected bandwidth.
The Technical Reason: HPL relies heavily on high-speed data exchange between GPUs during large matrix factorizations. In an HGX H100 system, this communication happens over NVLink via internal NVSwitch chips.
The Bottleneck: If one NVLink is down or underperforming (e.g., due to a firmware mismatch or a hardware defect in the NVSwitch), the entire "collective" operation slows down to the speed of the slowest link.
The Tool: The NCCL (NVIDIA Collective Communications Library) Tests (specifically all_reduce_perf) are the primary diagnostic tools for validating the internal fabric. If the bandwidth is significantly lower than the theoretical peak (e.g., 900 GB/s for H100), it explains the HPL degradation.
The NCP-AII Context: The exam validates your ability to use the NVIDIA-Certified validation suite. You are expected to know that HPL performance is a derivative of three things: GPU Compute, Memory Bandwidth, and Interconnect Bandwidth (NVLink).
Incorrect Options: A. Check power consumption of InfiniBand switches While InfiniBand is critical for multi-node HPL, the question specifies the performance issue is on a single node. Single-node HPL performance is dominated by the internal NVLink fabric, not the external InfiniBand switches. Furthermore, InfiniBand switches do not "draw current" to support computation; they merely route data packets.
B. Use open-source Nouveau drivers The Nouveau drivers are reverse-engineered and do not support the high-performance features required for HPL, such as CUDA, Tensor Cores, or NVLink. Switching to Nouveau would result in a massive performance drop (often 100x slower) or the benchmark failing to run entirely.
C. Increase the Linux swap partition HPL is designed to run entirely in physical memory (VRAM for GPUs and System RAM for CPUs). If HPL begins to "swap" to a disk-based partition, performance will collapse by orders of magnitude due to disk latency. Increasing swap space is a workaround for memory exhaustion, not a performance optimization for high-performance computing.
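To make the comparison concrete, the sketch below converts the algorithm bandwidth reported by all_reduce_perf into bus bandwidth using the 2*(n-1)/n factor the nccl-tests suite documents for all-reduce, then flags results far below the expected peak. The 80% tolerance is an illustrative choice, not an NVIDIA acceptance criterion.

```python
# Sketch: interpret all_reduce_perf results. nccl-tests reports an
# algorithm bandwidth (bytes moved / time) and converts it to a "bus
# bandwidth" comparable against hardware link speed; for all-reduce
# the documented conversion factor is 2*(n-1)/n for n ranks.

def allreduce_busbw(algbw_gbs: float, n_ranks: int) -> float:
    return algbw_gbs * 2 * (n_ranks - 1) / n_ranks

def looks_degraded(busbw_gbs: float, expected_peak_gbs: float,
                   tolerance: float = 0.8) -> bool:
    # The 80% tolerance is an assumption for illustration.
    return busbw_gbs < tolerance * expected_peak_gbs

# 8 GPUs in one HGX H100 node, ~900 GB/s expected NVLink bus bandwidth:
bw = allreduce_busbw(270.0, 8)
print(bw)                         # 472.5
print(looks_degraded(bw, 900.0))  # True -> inspect NVLink/NVSwitch links
```

A healthy node should land close to the theoretical peak; a result like the one above points at the internal fabric rather than the GPUs themselves.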
Question 55 of 60
55. Question
A cluster administrator is configuring a new compute node category in Base Command Manager and needs to ensure that all nodes have the correct software stack for AI development. What is the correct sequence of software installation to enable users to run containerized PyTorch jobs using Slurm, Enroot, and Pyxis?
Correct: D Install the NVIDIA GPU drivers, then the NVIDIA Container Toolkit, followed by the Enroot and Pyxis plugins for Slurm integration.
The Technical Reason: The stack is built from the hardware up to the user interface:
NVIDIA GPU Drivers: The foundation. Without the kernel-level drivers, the hardware is inaccessible.
NVIDIA Container Toolkit: This provides the nvidia-container-runtime, which allows containers to "see" and interface with the GPU drivers and device nodes.
Enroot: A tool that turns container images (like Docker/OCI) into unprivileged sandboxes. It depends on the Container Toolkit to enable GPU acceleration within those sandboxes.
Pyxis: This is a Slurm plugin. It allows Slurm's srun command to interact with Enroot. It is the final piece that enables the user-facing command: srun --container-image=…
The NCP-AII Context: The exam validates your ability to deploy the NVIDIA-Certified software stack. Understanding that Pyxis is an "extension" of Slurm that relies on the underlying container runtime (Enroot) is a core requirement for the "Software and Application" domain.
Incorrect Options: A. Install Pyxis first and DOCA drivers for container networking You cannot install a Slurm plugin (Pyxis) effectively if the underlying container runtime it manages (Enroot) isn't there. Furthermore, while DOCA is essential for DPU-accelerated networking, it is not a prerequisite for basic PyTorch container execution on standard compute nodes.
B. Install Slurm and Enroot on storage controllers Slurm and Enroot must be installed on the compute nodes where the work is actually performed. Installing them on storage controllers would not enable the GPUs on the compute nodes to run jobs. Additionally, nvidia-smi is a monitoring/management tool; it is not a package manager used to "push" the Container Toolkit to a DPU.
C. Download PyTorch into BMC firmware The BMC (Baseboard Management Controller) has very limited memory and is intended for out-of-band hardware management (power, thermal, BIOS). It cannot run heavy AI frameworks like PyTorch. Furthermore, Slurm does not "PXE boot containers"; it schedules tasks on an already-booted operating system.
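The install order above can be modeled as a small dependency graph, where any topological order is a valid installation sequence. The names below are descriptive labels for illustration, not exact distribution package names.

```python
# Sketch: the stack's dependency chain. A topological sort always
# places the driver before the Container Toolkit, the toolkit before
# Enroot, and Pyxis last.
from graphlib import TopologicalSorter  # Python 3.9+

DEPENDS_ON = {
    "nvidia-driver": [],                            # kernel-level foundation
    "nvidia-container-toolkit": ["nvidia-driver"],  # exposes GPUs to containers
    "slurm": [],                                    # workload manager
    "enroot": ["nvidia-container-toolkit"],         # unprivileged container runtime
    "pyxis": ["enroot", "slurm"],                   # Slurm plugin, installed last
}

order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)  # drivers precede the toolkit; Pyxis is always last
```

Installing in any order that violates this graph (e.g., option A's Pyxis-first approach) is exactly the dependency failure the explanation describes.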
Question 56 of 60
56. Question
A cluster administrator is configuring a new compute node category in Base Command Manager and needs to ensure that all nodes in this category have the correct software stack for AI development. What is the correct sequence of software installation to enable users to run containerized PyTorch jobs using Slurm and Enroot?
Correct: B Install the NVIDIA GPU drivers, then the NVIDIA Container Toolkit, followed by the Enroot and Pyxis plugins for Slurm integration.
The Technical Reason: The stack follows a logical hierarchy from hardware to the user interface:
NVIDIA GPU Drivers: The kernel-level foundation. These must be installed first so the OS can communicate with the physical GPU hardware.
NVIDIA Container Toolkit: This layer includes libnvidia-container, which provides the necessary libraries to mount GPU device nodes into container runtimes.
Enroot: NVIDIA's container runtime for HPC. It sits on top of the Container Toolkit to turn Docker/OCI images into unprivileged sandboxes while maintaining native GPU performance.
Pyxis: A specialized SPANK plugin for Slurm. It acts as the "glue" that allows Slurm's srun command to trigger Enroot to pull and run container images (e.g., using the --container-image flag).
The NCP-AII Context: The exam validates your ability to provision compute nodes using Base Command Manager (BCM). In a typical BCM workflow, you would use cm-chroot-sw-img to enter a software image and install these components in this specific order before "committing" the image and rebooting your nodes.
Incorrect Options: A. Download containers into BMC firmware The BMC (Baseboard Management Controller) is a small, independent processor used for out-of-band management (power, thermal, and remote console). It does not have the storage capacity or the compute architecture to host or "PXE boot" large AI container images like PyTorch.
C. Install Pyxis first and DOCA drivers on the head node Pyxis is a plugin that requires the Enroot runtime to be present; installing it first would lead to dependency failures. Furthermore, while DOCA drivers are essential for BlueField DPUs and advanced networking, they are not a prerequisite for enabling basic containerized GPU access on standard compute nodes.
D. Install Slurm/Enroot on storage controllers and push via SMI Slurm (the compute daemon) and Enroot must be installed on the compute nodes where the GPUs reside, not on the storage controllers. Additionally, nvidia-smi is a monitoring and management utility; it is not a deployment tool used to "push" software stacks to hardware over InfiniBand.
Question 57 of 60
57. Question
When setting up a MIG (Multi-Instance GPU) configuration for a multi-tenant AI environment, an administrator needs to ensure that memory and cache isolation are strictly enforced between different users. Which MIG profile characteristic ensures that compute resources are dedicated and not shared with other instances on the physical GPU?
Correct: D The selection of a specific Slice (such as 1g.10gb) which provides hardware-level isolation of the memory controller and SMs.
The Technical Reason: MIG allows a physical GPU (like an H100 or A100) to be partitioned into independent GPU Instances.
Dedicated Hardware Paths: Each instance has its own isolated paths through the entire memory system, including specific memory controllers, DRAM address buses, and even L2 cache banks.
Resource Guarantee: When you assign a profile like 1g.10gb, you are physically allocating 1/7th of the available Streaming Multiprocessors (SMs) and 10GB of high-bandwidth memory to that user. This prevents the "noisy neighbor" effect where one user's intensive memory access could otherwise evict data from another user's cache.
The NCP-AII Context: The certification focuses on the ability to provide predictable throughput and latency. Option D is the only one that describes the spatial (hardware) partitioning that defines MIG.
Incorrect Options: A. Enabling the Overcommit flag NVIDIA MIG does not support overcommit. One of the strict rules of MIG is that it is a hard partition of physical resources. You cannot allocate more VRAM than is physically present on the card, nor can instances "borrow" memory from each other. If a MIG instance runs out of memory, it will trigger an Out of Memory (OOM) error rather than swapping or overcommitting.
B. Shared profiles and bursting into L2 cache This is factually incorrect regarding MIG. The primary goal of MIG is to prevent sharing of the L2 cache. By assigning unique L2 cache banks to each instance, NVIDIA ensures that a memory-heavy workload in Instance A cannot "thrash" or impact the cache performance of Instance B.
C. Configuring the GPU in Time-Slice mode Time-Slicing is a temporal (time-based) sharing method, not a spatial one. In time-slice mode, different tasks take turns using the entire GPU for a short duration (e.g., 10ms). While this allows multiple users to share a GPU, it provides no memory isolation and no performance guarantees, as one task can still consume all the memory or interfere with the cache of the next task in the queue.
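The hard-partition rule can be sketched as a simple accounting model: every profile is a fixed slice of SMs and memory, and a set of requested instances either fits or fails. The figures below assume an 80 GB H100 with seven compute slices; real MIG placement rules (slice alignment, permitted profile combinations) are stricter than this model.

```python
# Simplified model of MIG's spatial partitioning: no overcommit, no
# borrowing. Profile figures assume an 80 GB H100 (illustrative).

PROFILES = {            # profile: (compute slices, memory in GB)
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "7g.80gb": (7, 80),
}
TOTAL_SLICES, TOTAL_MEM_GB = 7, 80

def fits(requested: list[str]) -> bool:
    slices = sum(PROFILES[p][0] for p in requested)
    mem_gb = sum(PROFILES[p][1] for p in requested)
    return slices <= TOTAL_SLICES and mem_gb <= TOTAL_MEM_GB

print(fits(["3g.40gb", "3g.40gb"]))             # True: 6 slices, 80 GB
print(fits(["3g.40gb", "3g.40gb", "1g.10gb"]))  # False: would need 90 GB
```

The second request fails even though a compute slice is still free, which is the point: memory is a hard physical allocation, never overcommitted.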
Question 58 of 60
58. Question
A research team needs to run multiple small AI inference jobs on a single H100 GPU to maximize resource utilization. The administrator decides to implement Multi-Instance GPU (MIG). Which of the following conditions must be met to successfully configure and partition the GPU into multiple hardware-isolated instances?
Correct: B. The GPU must be in a specific MIG mode enabled through nvidia-smi.
The NCP-AII certification blueprint explicitly includes "MIG (Multi-Instance GPU) enablement and management" as a core topic within the Physical Layer Management domain.
To successfully configure and partition a GPU into multiple hardware-isolated instances, MIG mode must first be enabled on the GPU using the nvidia-smi command.
The specific command to enable MIG mode is: sudo nvidia-smi -i <GPU_ID> -mig 1.
Enabling MIG mode requires that no other processes are using the GPU, and it typically requires a GPU reset or system reboot for the change to take effect.
After enabling MIG mode, the administrator can then create specific GPU instances and compute instances using additional nvidia-smi mig commands.
This prerequisite of enabling MIG mode is fundamental before any partitioning can occur, making it a critical condition that must be met.
Incorrect: A. The server must have at least 1TB of system RAM for MIG to function.
This is incorrect because MIG operates at the GPU level and partitions GPU resources (memory, cache, compute cores) independently of system RAM. The MIG documentation makes no mention of system RAM requirements for enabling MIG functionality. MIG instances have their own dedicated GPU memory (e.g., 10GB, 20GB, 40GB slices) that is allocated from the GPU's onboard memory, not from system RAM.
C. The administrator must disable all NVLink connections between GPUs.
This is incorrect because NVLink connections are not required to be disabled for MIG to function. While there are limitations regarding CUDA Inter-Process Communication (IPC) across MIG instances, which affect data transfers over NVLink and NVSwitch, disabling NVLink is not a prerequisite for enabling MIG mode. MIG can be enabled and configured while NVLink remains active, though some cross-instance communication features may be restricted.
D. The GPU must be connected via an external USB-C Thunderbolt cable.
This is incorrect because NVIDIA data center GPUs that support MIG (such as A100, H100, A30) are designed for internal PCIe connectivity within servers. MIG functionality is independent of the physical connection method to the host system and certainly does not require USB-C or Thunderbolt connections, which are not used for enterprise GPU installation.
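One way to verify the prerequisite across a node is to query the current MIG mode with nvidia-smi and parse the result. The sketch below parses canned output of `nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader`; the sample text is illustrative, not captured from live hardware.

```python
# Sketch: find GPUs that still need MIG mode enabled, given output in
# the shape produced by:
#   nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader
# The sample below is illustrative, not from a real system.

SAMPLE_OUTPUT = """\
0, Enabled
1, Enabled
2, Disabled
"""

def gpus_without_mig(output: str) -> list[int]:
    """Return GPU indices whose current MIG mode is not 'Enabled'."""
    missing = []
    for line in output.strip().splitlines():
        index, mode = (field.strip() for field in line.split(","))
        if mode != "Enabled":
            missing.append(int(index))
    return missing

print(gpus_without_mig(SAMPLE_OUTPUT))  # [2]
```

Any index returned would then be enabled with the command from the explanation (e.g., sudo nvidia-smi -i 2 -mig 1) followed by a GPU reset.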
Question 59 of 60
59. Question
When configuring the network topology for a large-scale AI factory, an administrator must decide on the cabling and transceiver types for the compute fabric. The design requires consistent high-speed low-latency communication across the InfiniBand fabric. If the distance between the leaf and spine switches is 50 meters, which cabling solution is most appropriate to ensure signal integrity and meet the hardware operation requirements for GPU workloads?
Correct: D. Active Optical Cables or transceivers with multi-mode fiber which provide the necessary reach and signal quality for distances exceeding the limits of passive copper.
The NCP-AII certification blueprint explicitly includes “Describe and validate cable types and transceivers” as a core task within the System and Server Bring-up domain, which comprises 31% of the examination.
For a distance of 50 meters between leaf and spine switches, passive copper Direct Attach Cables (DACs) are not suitable because their reach is limited, typically to 3-5 meters at high InfiniBand speeds.
NVIDIA documentation specifies that for InfiniBand NDR and 400G Ethernet, multi-mode fiber (MMF) with Active Optical Cables (AOCs) or transceivers supports distances up to 50 meters (including patch panels, trunks, and cables).
Active Optical Cables offer several advantages for this scenario:
Much longer reach than passive copper cables in data centers
Lighter weight and more flexible than copper alternatives
Greater airflow and better signal integrity over distance
Immunity to electromagnetic interference
The rigorous production testing of AOCs ensures the best out-of-the-box installation experience, performance, and durability, which aligns with the certification's emphasis on validating cable signal quality.
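As a rough planning aid, the reach guidance above can be encoded in a small helper. The thresholds below are illustrative assumptions distilled from this explanation, not vendor specifications, so the actual transceiver and cable datasheets should always be consulted:

```python
# Hypothetical cabling-selection sketch based on the reach figures above.
# Thresholds are assumptions for illustration, not datasheet values.

DAC_MAX_M = 3    # assumed passive-copper DAC limit at NDR/400G speeds
MMF_MAX_M = 50   # multi-mode fiber AOC/transceiver reach per the text

def pick_cabling(distance_m: float) -> str:
    """Recommend a cabling class for a given leaf-to-spine distance."""
    if distance_m <= DAC_MAX_M:
        return "passive copper DAC"
    if distance_m <= MMF_MAX_M:
        return "multi-mode fiber AOC/transceiver"
    return "single-mode fiber transceiver"
```

For the 50-meter scenario in this question, the helper lands on the multi-mode fiber AOC/transceiver option, matching answer D.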
Incorrect: A. Wireless bridge adapters configured in a mesh topology to eliminate the need for physical cabling and reduce the complexity of the physical layer management.
This is incorrect because high-speed InfiniBand fabrics for AI workloads require physical cabling to achieve the necessary bandwidth (400Gbps), low latency, and signal integrity. Wireless technologies cannot meet the performance requirements for GPU collective operations in AI factories. The certification explicitly covers validating cable types and transceivers as essential physical layer tasks.
B. Passive Copper Direct Attach Cables which are cost-effective for long distances and provide the necessary flexibility for high-density rack cable management.
This is incorrect because passive copper DACs have severe distance limitations at high speeds. For 100Gb/s InfiniBand, AOCs provide “a much longer reach than passive copper cables in the data centers”. At 50 meters, passive copper signals would degrade significantly, causing high bit-error rates and performance issues, making them unsuitable for leaf-to-spine connections at this distance.
C. Standard Category 6 Ethernet cables using RJ45 connectors to leverage existing office networking infrastructure while maintaining compatibility with InfiniBand protocols.
This is incorrect because InfiniBand fabrics do not use standard Ethernet cabling with RJ45 connectors for high-speed data plane connections. InfiniBand requires specific transceivers and cabling (optical or direct-attach copper with appropriate connectors like QSFP). Category 6 cables cannot support the bandwidth or protocol requirements of InfiniBand NDR/400G networks.
Question 60 of 60
60. Question
After the physical installation of several H100 GPUs into a server, the administrator runs the command ‘nvidia-smi‘ and notices that one GPU is not appearing in the list, while the others show ‘P0‘ power states. According to the System and Server Bring-up domain, what is the most logical sequence of diagnostic steps to resolve this hardware detection issue?
Correct: A. Check the physical power cable connections to the GPU, inspect the PCIe slot for debris, and verify if the GPU is detected in the system BIOS or BMC hardware inventory.
This is correct because the NCP-AII certification blueprint explicitly includes “Identify and troubleshoot hardware faults (e.g., GPU, fan, network card)” and “Identify faulty cards, GPUs, and power supplies” as core tasks within the Troubleshoot and Optimize domain, which comprises 12% of the examination.
During System and Server Bring-up (31% of the exam), validating installed hardware and ensuring proper physical installation is a critical first step before software-level troubleshooting.
The logical sequence described follows standard hardware troubleshooting methodology:
First, check physical power cable connections to the GPU, as power delivery issues are a common cause of detection failures.
Second, inspect the PCIe slot for debris or damage that could prevent proper electrical contact.
Third, verify whether the GPU appears in the system BIOS or BMC hardware inventory, which confirms whether the issue is at the hardware detection level before investigating OS/driver problems.
The practice test materials specifically note that “Physical seating/cabling issues are common in bring-up. Validate connections before replacing hardware or changing software”.
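A minimal sketch of this triage order, with hypothetical step names, might record each physical check before any software-level change is attempted:

```python
# Illustrative triage checklist encoding the diagnostic order above.
# Step wording is a paraphrase for this sketch, not an official procedure.

TRIAGE_STEPS = (
    "check GPU power cable connections",
    "inspect PCIe slot for debris or damage and reseat the card",
    "verify GPU presence in BIOS/BMC hardware inventory",
    "only then investigate OS, driver, or fabric software",
)

def next_step(completed: int) -> str:
    """Return the next diagnostic step given how many are already done."""
    if completed >= len(TRIAGE_STEPS):
        return "escalate: replace hardware (RMA)"
    return TRIAGE_STEPS[completed]
```

The point of the ordering is that each hardware check is cheap and non-destructive, whereas the distractor options (OS reinstall, PSU voltage changes) are drastic and skip the detection-level evidence entirely.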
Incorrect: B. Reinstall the Linux operating system from scratch and then upgrade the Slurm workload manager to the latest version to force the GPU to appear.
This is incorrect because reinstalling the OS and upgrading Slurm are drastic measures that bypass proper diagnostic procedure. The NCP-AII troubleshooting methodology requires systematic identification of hardware faults before making software changes. A GPU not appearing in nvidia-smi while others show normal power states indicates a hardware detection issue, not an OS or workload manager problem. Slurm is a job scheduler and has no role in GPU device discovery.
C. Increase the voltage of the server‘s power supply units (PSUs) via the OOB interface to give the missing GPU more power so it can initialize.
This is incorrect because PSU voltage is not user-configurable via OOB interfaces in standard server hardware. Power supply units provide fixed voltages (e.g., 12V rails) within specified tolerances. Attempting to “increase voltage” could damage components. The NCP-AII blueprint covers “Validate power and cooling parameters” and “Identify faulty…power supplies”, which involves checking proper power delivery and connections, not modifying voltage output.
D. Disable the TPM module and remove all transceivers from the network cards to reduce the load on the PCIe bus, allowing the GPU to be discovered.
This is incorrect because the TPM (Trusted Platform Module) is a security component for cryptographic operations and platform integrity, and it has no impact on PCIe bus loading or GPU discovery. Removing network transceivers is an unrelated action that would not affect GPU detection. This approach does not follow the systematic hardware troubleshooting sequence required by the certification.