NVIDIA NCP-AII Practice Test 4
Question 1 of 60
An administrator is optimizing the performance of an AI cluster with a mix of Intel and AMD servers. They find that the storage throughput is lower than expected on the Intel nodes. Which optimization technique specifically targets the reduction of latency and CPU overhead for NVMe-based storage in an AI infrastructure?
Correct: A. Implementing NVIDIA GPUDirect Storage (GDS) to enable a direct DMA path between the storage and the GPU memory, bypassing the CPU.
The Technical Reason: In traditional I/O, data travels from storage to a "bounce buffer" in the CPU's system memory (RAM) before being copied again to the GPU's High-Bandwidth Memory (HBM). This creates two major bottlenecks: CPU overhead (interrupts and context switching) and PCIe congestion (traversing the CPU's memory bus).
- The GDS Solution: GDS uses Direct Memory Access (DMA) to create a straight path from the storage controller (or NIC for remote storage) to the GPU.
- The Benefit: It reduces latency by up to 50% and increases bandwidth by up to 2-4x, while simultaneously freeing up CPU cores to handle other tasks such as data preprocessing.
The NCP-AII Context: The exam validates your ability to identify and configure the nvidia-fs (NVFS) kernel module. This driver is the software "glue" that allows the Linux VFS (Virtual File System) to support GDS-capable file systems.
Incorrect: B. Increasing the size of the Linux swap file. Increasing the swap file is a memory-management tactic used when the system runs out of physical RAM. In an AI infrastructure, swapping to disk (even an NVMe drive) is catastrophic for performance. Furthermore, there is no such thing as "virtual GPU memory" created by a Linux swap file; GPU memory is a physical resource that cannot be extended into a disk-based swap partition.
C. Disabling PCIe Gen5 and using USB 2.0. This is technically regressive. PCIe Gen5 provides the high-bandwidth lanes (64 GB/s for an x16 slot) necessary for modern H100/A100 GPUs. USB 2.0 is limited to 480 Mbps (60 MB/s), which would create a bottleneck roughly 1,000 times slower than the native interface, effectively stalling any AI workload.
D. Replacing NVMe with high-capacity tape drives. While tape drives are excellent for long-term "cold" archival storage, they are unsuitable for the high-concurrency, random-access patterns of AI training. Tape has extremely high seek latency (minutes vs. microseconds for NVMe). Using them for active model training would leave the GPUs idle for the vast majority of the time.
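A quick way to check whether the GDS data path is even possible on a node is to confirm that the nvidia-fs kernel module is loaded (on a live system: `lsmod | grep nvidia_fs`, plus the gdscheck tool shipped with CUDA). The sketch below runs the same grep logic against a captured, hypothetical lsmod snippet so it is self-contained:

```shell
# Hypothetical lsmod output captured from a GDS-enabled node (illustrative only)
sample_lsmod='nvidia_fs 258048 0
nvidia 56731648 120'

# The check an admin would run as: lsmod | grep '^nvidia_fs'
if printf '%s\n' "$sample_lsmod" | grep -q '^nvidia_fs'; then
  echo "nvidia-fs module loaded: direct storage-to-GPU DMA path available"
else
  echo "nvidia-fs missing: I/O will fall back to the CPU bounce buffer"
fi
```

On real hardware you would follow this with the CUDA-bundled gdscheck utility to verify per-filesystem GDS support.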
Question 2 of 60
To confirm the correct operation of the East-West fabric, an administrator runs an NCCL all_reduce test across multiple nodes. The output shows that the bandwidth is exactly half of what is expected for the installed NDR-200 adapters. What is the most likely configuration error that would cause this specific result?
Correct: A. Only one out of the two available ports on the network adapter is connected or correctly configured for the compute fabric.
The NCP-AII certification blueprint emphasizes the importance of validating East/West fabric bandwidth as a core task within the Cluster Test and Verification domain.
When an NCCL all_reduce test shows bandwidth exactly half of what is expected for NDR-200 adapters, this is a classic symptom of a misconfigured multi-port network adapter.
Modern high-speed network adapters like NDR-200 typically provide two physical ports that must both be connected to the fabric to achieve full aggregate bandwidth.
NCCL is designed to detect and utilize multiple network interfaces to aggregate bandwidth for collective operations. If only one port is connected, NCCL will use only that single interface, resulting in exactly half the expected throughput.
The rail-optimized network architecture commonly used in AI factories requires all ports to be properly connected to their respective rail switches to enable parallel communication across multiple rails.
Incorrect: B. The Slurm scheduler is allocating only half of the CPU cores to the NCCL test, which bottlenecks the GPU communication.
This is incorrect because CPU core allocation does not directly affect NCCL inter-node bandwidth in this manner. NCCL uses GPUDirect RDMA to bypass the CPU for data transfers, so CPU core count would not halve the measured network bandwidth. While Slurm manages job allocation, NCCL bandwidth is determined by network interface configuration and PCIe topology, not CPU core count.
C. The nodes are using a single-rail network topology instead of the required multi-rail configuration for high-performance training.
This is incorrect because single-rail vs. multi-rail refers to the overall fabric architecture design, not the connection status of individual ports. The question describes an NCCL test showing exactly half the expected bandwidth, which points to a configuration issue with the specific adapter (one port disconnected), not the overall network topology choice.
D. The GPU drivers have been limited to PCIe Gen3 speeds in the BIOS to save on the cluster's energy costs.
This is incorrect because a PCIe Gen3 speed limitation would reduce bandwidth, but not necessarily by exactly half. The symptom of exactly half the expected bandwidth strongly indicates a missing second network port connection rather than a PCIe speed reduction, which would cause a different degradation pattern, not precisely 50%.
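The "exactly half" diagnosis can be automated: compare the bus bandwidth reported by the all_reduce test against the expected dual-port aggregate and flag results that cluster around 50%. The numbers below (50 GB/s expected, 24.8 GB/s measured) are hypothetical placeholders, not output from a real run:

```shell
expected_gbps=50      # assumed target busbw for a dual-port NDR-200 adapter (hypothetical)
measured_gbps=24.8    # hypothetical busbw reported by the all_reduce benchmark

# Ratio of measured to expected bandwidth, two decimal places
ratio=$(awk -v m="$measured_gbps" -v e="$expected_gbps" 'BEGIN{printf "%.2f", m/e}')
echo "measured/expected = $ratio"

# A result near 0.50 matches the single-active-port symptom
awk -v r="$ratio" 'BEGIN{ if (r+0 > 0.45 && r+0 < 0.55) print "Symptom matches a single active port (~50% of aggregate)" }'
```

A PCIe downgrade or congestion issue would typically land at some other fraction, which is why the 50% band is the discriminating signal here.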
Question 3 of 60
During the Cluster Test and Verification phase, an administrator uses NVIDIA ClusterKit. What is the primary function of this tool in the context of a multi-node AI factory assessment?
Correct: D. To perform a multifaceted assessment of node health, including PCIe bandwidth and GPU connectivity.
The Technical Reason: ClusterKit is a specialized utility bundled within the NVIDIA HPC-X software toolkit. It is designed to provide a "holistic" view of a node's readiness by automating several low-level tests that are otherwise tedious to run manually across a large fabric.
- Intra-node Health: It measures PCIe bandwidth and latency between the CPU and GPU (Host-to-Device/Device-to-Host).
- GPU Topology: It validates the NVLink or PCIe peer-to-peer (P2P) connectivity between GPUs within a single chassis.
- Compute Baselines: It can run quick GFLOPS checks and memory bandwidth tests to ensure that individual components are performing at their theoretical specifications.
- Fabric Readiness: Beyond the node, it assesses inter-node latency and bandwidth, providing a "pass/fail" verdict on whether the cluster can handle distributed collective operations like All-Reduce.
The NCP-AII Context: The exam expects you to know that ClusterKit is used as a pre-flight check. Before committing to a multi-day HPL or NCCL burn-in, an administrator runs ClusterKit to catch "silent" hardware degradations (such as a GPU that has fallen back to PCIe Gen3 speeds) that would otherwise skew the final benchmark results.
Incorrect: A. To replace the need for the Linux operating system. ClusterKit is a user-space application that runs on top of a standard Linux distribution (typically Ubuntu, RHEL, or Rocky Linux). It requires a functioning OS, an NVIDIA driver, and an MPI (Message Passing Interface) implementation to coordinate tests across multiple nodes.
B. To act as a primary compiler for CUDA C++ code. The primary compiler for NVIDIA GPU programming is nvcc (the NVIDIA CUDA Compiler), which is part of the CUDA Toolkit. While ClusterKit may use binaries compiled with nvcc, its purpose is diagnostic assessment, not software development or compilation.
C. To design the physical floor plan. Designing a data center's physical layout involves NVIDIA Air (for digital twin simulation) or standard CAD/BIM software. ClusterKit has no spatial awareness or architectural design capabilities; it only interacts with the logical and physical hardware interfaces once the servers are already racked and powered on.
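The pre-flight value of a ClusterKit-style sweep comes from comparing nodes against each other: a single node whose bandwidth lags its peers is the "silent degradation" you want to catch before a burn-in. The sketch below applies that outlier logic to a hypothetical per-node bandwidth summary (node names and values are invented for illustration):

```shell
# Hypothetical per-node bandwidth summary (GB/s), e.g. distilled from a fabric sweep
cat > /tmp/bw.txt <<'EOF'
node01 24.1
node02 24.0
node03 12.3
node04 23.9
EOF

# Two-pass awk: first pass finds the best node's bandwidth,
# second pass flags any node below 80% of that best value.
awk 'NR==FNR { if ($2 > max) max = $2; next }
     $2 < 0.8 * max { print $1, "DEGRADED at", $2, "GB/s" }' /tmp/bw.txt /tmp/bw.txt
```

Here node03 (roughly half its peers' bandwidth) would be pulled for inspection, e.g. for a link that negotiated down to a lower PCIe generation.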
Question 4 of 60
A data center team is configuring High Availability (HA) for the Base Command Manager head node. They want to ensure that if the primary head node fails, the secondary node can take over management of the cluster without manual intervention. What is a critical requirement for this HA setup in a BCM environment?
Correct: C. A shared storage system or database replication must be in place, along with a Virtual IP (VIP) address that can migrate between nodes.
The Technical Reason: To ensure a "zero-downtime" management environment, BCM uses an Active-Passive failover architecture:
Database Replication: The cluster's metadata (node states, user accounts, job history) is stored in a MariaDB/MySQL database. BCM performs real-time replication from the primary to the secondary head node.
Virtual IP (VIP): A single management IP address is assigned to the cluster. During a failover, this IP migrates from the primary node's interface to the secondary node's interface, so compute nodes and administrators do not need to change their connection settings.
Shared Storage: Directories like /home and /cm/shared must be accessible to both nodes (often via an external HA-NFS or local synchronization) so that software images and user data remain consistent regardless of which node is active.
The NCP-AII Context: The exam validates your ability to use the cmha-setup tool. This utility automates the configuration of the heartbeat network, the database cloning, and the VIP assignment.
Incorrect Options: A. Shared PCIe switch to sync internal memory states. BCM is a cluster-management software suite, not a real-time memory-mirroring hardware solution. While high-end GPUs use NVLink and PCIe switches for data movement, the head nodes (which manage the cluster) do not share GPU memory states for HA. Synchronization is handled at the application and database layers over standard Ethernet.
B. Different geographic regions via a 1 Gbps satellite link. BCM HA is designed for Local Area Network (LAN) environments. Because the heartbeat mechanism and database replication require low latency and high reliability, a satellite link would introduce too much jitter, likely causing a "split-brain" scenario in which both nodes believe they are the primary. Geographic redundancy is typically handled through Disaster Recovery (DR) strategies, not a standard BCM HA pair.
D. Compute nodes running in masterless mode. Compute nodes in an NVIDIA-certified cluster rely on the head node for critical services like LDAP/Active Directory authentication, DNS, and the Slurm scheduler. While some tasks might continue to run if a head node fails, the cluster cannot be managed and new jobs cannot be scheduled. The goal of HA is to ensure the master is always available, not to make the nodes "masterless."
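The Active-Passive failover decision described above can be reduced to two inputs: has the heartbeat from the primary been lost, and is the replicated database fresh enough to promote safely? The toy sketch below models that decision only; the variable values and the 5-second lag threshold are invented for illustration, and real BCM failover is driven by its own heartbeat/cmha machinery, not a script like this:

```shell
heartbeat_ok=0        # hypothetical: 1 = heartbeat received from primary, 0 = lost
repl_lag_seconds=1    # hypothetical replication lag of the secondary's database copy
max_lag=5             # hypothetical promotion threshold (seconds)

# Promote the secondary (and migrate the VIP) only when the primary is gone
# AND the secondary's database copy is current enough to take over safely.
if [ "$heartbeat_ok" -eq 0 ] && [ "$repl_lag_seconds" -le "$max_lag" ]; then
  echo "FAILOVER: promote secondary head node and migrate VIP"
else
  echo "HOLD: no failover action"
fi
```

The second condition is the point of the database-replication requirement: without an up-to-date replica, the VIP would migrate to a node holding stale cluster metadata.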
Question 5 of 60
When configuring Multi-Instance GPU (MIG) for a High-Performance Computing (HPC) workload that requires high memory bandwidth, the administrator must choose between different slice sizes. If an H100 GPU is being partitioned, what is the maximum number of GPU instances (GIs) that can be created, and what is the primary benefit of this isolation for a multi-user environment?
Correct
Correct: D 7 instances; providing dedicated high-bandwidth memory and compute to each user. The Technical Reason: The NVIDIA H100 (and A100) GPU architecture is physically divided into 7 GPU slices. ? Maximum Instances: Because there are 7 sets of hardware resources (compute, cache, and memory controllers), the maximum number of fully isolated GPU Instances (GIs) that can be created is 7 (using the 1g.10gb or 1g.20gb profiles, depending on the H100 model). ? Primary Benefit (Isolation): Unlike software-based sharing, MIG provides Fault Isolation and QoS (Quality of Service). Each instance has its own dedicated path to the High-Bandwidth Memory (HBM) and its own set of SMs (Streaming Multiprocessors). This ensures that a “noisy neighbor“ (a user running a massive kernel) cannot affect the latency or throughput of other users on the same physical card. The NCP-AII Context: The exam validates your understanding of the “7-Slice Rule.“ While you can have fewer instances (e.g., one large 7g.80gb instance or two 3g instances), you cannot exceed 7 physical hardware partitions.
Incorrect: A. 32 instances; primarily used for VDI This describes NVIDIA vGPU (Virtual GPU) software, not MIG hardware partitioning. vGPU can support a high density of users (up to 32 or more) for virtual desktops, but it relies on time-slicing the scheduler rather than physical hardware partitioning of the memory controllers and cache.
B. 16 instances; for small Python scripts There is no hardware configuration on the H100 that allows for 16 MIG instances. While you could use NVIDIA MPS (Multi-Process Service) to run 16+ scripts on a single GPU (or within a single MIG instance), those scripts would share the same hardware resources and lack the physical isolation provided by the 7-instance MIG limit.
C. 2 instances; ensuring 40GB of memory While you can create two 3g.40gb instances on an 80GB H100, this is not the maximum number of instances possible. The question asks for the maximum number of GIs. Limiting the GPU to only 2 instances would underutilize the partitioning capability if the goal is to serve a high-concurrency multi-user environment.
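The 7-slice budget described above can be sketched as a quick sanity check. This is an illustrative helper, not an NVIDIA tool: it only enforces the necessary condition that the requested profiles' compute slices sum to at most 7, while real MIG placement (e.g., via `nvidia-smi mig -cgi`) imposes additional geometry constraints.

```python
import re

MAX_SLICES = 7  # physical GPU slices on A100/H100

def slices_needed(profile: str) -> int:
    """Parse the compute-slice count from a MIG profile name like '3g.40gb'."""
    m = re.match(r"(\d+)g\.", profile)
    if not m:
        raise ValueError(f"unrecognized MIG profile: {profile}")
    return int(m.group(1))

def fits_on_gpu(profiles: list[str]) -> bool:
    """True if the requested profiles fit within the 7-slice hardware budget."""
    return sum(slices_needed(p) for p in profiles) <= MAX_SLICES

# Seven 1g instances fit; two 3g plus one 2g (8 slices) exceed the budget.
print(fits_on_gpu(["1g.10gb"] * 7))                    # True
print(fits_on_gpu(["3g.40gb", "3g.40gb", "2g.20gb"]))  # False
```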
Question 6 of 60
6. Question
A technician is tasked with performing a firmware upgrade on an NVIDIA HGX system as part of the initial server bring-up. The process involves updating the GPU firmware, the NVSwitch firmware, and the BMC. What is the most critical step to ensure that the firmware installation is successful and that the system recognizes all hardware components correctly after the reboot?
Correct: B. Verify the digital signatures of the firmware packages and perform a full AC power cycle or cold boot if required by the specific component instructions.
The Technical Reason:
Digital Signatures: NVIDIA-certified systems utilize Secure Boot and Hardware Root of Trust. Installing unsigned or corrupted firmware can brick a component or cause a security violation that prevents the system from booting. Verifying signatures (often automated by tools like nvsm or mlxfwmanager) ensures the integrity of the update.
Cold Boot/AC Power Cycle: Many firmware updates for the NVSwitch and GPU Baseboard do not take effect during a simple “warm“ OS reboot (where the motherboard stays powered). A Cold Boot or a full AC Power Cycle (removing and restoring power) is often required to reset the hardware‘s electrical state and force the complex PCIe and NVLink fabric to re-initialize with the new microcode.
The NCP-AII Context: The exam validates your ability to follow the NVIDIA Service Manual. For HGX systems, the “Golden Rule“ is that certain firmware components (like the BMC or the NVSwitch tray) require a cold reset to ensure the internal management controller correctly handshakes with the new software version.
Incorrect Options: A. Perform a warm reboot immediately A warm reboot (sudo reboot) only restarts the operating system. It does not power-cycle the underlying GPU baseboard or the NVSwitch logic. If the firmware update requires a hardware reset to load the new image into the FPGA or EEPROM, a warm reboot will result in the hardware still running the old version, or worse, entering an inconsistent “half-updated“ state that leads to training failures.
C. Change BIOS to legacy boot mode NVIDIA-certified systems (especially those using H100/H200 GPUs) require UEFI mode. Legacy BIOS mode does not support the large BAR (Base Address Register) sizes required for modern GPUs or the security features (Secure Boot/TPM) integrated into the AI infrastructure. Switching to legacy mode would likely prevent the system from seeing the GPUs entirely.
D. Disconnect network cables from the OOB port Disconnecting the Out-of-Band (OOB) port is counterproductive. Most firmware updates on an HGX system are performed through the BMC via the OOB network using tools like the Web UI or Redfish API. Furthermore, the server does not “force“ online updates during a manual flash; the administrator has full control over the source of the firmware file.
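The integrity-check step can be illustrated with a minimal digest comparison. This is a hedged sketch: production tools such as nvfwupd validate cryptographic signatures against the hardware Root of Trust, whereas this example only compares a SHA-256 checksum against a vendor-published value, and the file name is hypothetical.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large firmware images never sit fully in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_firmware(path: Path, expected_digest: str) -> bool:
    """Flash only if the computed digest matches the vendor-published one."""
    return sha256_of(path) == expected_digest.lower()

# Demo with a throwaway file; a real check uses the digest from the release notes.
demo = Path("demo_fw.bin")
demo.write_bytes(b"firmware image contents")
expected = hashlib.sha256(b"firmware image contents").hexdigest()
print(verify_firmware(demo, expected))  # True
demo.unlink()
```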
Question 7 of 60
7. Question
When performing the initial bring-up of an NVIDIA HGX system within a large-scale AI factory, an administrator must ensure that the firmware versions across all components are synchronized for stability. During the firmware upgrade process for the GPU complex, which specific utility should be prioritized to verify the current firmware versions of the NVIDIA GPUs and the NVSwitch fabric before proceeding with a production-level update using the NVIDIA Firmware Update tool?
Correct: A. The NVIDIA System Management Interface nvidia-smi command.
The Technical Reason: nvidia-smi is the primary and most accessible utility for checking the current status of the GPU complex.
GPU Firmware: Running nvidia-smi -q provides a detailed report including the VBIOS version (firmware) for every GPU in the system.
NVSwitch Fabric: On HGX systems, nvidia-smi can report on the status and versions of the integrated NVSwitch fabric. Furthermore, more granular fabric information can be retrieved via nvidia-smi nvlink -s to ensure all links are active and running at the correct firmware-defined speeds.
Verification before Update: Before using a production-level update tool (like nvfwupd or nvsm), an administrator uses nvidia-smi to establish a “baseline“ of the currently loaded versions to confirm which components require the update.
The NCP-AII Context: The exam validates your proficiency with the NVIDIA software stack. While other tools exist, nvidia-smi is the “standard“ tool identified in the NCP-AII blueprint for initial configuration, validation, and status checks of GPU-based servers.
Incorrect Options: B. The standard Linux dmidecode utility dmidecode reads the system‘s DMI (SMBIOS) table. While it can provide information about the server‘s motherboard, CPU, and RAM, it is a generic Linux tool and does not have the specialized capability to query the internal firmware of NVIDIA GPUs or the NVSwitch fabric directly. It cannot see the VBIOS versions required for this task.
C. The ipmitool sensors command ipmitool is used to interact with the BMC (Baseboard Management Controller) via the OOB network. While it is excellent for monitoring temperatures, voltages, and fan speeds (sensors), it typically does not report the specific VBIOS versions of the GPUs or the internal microcode of the NVSwitch chips. For firmware versioning, one would use the BMC's inventory or Redfish interface, not the sensors command.
D. The NVIDIA Fabric Manager status dashboard The NVIDIA Fabric Manager is a background service responsible for initializing and maintaining the NVLink fabric. While its logs or a potential web dashboard might show if the fabric is “Up“ or “Down,“ it is not the primary tool used by a technician to query and verify component firmware versions during a bring-up procedure. nvidia-smi is the more direct, command-line standard for this specific validation step.
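Establishing that version baseline is often scripted. A minimal sketch, assuming output shaped like the abbreviated `nvidia-smi -q` sample below (the real report contains far more fields, and the bus IDs and version strings here are fabricated):

```python
import re

# Abbreviated, fabricated sample resembling `nvidia-smi -q` output.
SAMPLE = """\
GPU 00000000:17:00.0
    Product Name                          : NVIDIA H100 80GB HBM3
    VBIOS Version                         : 96.00.74.00.01
GPU 00000000:2A:00.0
    Product Name                          : NVIDIA H100 80GB HBM3
    VBIOS Version                         : 96.00.61.00.03
"""

def vbios_versions(report: str) -> dict[str, str]:
    """Map each GPU's PCI bus ID to its reported VBIOS version."""
    versions, current = {}, None
    for line in report.splitlines():
        gpu = re.match(r"GPU (\S+)", line)
        if gpu:
            current = gpu.group(1)
        elif "VBIOS Version" in line and current:
            versions[current] = line.split(":", 1)[1].strip()
    return versions

vers = vbios_versions(SAMPLE)
print(vers)
# A mismatch across GPUs flags which cards still need the firmware update.
print(len(set(vers.values())) == 1)  # False: the two versions differ
```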
Question 8 of 60
8. Question
A data scientist reports that their distributed training job is running 50% slower than usual. The administrator uses ‘mlnx_perf‘ and ‘nvidia-smi‘ to troubleshoot. They see high ‘retransmission rates‘ on the network and ‘Power Brake‘ events on the GPUs. What is the most likely root cause of these combined symptoms?
Correct: D. A failing Power Supply Unit (PSU) is causing the GPUs to throttle (Power Brake) and the network switch to drop packets due to unstable voltage.
This is correct because the NCP-AII certification blueprint explicitly includes "Identify faulty cards, GPUs, and power supplies" as a core task within the Troubleshoot and Optimize domain, which comprises 12% of the examination.
The combination of two distinct symptoms, network retransmissions (detected by mlnx_perf) and GPU "Power Brake" events (detected by nvidia-smi), points to a common root cause affecting multiple subsystems.
A failing or unstable Power Supply Unit (PSU) can cause voltage fluctuations that simultaneously impact both the GPUs and the network infrastructure:
GPUs: When voltage drops below required levels, the GPU power management system triggers "Power Brake" events, which throttle GPU performance to prevent system instability or hardware damage.
Network: Unstable voltage to network switches or NICs causes intermittent link issues, resulting in packet loss and high retransmission rates observed in mlnx_perf.
This scenario represents a classic hardware fault identification case where multiple seemingly unrelated symptoms trace back to a single failing component, aligning with the troubleshooting methodology emphasized in the certification.
Incorrect: A. The users are using the wrong font in their Jupyter notebooks, which is causing the GPU to work harder to render the text.
This is incorrect because Jupyter notebook font selection has no impact on GPU compute performance or network retransmission rates. GPUs are designed for compute workloads, not text rendering in web interfaces. This option completely misunderstands GPU functionality and has no basis in NVIDIA diagnostic methodology.
B. The GPUs are waiting for a software update from the Windows Update service, which is blocking the InfiniBand fabric.
This is incorrect for multiple reasons. First, AI clusters running H100 GPUs use Linux-based operating systems, not Windows. Second, Windows Update does not run on Linux servers and cannot block InfiniBand fabrics. Third, software updates do not manifest as “Power Brake“ events or network retransmissions in the manner described.
C. The Slurm scheduler has been set to ‘slow mode‘ by the administrator to save on the cluster‘s monthly electricity bill.
This is incorrect because Slurm does not have a configurable ‘slow mode‘ for power saving. Slurm is a workload manager for job scheduling, not a power management tool. While power saving can be configured through other mechanisms, the symptoms of GPU Power Brake events and network retransmissions indicate hardware-level issues, not scheduler configuration changes.
Question 9 of 60
9. Question
While installing GPU-based servers and validating hardware using NVIDIA System Management Interface (nvidia-smi), an administrator notices that one GPU in a multi-GPU HGX system reports a significantly lower power limit than the others. The system is currently in the validation phase before production workloads. What is the most likely cause related to the physical installation or power parameters that would trigger this behavior?
Correct: B. Incomplete seating of one of the redundant Power Supply Units (PSUs).
The Technical Reason: In an HGX system, the Baseboard Management Controller (BMC) monitors the total available power from the PSU grid. If a PSU is not fully seated or is disconnected, the system enters a "degraded power state."
Power Capping: To prevent the entire system from shutting down due to an over-current event on the remaining functional PSUs, the BMC/firmware will automatically enforce a lower power limit on the GPUs.
Individual GPU Reporting: While power capping often affects the whole system, certain failure modes or specific power phases on the HGX baseboard can result in specific GPUs being throttled more aggressively than others to stay within the reduced power budget.
Observation via SMI: You would see this in nvidia-smi under the "Enforced Power Limit" or "Power Cap" field, where one GPU might show 300W while the others show the default 700W.
The NCP-AII Context: The exam tests your ability to link Physical Layer issues (PSU seating) with Software Monitoring tools (nvidia-smi). A common "trap" is assuming it is a software driver issue when it is actually a physical power delivery constraint.
Incorrect: A. GPU partitioned into MIG instances Partitioning a GPU into Multi-Instance GPU (MIG) slices divides the compute (SMs) and memory, but it does not lower the physical power limit of the base GPU. The nvidia-smi output would show the MIG profiles, but the aggregate power limit for the physical device would remain at its factory-defined TGP unless manually throttled for other reasons.
C. NVIDIA Container Toolkit not installed The NVIDIA Container Toolkit is a software layer that allows containers to interface with the GPU driver. Its presence (or absence) has no effect on the hardware-level power limits reported by the driver to nvidia-smi. The driver and VBIOS handle power limits entirely at the host/firmware level.
D. CUDA version incompatible with the driver CUDA/Driver incompatibility typically results in functional errors (e.g., “CUDA version too old“ or “Unable to initialize NVML“). It does not manifest as a hardware-level power limit reduction on a single GPU. Power limits are managed by the NVIDIA Kernel Driver and the GPU VBIOS, independent of the user-space CUDA libraries.
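Spotting the capped GPU is easy to automate. A minimal sketch, assuming CSV output shaped like `nvidia-smi --query-gpu=index,power.limit --format=csv,noheader` (the sample values below are fabricated):

```python
# Fabricated sample in the style of:
#   nvidia-smi --query-gpu=index,power.limit --format=csv,noheader
CSV_SAMPLE = """\
0, 700.00 W
1, 700.00 W
2, 300.00 W
3, 700.00 W
"""

def power_limits(csv_text: str) -> dict[int, float]:
    """Map GPU index to its enforced power limit in watts."""
    limits = {}
    for line in csv_text.strip().splitlines():
        idx, watts = line.split(",")
        limits[int(idx)] = float(watts.replace("W", "").strip())
    return limits

def capped_gpus(limits: dict[int, float], slack: float = 0.9) -> list[int]:
    """Flag GPUs whose limit sits well below the highest limit in the system."""
    top = max(limits.values())
    return [i for i, w in limits.items() if w < slack * top]

limits = power_limits(CSV_SAMPLE)
print(capped_gpus(limits))  # [2]: this GPU is capped at 300 W while its peers allow 700 W
```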
Question 10 of 60
10. Question
A technician is performing a single-node stress test on an NVIDIA HGX system using the High-Performance Linpack (HPL) benchmark. The test completes, but the GFLOPS achieved are significantly lower than the theoretical peak for 8x H100 GPUs. Upon investigating the logs, they see thermal throttling events. What should be the next step in the verification process to isolate whether this is a hardware fault or an environmental issue?
Correct: D Run a burn-in test using the NeMo Framework or a similar stress tool while monitoring the GPU temperatures and fan speeds via nvidia-smi dmon to see if the cooling system is maintaining the delta-T.
The Technical Reason: HPL is a synthetic benchmark. To validate whether the thermal issues persist in a "real-world" scenario, a high-load workload like NVIDIA NeMo (for Large Language Model training) is used.
Monitoring with dmon: Using nvidia-smi dmon -s tp (temperature and power) allows the technician to watch the relationship between GPU heat and fan response in real time.
Delta-T (ΔT): The goal is to verify whether the server's cooling system is successfully removing heat. If the fans are at 100% and the GPUs are still hitting 85°C+, it suggests an environmental issue (hot-aisle air recirculation). If only one GPU is hot while the others are cool, it suggests a hardware fault (bad heat sink seating).
The NCP-AII Context: The exam emphasizes using the NVIDIA software stack (NeMo, SMI, DCGM) to validate hardware before moving to expensive physical replacements.
Incorrect: A. Lower the HPL problem size (N) to 1000 Lowering the problem size (N) reduces the computational intensity. While the test will finish faster and generate less heat, it will not give an accurate representation of "peak performance." Small N values are dominated by overhead and do not saturate the Tensor Cores, making the benchmark useless for performance validation.
B. Immediately replace all eight GPUs This is the most inefficient and costly response. Thermal throttling is a protective firmware feature, not a sign of "permanent silicon failure." Most thermal issues are resolved by checking the server's fan health, ensuring the room's ambient temperature is within the ASHRAE A1/A2 envelope, or re-seating the HGX tray.
C. Disable the fans in the BIOS This is dangerous and technically counterproductive. Modern HGX systems will emergency shut down (thermal trip) within seconds of starting an HPL run if fans are disabled. Furthermore, the power saved by turning off fans is negligible compared to the 700W drawn by a single H100 GPU; the lack of cooling would immediately trigger the "Thermal Slowdown" clock-capping.
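The environmental-versus-hardware decision rule described above can be expressed as a small triage function over per-GPU temperature samples (the kind of data `nvidia-smi dmon` streams). The temperatures and the 85°C throttle threshold below are illustrative assumptions, not vendor-specified values.

```python
def classify_thermal_issue(gpu_temps_c, throttle_threshold=85):
    """Rough triage: all GPUs hot -> environmental; one outlier -> hardware fault."""
    hot = [i for i, t in enumerate(gpu_temps_c) if t >= throttle_threshold]
    if not hot:
        return "no throttling"
    if len(hot) == len(gpu_temps_c):
        return "environmental (all GPUs hot: check airflow and ambient temperature)"
    return f"hardware fault suspected on GPU(s) {hot} (check heat sink seating)"

# Seven GPUs in the low 60s, one at 93 C: points at a seating/cooling fault on that GPU.
print(classify_thermal_issue([62, 61, 93, 60, 63, 62, 61, 60]))
```

In practice you would collect several samples under sustained load before classifying, since a single reading can catch a transient spike.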
Question 11 of 60
11. Question
An administrator is optimizing the performance of an AI cluster consisting of both AMD and Intel-based servers. Which optimization technique is most relevant for ensuring that the GPUs in these servers can access system memory with the lowest possible latency and highest bandwidth?
Correct: A Enabling 'Resizable BAR' (Base Address Register) in the BIOS to allow the CPU to access the entire GPU frame buffer over the PCIe bus.
The Technical Reason: Traditionally, the CPU could only access a small 256MB "aperture" of the GPU's memory at a time. This caused significant overhead as the system had to constantly move this "window" around to access different parts of the GPU VRAM.
The Optimization: Resizable BAR (also known as Large BAR support) allows the entire GPU frame buffer (e.g., 80GB on an H100) to be mapped into the CPU's memory address space.
The Benefit: This eliminates the 256MB bottleneck, reducing CPU overhead and latency while significantly increasing the bandwidth for data transfers between the system RAM and GPU memory. This is critical for Large Language Model (LLM) training and high-concurrency inference.
The NCP-AII Context: The exam validates your ability to configure BIOS/UEFI settings for AI-ready servers. Enabling Above 4G Decoding and Resizable BAR are mandatory steps in the "Bring-up" checklist for any NVIDIA-Certified System.
Incorrect: B. Decreasing the size of the system swap file While reducing swap usage can prevent disk thrashing, it does not improve the latency or bandwidth of the communication path between the CPU and the GPU. If the system runs out of physical RAM, decreasing the swap file would likely lead to "Out of Memory" (OOM) errors and application crashes rather than an optimization of GPU data ingestion.
C. Setting GPU fans to 100% speed This is a "brute-force" approach to cooling. While it prevents thermal throttling, it does not affect the logical or physical communication bandwidth of the PCIe bus. Modern NVIDIA GPUs use sophisticated firmware-managed thermal profiles; locking fans at 100% increases power consumption and mechanical wear without addressing the data-transfer bottleneck described in the prompt.
D. Using a slower rotational disk tier This would be counterproductive. AI workloads require massive data ingestion rates. Moving from high-speed NVMe/SSD storage to rotational disks (HDD) would create a massive I/O bottleneck, starving the GPUs of training data and significantly increasing the time-to-solution. Reducing rack heat at the cost of crippling performance is not an optimization strategy in the NCP-AII curriculum.
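One way to verify Resizable BAR from the OS is to read the "BAR1 Memory Usage" section of an `nvidia-smi -q` report: a 256 MiB BAR1 indicates the legacy small aperture, while a BAR1 sized near the full frame buffer indicates large-BAR mapping. The sketch below parses a report in that style; the sample text is illustrative, not real output.

```python
# Illustrative fragment of an `nvidia-smi -q` style report (not real output).
large_bar_report = """\
    BAR1 Memory Usage
        Total                             : 131072 MiB
"""
legacy_report = """\
    BAR1 Memory Usage
        Total                             : 256 MiB
"""

def bar1_total_mib(report):
    """Extract the BAR1 'Total' value in MiB from the report text."""
    for line in report.splitlines():
        if "Total" in line:
            return int(line.split(":")[1].strip().split()[0])
    return None

def resizable_bar_enabled(report, legacy_aperture_mib=256):
    """Heuristic: a BAR1 larger than the legacy 256 MiB aperture implies large BAR."""
    return bar1_total_mib(report) > legacy_aperture_mib

print(resizable_bar_enabled(large_bar_report))   # → True
print(resizable_bar_enabled(legacy_report))      # → False
```

If this check fails after enabling the BIOS option, confirm that Above 4G Decoding is also enabled, since large BAR mappings require address space above 4GB.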
Question 12 of 60
12. Question
An IT engineer is configuring the initial parameters for third-party storage to be used with an NVIDIA HGX cluster. To achieve maximum throughput and minimum latency for GPUDirect Storage (GDS), which specific storage protocol and configuration setting on the host-side should be prioritized to ensure the most efficient data path to GPU memory?
Correct: C Enable NVMe-over-Fabrics (NVMe-oF) with RDMA support and ensure the nvidia-fs kernel module is loaded and configured on the compute nodes.
The Technical Reason: To achieve the "Direct" in GPUDirect Storage, the data path must remain in the hardware domain as much as possible.
NVMe-oF with RDMA: Remote Direct Memory Access (RDMA) is the underlying transport that allows data to move from the storage controller's memory directly to the GPU's memory over the network (InfiniBand or RoCE) without involving the host CPU's cycles or memory bus.
nvidia-fs Kernel Module: This is the core software component (the NVFS driver) that orchestrates the I/O. It intercepts file system calls and coordinates with the storage driver to ensure the DMA transfer targets the GPU's address space. Without this module loaded, the system will fall back to "Compatibility Mode," which routes data through the CPU.
The NCP-AII Context: The exam validates your ability to "Configure initial parameters for third-party storage." You are expected to know that for high-performance AI clusters, a parallel filesystem (like Weka, Lustre, or VAST) using RDMA and the nvidia-fs driver is the standard "Golden Path" for data.
Incorrect: A. Standard NFSv3 and disable RDMA NFSv3 is a legacy protocol that relies heavily on the CPU for packet processing and uses a "bounce buffer" in system RAM. Disabling RDMA further ensures that every byte of data must be copied by the CPU before it reaches the GPU. This is the opposite of a GDS optimization and would result in the highest latency possible.
B. Local SATA SSD and disable parallel filesystem While local storage is fast, SATA SSDs are limited by the AHCI protocol, which is significantly slower than NVMe. Furthermore, disabling a parallel filesystem in an HGX cluster makes distributed training impossible, as there would be no shared namespace for the training data across the multiple nodes in the AI Factory.
D. iSCSI over 1GbE management network A 1GbE network provides approximately 125 MB/s of bandwidth. A single NVIDIA H100 GPU can ingest data at speeds exceeding 50 GB/s. Using a management network for storage would result in a massive bottleneck, starving the GPUs and causing the training job to stall indefinitely. Additionally, standard iSCSI does not support the RDMA extensions required for GDS.
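A quick host-side check for the prerequisite modules is to scan `/proc/modules` (the data behind `lsmod`) for `nvidia_fs` and the RDMA transport driver. The sketch below runs against a hardcoded sample so it is self-contained; on a real node you would read `/proc/modules` directly. The sample content is illustrative.

```python
# Illustrative /proc/modules content (fields: name, size, refcount, deps, state, addr).
sample_proc_modules = """\
nvidia_fs 323584 0 - Live 0x0000000000000000
nvme_rdma 45056 0 - Live 0x0000000000000000
nvidia 56721408 120 nvidia_fs, Live 0x0000000000000000
"""

def module_loaded(proc_modules_text, name):
    """True if the named kernel module appears in /proc/modules-style text."""
    return any(line.split()[0] == name
               for line in proc_modules_text.strip().splitlines())

print(module_loaded(sample_proc_modules, "nvidia_fs"))  # → True
print(module_loaded(sample_proc_modules, "bogus_mod"))  # → False
```

If `nvidia_fs` is absent, GDS I/O silently degrades to compatibility mode through the CPU; the GDS package also typically ships a `gdscheck` utility that reports the same status in more detail.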
Question 13 of 60
13. Question
While configuring third-party storage for an AI factory, the administrator must ensure the storage parameters are optimized for high-throughput data ingestion. Which initial configuration parameter is most critical for a storage system that will serve datasets to NVIDIA DGX nodes via an InfiniBand network using GPUDirect Storage (GDS)?
Correct: B Enable RDMA (Remote Direct Memory Access) support on the storage controllers and ensure the storage is on a compatible InfiniBand or RoCE subnet.
The Technical Reason: GDS relies fundamentally on RDMA to move data.
Direct Path: RDMA allows the storage controller to write data directly into the GPU's memory (HBM) without involving the host CPU's memory (RAM) or the OS kernel.
Fabric Compatibility: For this to work, the storage must be "on-fabric" (InfiniBand or Ethernet with RoCE) so the ConnectX-7 adapters can coordinate the transfer. Without RDMA, the system falls back to traditional "bounce-buffered" I/O, which consumes significant CPU cycles and limits throughput to roughly 50% of the hardware's potential.
The NCP-AII Context: The exam validates your ability to configure "AI-ready" storage. This includes ensuring the nvidia-fs kernel module is loaded and that the storage protocol (such as NVMe-oF or a GDS-enabled parallel filesystem like Lustre or Weka) is configured to use the RDMA verbs layer.
Incorrect Options: A. TPM-based authentication for every block TPM (Trusted Platform Module) is used for hardware-level security, such as disk encryption keys or secure boot. Requiring a TPM handshake for every block of training data (which involves billions of blocks) would introduce catastrophic latency and effectively kill the performance of an AI cluster. GDS security is typically handled at the network/fabric layer, not per-block via TPM.
C. Disable PCIe Peer-to-Peer (P2P) This is a "trap" option. P2P communication is actually a requirement for GDS. Disabling P2P in the BIOS prevents the NIC and the GPU from talking directly to each other over the PCIe bus. If P2P is disabled, GDS cannot function, and all data must be "bounced" through the CPU, which is exactly what an administrator tries to avoid in an AI factory.
D. Legacy NFS v3 with no multi-pathing NFS v3 is a legacy, single-threaded protocol that does not natively support RDMA in a way that benefits GDS. Furthermore, avoiding "multi-pathing" would create a single point of failure and a massive bandwidth bottleneck. Modern NVIDIA-certified storage solutions use parallel filesystems or NVMe-oF with multi-pathing to saturate the 400Gbps links of a DGX H100.
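On the compute node, the RDMA side of GDS is steered by the cuFile configuration file (`/etc/cufile.json`). A minimal sketch of the relevant fragment is below, built as a Python dict for illustration. The key names (`allow_compat_mode`, `rdma_dev_addr_list` under `properties`) follow the GDS configuration documentation as I recall it; verify them against the `cufile.json` shipped with your GDS install, and note the IP addresses are hypothetical placeholders for the node's RDMA-capable NIC interfaces.

```python
import json

# Assumed cufile.json fragment: disable compatibility-mode fallback so a broken
# RDMA path fails loudly instead of silently bouncing through the CPU, and list
# the (hypothetical) RDMA NIC addresses GDS may use.
cufile_cfg = {
    "properties": {
        "allow_compat_mode": False,
        "rdma_dev_addr_list": ["192.168.0.11", "192.168.0.12"],
    }
}

print(json.dumps(cufile_cfg, indent=2))
```

Leaving compatibility mode enabled is safer during bring-up (I/O still works without RDMA), but disabling it is a useful way to prove the direct path is actually being exercised.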
Question 14 of 60
14. Question
A Linux administrator is installing the NVIDIA Container Toolkit on a fresh Ubuntu installation to support Docker-based AI training workloads. After successfully installing the package, what is the mandatory next step to ensure the Docker daemon can utilize the NVIDIA GPU runtime correctly for the user applications?
Correct: C Edit the daemon.json file in the docker directory to set the default-runtime to nvidia and restart the Docker service on the host.
The Technical Reason: Simply installing the nvidia-container-toolkit package does not automatically tell Docker how to use it.
Registration: The toolkit must be registered as a valid runtime within Docker's configuration (typically /etc/docker/daemon.json).
Automation: While this can be done manually, NVIDIA provides the nvidia-ctk utility (e.g., sudo nvidia-ctk runtime configure --runtime=docker) to automate the injection of the nvidia runtime definition into the JSON file.
Default Runtime: By setting "default-runtime": "nvidia", the administrator ensures that every docker run command automatically utilizes the NVIDIA GPU hooks without requiring the user to explicitly pass the --runtime=nvidia flag every time.
Service Restart: Like any change to the Docker daemon configuration, a sudo systemctl restart docker is required for the changes to take effect.
The NCP-AII Context: The exam expects you to know the post-installation workflow. This includes the configuration of the daemon and the subsequent validation using docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi.
Incorrect Options: A. Run nvidia-smi --factory-reset: This command is used to reset the GPU's volatile state (like clearing double-bit ECC errors or resetting clock offsets) to factory defaults. It has no impact on Docker's ability to "see" or "use" the GPU. It is a troubleshooting step for hardware state, not a configuration step for the container runtime.
B. Install the DOCA SDK and map via virtual PCIe switch: While DOCA (Data Center Infrastructure-on-a-Chip Architecture) is critical for BlueField DPUs, it is not a requirement for standard GPU containerization on a host. Mapping GPUs via "virtual PCIe switches" refers to complex VM-based passthrough (like vGPU or SR-IOV), which is not the standard procedure for a native Linux Docker installation.
D. Recompile the Linux kernel with CUDA-SUPPORT: This is a "trick" answer. There is no CUDA-SUPPORT flag in the standard Linux kernel. NVIDIA's GPU support is provided by a loadable kernel module (the NVIDIA driver) that is installed alongside the kernel, not compiled into it. Recompiling the kernel is unnecessary and would likely break the existing driver installation.
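For reference, a daemon.json that registers the nvidia runtime and makes it the default typically ends up looking like the sketch below (the path value assumes nvidia-container-runtime is resolvable on the daemon's PATH, and a real file may carry additional keys):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

Running sudo nvidia-ctk runtime configure --runtime=docker injects the "runtimes" entry for you; the "default-runtime" key is what removes the need for per-command --runtime flags.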
Question 15 of 60
15. Question
A systems engineer is tasked with the initial physical bring-up of an NVIDIA HGX H100 system within a new AI factory. After ensuring the rack is properly powered and the cooling parameters are within the specified operational range, what is the most critical sequence of events for validating the hardware and performing the initial firmware configuration to ensure all GPUs are correctly recognized by the Baseboard Management Controller?
Correct: D Configure the OOB management network first, then access the BMC to perform initial firmware synchronization across the HGX baseboard and NVSwitches before deploying the OS.
The Technical Reason:
OOB and BMC Stability: The Baseboard Management Controller (BMC) is the "root of trust" for the server. It acts as an independent processor that monitors power, thermal sensors, and hardware inventory even when the main CPUs are powered off.
Firmware Synchronization: HGX systems are complex assemblies where the GPUs, NVSwitches, and the motherboard tray must have compatible firmware versions to "handshake" correctly. Performing this synchronization via the BMC ensures that when the system finally boots into an OS, the PCIe enumeration and NVLink training occur without errors.
Hardware Recognition: The BMC is responsible for the "Early Power On" sequence. If the BMC is not correctly configured or its firmware is out of sync, it may fail to power on the GPU tray or report "Missing" GPUs to the OS.
The NCP-AII Context: The exam validates your ability to "Perform initial configuration of BMC and OOB." It emphasizes that the BMC is the primary tool for initial hardware validation and that firmware updates (using tools like nvfwupd via Redfish) should occur before production software installation.
Incorrect Options: A. Install Container Toolkit on a temporary OS to scan GPUs: This is a common "field error." You cannot reliably install the NVIDIA Container Toolkit or scan for GPUs if the underlying hardware hasn't been properly initialized by the BMC/SBIOS. If the firmware is mismatched, the temporary OS may only see four out of eight GPUs, leading to a false diagnosis of hardware failure when the issue is actually a management-layer configuration error.
B. Boot into OS and run nvidia-smi immediately: nvidia-smi is a powerful tool, but it is a host-side utility. In a fresh bring-up, relying on the OS before the BMC is configured is risky. If the BMC has not been set up to manage the specific power policies of the HGX tray, the GPUs may be held in a "Power Brake" or "Hardware Slowdown" state, giving misleading performance readings during initial checks.
C. Use storage controller to push firmware via data fabric: Storage controllers and the InfiniBand "data fabric" are designed for high-speed data movement, not for low-level system firmware management. Firmware for the GPU baseboard and NVSwitches is delivered through the management network (using the BMC/Redfish) or locally via the SBIOS/UEFI. The data fabric is not active or accessible until the OS and drivers are fully functional.
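The firmware inventory that the BMC exposes out-of-band is typically read over Redfish. The snippet below only builds and prints the standard Redfish inventory URL; BMC_IP and the credentials are placeholders, and an actual query obviously needs a reachable BMC:

```shell
# Placeholder out-of-band address of the BMC (RFC 5737 documentation range).
BMC_IP="192.0.2.10"

# Standard Redfish resource for firmware inventory on most BMC implementations.
url="https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory"
echo "$url"

# On a live system (placeholder credentials):
#   curl -k -u admin:password "$url"
```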
Question 16 of 60
16. Question
A technician is performing a fault detection audit on an NVIDIA HGX system. They observe that one of the eight H100 GPUs is not appearing in the SMI output despite being physically present. After checking the power delivery and cooling, they suspect a firmware mismatch. What is the most appropriate tool and method to identify if the HGX firmware is out of sync across the GPU complex?
Correct: D. Use the nv-fw-updater tool to compare the current VBIOS and firmware versions against the NVIDIA-certified baseline provided in the cluster release notes.
This is correct because the NCP-AII certification blueprint explicitly includes "Perform firmware upgrades (including on HGX) and fault detection" and "Confirm FW/SW on switches" as core tasks within the System and Server Bring-up and Cluster Test and Verification domains.
When a GPU is physically present but not appearing in SMI output after power and cooling checks have been verified, a firmware mismatch across the GPU complex is a likely cause that requires specialized tools for diagnosis.
The nv-fw-updater tool (or nvfwupd) is specifically designed for HGX systems to manage and verify firmware versions across the GPU complex, ensuring all components run certified firmware versions.
The NVIDIA documentation confirms that firmware updates for HGX systems require matching the correct firmware file to the component, and tools like nvfwupd are used to verify and apply firmware across the GPU tray.
Comparing against the NVIDIA-certified baseline ensures all GPUs run identical, validated firmware versions, which is essential for consistent behavior and detection by the driver.
Incorrect: A. Reinstall the Ubuntu operating system to see if the missing GPU is detected by the generic VGA driver during the initial boot sequence.
This is incorrect because reinstalling the OS is a drastic, time-consuming step that bypasses proper diagnostic procedure. The NCP-AII troubleshooting methodology requires systematic identification of hardware faults before making software changes. GPU detection issues after physical installation are typically related to firmware, power, or seating, not OS configuration. Generic VGA drivers would not properly initialize H100 GPUs.
B. Use the 'top' command to monitor CPU usage; if one core is at one hundred percent, it indicates that the missing GPU is currently undergoing a firmware update.
This is incorrect because top is a process-monitoring tool that shows CPU utilization, not a diagnostic tool for GPU firmware status. Firmware updates do not manifest as single-core CPU spikes, and there is no relationship between CPU core usage and GPU firmware update status. This approach has no basis in NVIDIA diagnostic methodology.
C. Run the standard Linux 'lsmod' command to see if the nvidia-peermem module is loaded, as this confirms the firmware version of the GPU hardware.
This is incorrect because lsmod shows loaded kernel modules, not firmware versions. The nvidia-peermem module is related to GPUDirect RDMA functionality, not firmware version verification. Firmware versions cannot be determined from module loading status. The correct tool for firmware comparison is nv-fw-updater.
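The "compare against a certified baseline" step can be illustrated with plain shell tools. The version strings and GPU labels below are invented for the example, not real HGX firmware versions; in practice the report would come from the firmware tooling itself:

```shell
# Invented per-GPU VBIOS report (real data would come from the firmware
# update tool's inventory output).
report="GPU0 96.00.61.00.01
GPU1 96.00.61.00.01
GPU2 96.00.55.00.03"

baseline="96.00.61.00.01"   # certified version from the release notes (invented)

# Print any GPU whose reported version differs from the baseline.
mismatches=$(printf '%s\n' "$report" | awk -v b="$baseline" '$2 != b {print $1}')
echo "Out-of-sync GPUs: ${mismatches:-none}"
```

A GPU flagged here (GPU2 in the sample data) would be the first candidate for a firmware reflash before suspecting a hardware fault.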
Question 17 of 60
17. Question
An infrastructure architect is deploying an NVIDIA BlueField-3 DPU to manage the network control plane for a cluster. To achieve optimal performance and offload networking tasks effectively, the administrator must configure the DPU's internal operation mode. Which mode is specifically designed to allow the DPU to run a separate OS and manage network traffic independently?
Correct: D DPU Mode where internal Arm cores run a Linux-based OS to manage offloads and control plane functions.
The Technical Reason: This is the default and most powerful mode for the BlueField DPU (often referred to as Embedded CPU Function Ownership (ECPF) Mode).
Independent OS: In this mode, the DPU's integrated Arm cores boot their own dedicated operating system (typically Ubuntu or Yocto-based), independent of the host's x86/AMD64 OS.
Control Plane Isolation: The DPU manages the "eSwitch" (embedded switch), allowing it to handle networking, security (like firewalls), and storage offloads (via NVIDIA SNAP) without consuming the host's CPU cycles.
Connectivity: All traffic to and from the host must pass through the DPU's Arm-managed control plane, providing a secure, isolated management layer.
The NCP-AII Context: The certification requires you to know how to verify the DPU's mode and status. You would use the mlxconfig tool from the host to verify the INTERNAL_CPU_OFFLOAD_ENGINE parameter. A value of ENABLED (0) indicates the device is in DPU mode.
Incorrect Options: A. NIC Mode: In NIC Mode, the DPU behaves as a standard, "dumb" network adapter (similar to a ConnectX-7). The internal Arm cores are essentially bypassed or inactive for control plane tasks. While this saves power and reduces complexity, it removes the "programmable" benefits of the DPU. Note that BlueField SuperNICs ship in this mode by default, but standard DPUs do not.
B. Legacy Mode: There is no "Legacy Mode" in the BlueField-3 architecture that targets PCIe Gen2 compatibility. BlueField-3 is designed for PCIe Gen5 and is backward compatible with Gen4/Gen3 through standard PCIe negotiation. Disabling hardware accelerators would defeat the purpose of using a DPU in an AI cluster.
C. Bridge Mode: While the DPU can perform bridging functions at the software layer (using Open vSwitch/OVS), "Bridge Mode" is not a fundamental hardware operation mode. Furthermore, a DPU is designed specifically to perform local processing (encryption, compression, telemetry) rather than simply acting as a transparent cable to a third-party controller.
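The mode check described above can be scripted. The query line below simulates typical mlxconfig-style output for the INTERNAL_CPU_OFFLOAD_ENGINE parameter; the device path and exact output formatting vary by system, so treat this as a sketch rather than canonical tool output:

```shell
# Simulated output line; on a real host the value would come from something
# like: mlxconfig -d /dev/mst/<device> query INTERNAL_CPU_OFFLOAD_ENGINE
line="INTERNAL_CPU_OFFLOAD_ENGINE         ENABLED(0)"

case "$line" in
  *"ENABLED(0)"*)  mode="DPU mode (Arm cores own the offload engine)" ;;
  *"DISABLED(1)"*) mode="NIC mode (Arm cores bypassed)" ;;
  *)               mode="unknown" ;;
esac
echo "$mode"
```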
Incorrect
Correct: D DPU Mode where internal Arm cores run a Linux-based OS to manage offloads and control plane functions.
The Technical Reason: This is the default and most powerful mode for the BlueField DPU (often referred to as Embedded CPU Function Ownership (ECPF) Mode).
Independent OS: In this mode, the DPUs integrated Arm cores boot their own dedicated operating system (typically Ubuntu or Yocto-based), independent of the hosts x86/AMD64 OS.
Control Plane Isolation: The DPU manages the “eSwitch“ (embedded switch), allowing it to handle networking, security (like firewalls), and storage offloads (via NVIDIA SNAP) without consuming the host‘s CPU cycles.
Connectivity: All traffic to and from the host must pass through the DPUs Arm-managed control plane, providing a secure, isolated management layer.
The NCP-AII Context: The certification requires you to know how to verify the DPU‘s mode and status. You would use the mlxconfig tool from the host to verify the INTERNAL_CPU_OFFLOAD_ENGINE parameter. A value of ENABLED (0) indicates the device is in DPU mode.
Incorrect Options: A. NIC Mode In NIC Mode, the DPU behaves as a standard, “dumb“ network adapter (similar to a ConnectX-7). The internal Arm cores are essentially bypassed or inactive for control plane tasks. While this saves power and reduces complexity, it removes the “programmable“ benefits of the DPU. Note that BlueField SuperNICs ship in this mode by default, but standard DPUs do not.
B. Legacy Mode There is no “Legacy Mode“ in the BlueField-3 architecture that targets PCIe Gen2 compatibility. BlueField-3 is designed for PCIe Gen5 and is backward compatible with Gen4/Gen3 through standard PCIe negotiation. Disabling hardware accelerators would defeat the purpose of using a DPU in an AI cluster.
C. Bridge Mode While the DPU can perform bridging functions at the software layer (using Open vSwitch/OVS), “Bridge Mode“ is not a fundamental hardware operation mode. Furthermore, a DPU is designed specifically to perform local processing (encryption, compression, telemetry) rather than simply acting as a transparent cable to a third-party controller.
Unattempted
Correct: D DPU Mode where internal Arm cores run a Linux-based OS to manage offloads and control plane functions.
The Technical Reason: This is the default and most powerful mode for the BlueField DPU (often referred to as Embedded CPU Function Ownership (ECPF) Mode).
Independent OS: In this mode, the DPUs integrated Arm cores boot their own dedicated operating system (typically Ubuntu or Yocto-based), independent of the hosts x86/AMD64 OS.
Control Plane Isolation: The DPU manages the “eSwitch“ (embedded switch), allowing it to handle networking, security (like firewalls), and storage offloads (via NVIDIA SNAP) without consuming the host‘s CPU cycles.
Connectivity: All traffic to and from the host must pass through the DPUs Arm-managed control plane, providing a secure, isolated management layer.
The NCP-AII Context: The certification requires you to know how to verify the DPU‘s mode and status. You would use the mlxconfig tool from the host to verify the INTERNAL_CPU_OFFLOAD_ENGINE parameter. A value of ENABLED (0) indicates the device is in DPU mode.
Incorrect Options: A. NIC Mode: In NIC Mode, the DPU behaves as a standard "dumb" network adapter (similar to a ConnectX-7). The internal Arm cores are essentially bypassed or inactive for control-plane tasks. While this saves power and reduces complexity, it removes the "programmable" benefits of the DPU. Note that BlueField SuperNICs ship in this mode by default, but standard DPUs do not.
B. Legacy Mode: There is no "Legacy Mode" in the BlueField-3 architecture that targets PCIe Gen2 compatibility. BlueField-3 is designed for PCIe Gen5 and is backward compatible with Gen4/Gen3 through standard PCIe negotiation. Disabling hardware accelerators would defeat the purpose of using a DPU in an AI cluster.
C. Bridge Mode: While the DPU can perform bridging functions at the software layer (using Open vSwitch/OVS), "Bridge Mode" is not a fundamental hardware operating mode. Furthermore, a DPU is designed specifically to perform local processing (encryption, compression, telemetry) rather than simply acting as a transparent cable to a third-party controller.
Question 18 of 60
18. Question
A cluster administrator is running the NVIDIA Collective Communications Library (NCCL) tests to verify the East/West fabric bandwidth. They observe that the 'all-reduce' performance is significantly lower than expected for an NDR InfiniBand network. Which tool should be used to verify whether the NVLink Switch fabric is functioning correctly within the nodes?
Correct: B. The nvidia-smi nvlink --status command to check the health and lane activity of the internal GPU interconnects.
The Technical Reason: all-reduce operations rely heavily on high-bandwidth, low-latency communication between all GPUs in the cluster.
- Intra-node Bottleneck: Before data ever hits the InfiniBand network (the "East/West" fabric), it must move between GPUs within the same server via NVLink.
- NVLink Status: The nvidia-smi nvlink --status (or -s) command provides real-time visibility into whether the NVLink lanes are "Up" or "Down." If some lanes are down or running at degraded speeds, the NCCL collective (such as all-reduce) will be forced to fall back to the much slower PCIe bus, drastically reducing performance.
- Lane Activity: It also shows whether the lanes are active. In an NDR (400 Gb/s) InfiniBand environment, a failure in the internal NVLink fabric will prevent the GPUs from saturating the network.
The NCP-AII Context: The exam validates your ability to troubleshoot the NVLink fabric. While InfiniBand handles node-to-node traffic, NVLink handles the extremely high-speed GPU-to-GPU traffic within the node. Understanding this hierarchy is key to isolating performance drops.
Incorrect: A. The ipmitool sensor list command: While high ambient temperatures can cause thermal throttling, ipmitool is a generic hardware-monitoring tool for the BMC. It can tell you if a fan is failing, but it cannot report the logical health or lane status of the NVLink interconnects. It is too far removed from the GPU data plane to diagnose NCCL performance issues.
C. The ngc registry image list command: This is a management command for the NVIDIA GPU Cloud (NGC). It lists available container images in the cloud registry. While using the latest NCCL version is good practice, checking a list of images in the cloud does nothing to verify the actual physical or logical state of the hardware fabric inside your local cluster nodes.
D. The hpl-burnin script: High-Performance Linpack (HPL) is a compute-bound benchmark used to test floating-point performance (R_max). While it stresses the system, it primarily exercises the GPU-to-memory and CPU-to-GPU paths. It is not designed to measure the GPU-to-GPU NVLink bandwidth that all-reduce (a communication-bound collective) requires.
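As a sketch of how the lane check might be automated, the snippet below counts inactive lanes in captured `nvidia-smi nvlink --status` text. The sample layout and the `<inactive>` marker are assumptions about the tool's output, which varies by driver version.

```python
def downed_links(status_text: str) -> int:
    """Count NVLink lanes reported inactive in captured
    `nvidia-smi nvlink --status` output (line format assumed)."""
    return sum(1 for line in status_text.splitlines()
               if "Link" in line and "inactive" in line.lower())

# Hypothetical capture: one of three lanes is down.
sample = (
    "GPU 0: NVIDIA H100\n"
    "\t Link 0: 26.562 GB/s\n"
    "\t Link 1: <inactive>\n"
    "\t Link 2: 26.562 GB/s\n"
)
print(downed_links(sample))  # -> 1
```

A nonzero count here would explain an NCCL all-reduce falling back to PCIe speeds.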
Question 19 of 60
19. Question
To verify the health and performance of inter-GPU communication within a node, an administrator executes the NVIDIA Collective Communications Library (NCCL) tests. If the 'all_reduce' test shows significantly lower bandwidth than expected on an HGX system, which specific hardware component should be investigated first?
Correct: B. The NVLink Switch and NVLink connections.
The Technical Reason: all_reduce is a communication-heavy collective that requires GPUs to sum data across all participants. In an HGX system (such as the H100 or A100), intra-node communication is handled by NVLink, which provides significantly higher bandwidth than PCIe.
- The Bottleneck: If all_reduce bandwidth is low, it typically indicates that the high-speed NVLink fabric is not engaged, has failed lanes, or that NVIDIA Fabric Manager is not correctly training the links.
- The Diagnostic Path: An administrator should check whether the GPUs are falling back to the PCIe bus. This can be verified using nvidia-smi nvlink --status or nvidia-smi topo -m. If the NVLink status is not "Up" or shows errors, the high-speed switch fabric is the primary culprit.
The NCP-AII Context: The exam expects you to understand the hierarchy of AI networking: NVLink carries intra-node (inside the server) traffic and InfiniBand/RoCE carries inter-node (between servers) traffic. Since the question specifies an HGX system (a single node), the internal fabric is the first place to look.
Incorrect: A. The TPM 2.0 module: The Trusted Platform Module (TPM) is a security chip used for hardware-based authentication, encryption keys, and secure boot. It has no involvement in the data plane or the high-speed communication between GPUs. A faulty TPM would prevent a secure boot but would not cause a performance degradation in NCCL.
C. The 1GbE management switch: The management network (OOB) is used for BMC access, IPMI, and basic system administration. It operates at 1 Gb/s, whereas the NVLink 4 fabric moves hundreds of gigabytes per second per GPU. While the management switch is vital for bringing up the node, it does not carry compute traffic and cannot be the source of all_reduce bandwidth issues.
D. The SATA boot drive: The boot drive is used to load the operating system and drivers. Once the NCCL test is running, data moves between GPU memory (HBM) and the fabric. The SATA drive's low speed (about 600 MB/s) is irrelevant to the inter-GPU bandwidths (450 GB/s to 900 GB/s) being tested.
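The PCIe-fallback check can be illustrated with a small helper. The link-type strings (NV18, PHB) follow the legend printed by `nvidia-smi topo -m`; the dict-based input is a simplification for the sketch rather than the tool's actual matrix output.

```python
def pcie_fallback_pairs(topo):
    """List GPU pairs whose link type (as shown by `nvidia-smi topo -m`)
    is not NVLink. Types starting with "NV" mean NVLink; PIX, PXB, PHB,
    and SYS all traverse PCIe/system paths instead."""
    return [pair for pair, link in topo.items()
            if not link.startswith("NV")]

# Hypothetical 3-GPU excerpt of a topology matrix:
topo = {("GPU0", "GPU1"): "NV18",
        ("GPU0", "GPU2"): "NV18",
        ("GPU1", "GPU2"): "PHB"}  # PHB: path crosses the PCIe host bridge
print(pcie_fallback_pairs(topo))  # -> [('GPU1', 'GPU2')]
```

Any pair listed here would be communicating over PCIe, capping all_reduce bandwidth well below the NVLink figures above.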
Question 20 of 60
20. Question
When updating NVIDIA GPU drivers on a production cluster managed by Base Command Manager, what is the recommended procedure to ensure the new drivers are correctly applied to all compute nodes without causing job failures?
Correct: D. Update the software image (category) in BCM, then use the 'node update' command to synchronize the nodes, ensuring they are drained of jobs first.
The Technical Reason:
Image-Based Management: In BCM, compute nodes are grouped into categories. Each category points to a software image (stored in /cm/images/). To update a driver, you first install the new driver into the image on the head node (using chroot or cm-chroot).
Node Draining: GPU drivers cannot be swapped while the kernel modules are in use. If a job is running, the driver update will fail or crash the application. Therefore, nodes must be drained (set to a state where they accept no new jobs and wait for current ones to finish) via the scheduler (Slurm/Kubernetes).
Synchronization: Once the image is updated and the node is idle, the device update (or node update, in older syntax) command synchronizes the local disk/RAM of the compute node with the updated image on the head node.
The NCP-AII Context: The exam validates your knowledge of the BCM workflow. You are expected to know that BCM does not "push" installers; it "syncs" images.
Incorrect Options: A. Directly run the .run installer via SSH: This circumvents the entire purpose of BCM. If you manually run an installer on a node, the change is ephemeral: the next time the node reboots or synchronizes with the head node, BCM will overwrite the manual driver installation with the (older) version found in the official software image, leading to "configuration drift."
B. Use the NGC CLI to push the driver as a container image: While the NVIDIA GPU Operator (used in Kubernetes) can deploy drivers via containers, the NVIDIA Container Toolkit and NGC are for applications, not the base host driver in a standard BCM/HPC environment. Drivers are kernel-level components and are part of the base OS image, not a background container task that follows user jobs.
C. Uninstall using 'apt-get purge' under 100% load: This is a recipe for a system crash. NVIDIA kernel modules (nvidia.ko, nvidia-uvm.ko) cannot be unloaded while they are being accessed by a process. Attempting to "hot-swap" drivers under load is not supported and will result in a kernel panic or a "dead" GPU state until a hard reboot is performed.
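The image-first, drain-second, sync-last ordering can be sketched as a plan generator. The command strings below are illustrative approximations (exact cmsh and Slurm syntax varies by BCM version, and the image and package names are made up); the point the sketch encodes is the ordering of the steps.

```python
def driver_update_plan(nodes, image="default-image",
                       driver_pkg="NVIDIA-Linux-x86_64-550.run"):
    """Ordered (illustrative) command plan for the BCM workflow above:
    update the software image, drain the nodes, then synchronize."""
    plan = [f"cm-chroot /cm/images/{image}  "
            f"# install {driver_pkg} inside the image on the head node"]
    # Drain every node first so running jobs can finish cleanly.
    plan += [f"scontrol update NodeName={n} State=DRAIN Reason=driver-update"
             for n in nodes]
    # Only after draining: sync each node's filesystem with the image.
    plan += [f"cmsh -c 'device use {n}; imageupdate -w'" for n in nodes]
    return plan

for step in driver_update_plan(["node001", "node002"]):
    print(step)
```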
Question 21 of 60
21. Question
When configuring a BlueField network platform to act as a secure infrastructure platform for an AI cluster, the administrator needs to isolate the management plane from the data plane. Which architectural feature of the BlueField platform allows for the offloading of security policies and telemetry without consuming host CPU cycles on the NVIDIA HGX server?
Correct: B. The integrated Arm cores and programmable hardware accelerators.
The NCP-AII certification blueprint includes configuring BlueField DPUs as secure infrastructure platforms within the Physical Layer Management domain.
The integrated Arm cores and programmable hardware accelerators are the fundamental architectural components that enable the BlueField DPU to function as an independent infrastructure processor.
These components allow the DPU to offload, accelerate, and isolate infrastructure workloads such as networking, storage, and security policies from the host CPU.
The DPU's Arm cores run their own operating system and software stack independently of the host, creating a separate security domain for infrastructure services.
Hardware accelerators (for cryptography, compression, and packet processing) perform these tasks with far higher efficiency than CPU cores could achieve.
This architecture provides "air-gapped" isolation between the application domain (host) and the infrastructure domain (DPU), enhancing security while freeing host CPU cycles for application workloads.
Incorrect: A. The secondary Ethernet port used for legacy BMC management.
This is incorrect because the secondary Ethernet port (typically the OOB management port) provides out-of-band connectivity for BMC/DPU management, but it is not the architectural feature responsible for offloading security policies and telemetry. The offloading capability comes from the Arm cores and programmable accelerators, not from management ports.
C. The standard PCIe Gen5 bus connection to the system motherboard.
This is incorrect because the PCIe bus is the physical interconnect that connects the DPU to the host system, but it is not the feature that enables offloading. All PCIe devices have a bus connection; the unique value of the DPU comes from its processing capabilities (Arm cores and accelerators) that operate over that bus.
D. The direct connection to the system's local SATA storage drives.
This is incorrect because the BlueField DPU does not typically have direct connections to local SATA drives. Storage connectivity is handled through the PCIe bus to the host's storage controllers. While the DPU can accelerate storage protocols (NVMe-oF, etc.), this is achieved through its processing capabilities, not through direct SATA connections.
Question 22 of 60
22. Question
A system administrator is using NVIDIA Base Command Manager to deploy an OS image across a new cluster of 64 nodes. The administrator needs to ensure that the Slurm scheduler is properly integrated and that the Enroot and Pyxis plugins are installed. What is the specific function of the Pyxis plugin in this AI infrastructure environment?
Correct: A. It enables Slurm to launch containerized workloads using Enroot.
The Technical Reason: To run AI workloads efficiently at scale, NVIDIA uses a container-native approach for HPC schedulers.
- Enroot: NVIDIA's open-source tool that turns container images (such as those from Docker or NGC) into unprivileged sandboxes. It is designed to be faster and more lightweight than Docker for HPC environments.
- Pyxis: A Slurm SPANK plugin. Its specific role is to act as the "bridge" between Slurm and Enroot. It allows a user to submit a job using standard Slurm commands (e.g., srun --container-image=…) without needing to manually pull or manage the container on every node.
The NCP-AII Context: The exam validates your ability to configure the NVIDIA cluster stack. You are expected to know that Pyxis adds the command-line arguments to Slurm that make container execution seamless for the end user.
Incorrect: B. It manages the power cycling of the GPU nodes via the BMC: This is the function of NVIDIA Base Command Manager (BCM) itself, specifically its integration with IPMI or Redfish. BCM handles bare-metal provisioning and power management. Pyxis operates at a much higher level (the job-scheduler level) and has no direct control over the physical power state of the hardware.
C. It acts as a distributed file system: Pyxis is a software plugin, not a storage protocol. Distributed file systems used in an NVIDIA-certified environment would be Lustre, Weka, or IBM Storage Scale. While Pyxis might interact with these filesystems to mount container layers, it does not provide the storage infrastructure itself.
D. It provides a graphical user interface for monitoring: Monitoring GPU temperatures and health is the role of NVIDIA Data Center GPU Manager (DCGM) and NVIDIA System Management (NVSM). These metrics are typically visualized in the Base Command Manager web portal or a Grafana dashboard. Pyxis is a command-line-driven plugin for the scheduler and has no GUI component.
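A minimal sketch of how a Pyxis-enabled job submission might be composed. The `--container-image` and `--container-mounts` flags are the ones Pyxis adds to srun; the image reference format (registry separated by '#', as used by Enroot) and all resource values are examples, not prescribed settings.

```python
import shlex

def srun_container_cmd(image, cmd, nodes=2, gpus_per_node=8,
                       mounts=("/data:/data",)):
    """Compose an srun invocation using the Pyxis-provided
    --container-image / --container-mounts flags. The --nodes and
    --gpus-per-node flags are ordinary Slurm resource options."""
    argv = ["srun",
            f"--nodes={nodes}",
            f"--gpus-per-node={gpus_per_node}",
            f"--container-image={image}",
            f"--container-mounts={','.join(mounts)}"]
    return " ".join(argv + shlex.split(cmd))

print(srun_container_cmd("nvcr.io#nvidia/pytorch:24.05-py3",
                         "python train.py"))
```

With Pyxis installed, a command like this pulls and starts the container on every allocated node; without the plugin, srun would reject the --container-* flags.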
Question 23 of 60
23. Question
An administrator is configuring an NVIDIA BlueField DPU to operate in a secure AI infrastructure. The goal is to offload networking and security tasks from the host CPU. Which mode should the BlueField DPU be configured in to allow it to run its own internal operating system and manage the hardware eSwitch independently of the host?
Correct: D. DPU Mode (also known as Embedded Function Mode), where the ARM cores on the BlueField device boot an OS like Ubuntu to manage the control plane.
This is correct because DPU Mode, also known as embedded CPU function ownership (ECPF) mode, is the default mode for BlueField DPU SKUs, in which the embedded Arm system controls the NIC resources and data path independently of the host x86 CPU.
In DPU Mode, "the NIC resources and functionality are owned and controlled by the embedded Arm subsystem," and "all network communication to the host flows through a virtual switch control plane hosted on the Arm cores that manages all networking traffic coming and going from the host."
The ARM cores run their own operating system (such as Ubuntu) independently of the host, creating a separate security domain for infrastructure services.
The embedded Arm system runs services that manage the NIC resources and data path, including the hardware eSwitch (embedded switch), which controls all networking traffic.
This architecture enables the offloading of networking and security tasks from the host CPU: the DPU manages infrastructure services independently while freeing host CPU cycles for application workloads.
Incorrect: A. NIC Mode (also known as Non-Embedded Mode), where the BlueField behaves as a standard ConnectX adapter and the ARM cores are disabled.
This is incorrect because in NIC Mode, "the Arm cores of BlueField are inactive, and the device functions as an NVIDIA® ConnectX® network adapter." This mode explicitly disables the ARM cores and does not allow the DPU to run its own operating system or manage the eSwitch independently. While NIC Mode reduces power consumption and improves network performance, it defeats the purpose of offloading networking and security tasks from the host CPU.
B. Separated Host Mode, which allows the host to manage the DPU through a virtualized serial console while the DPU remains in a low-power sleep state.
This is incorrect because "Separated Host Mode" is not a recognized operational mode in NVIDIA BlueField documentation. The documented modes are NIC Mode, DPU Mode, and Zero Trust (Restricted) Mode. The description of a low-power sleep state does not correspond to any valid DPU configuration for active infrastructure offloading.
C. Pass-through Mode, which ignores the internal ARM cores and directly maps the physical network ports to the host PCIe bus for maximum performance.
This is incorrect because "Pass-through Mode" is not a documented operational mode for BlueField DPUs. This description essentially describes NIC Mode behavior but uses non-standard terminology. The goal of offloading networking and security tasks requires the ARM cores to be active and managing the control plane, not ignored in favor of direct host access.
Question 24 of 60
24. Question
During the cluster verification phase, the administrator runs the NVIDIA Collective Communications Library (NCCL) tests across multiple nodes. The test reveals that the East-West fabric bandwidth is significantly lower than expected. What is the most likely cause of this performance bottleneck in the context of cable and switch validation?
Correct: C. One or more InfiniBand cables are improperly seated or have poor signal quality, leading to link-layer errors and reduced effective throughput.
The Technical Reason: AI fabrics are extremely sensitive to physical integrity.
- Link-Layer Errors: A partially seated QSFP/OSFP connector or a damaged fiber-optic strand can cause symbol errors. While the link may stay "Up," the InfiniBand adaptive routing and error-correction mechanisms will trigger retransmissions, significantly lowering the effective bandwidth measured by NCCL.
- Bit Error Rate (BER): High-speed cables must maintain a very low BER. Physical stress or dust on the transceivers can degrade signal quality, causing the link to down-negotiate its width (e.g., from x4 to x1) or speed.
The NCP-AII Context: The exam validates your ability to use InfiniBand diagnostic tools. You would typically use ibqueryerrors or ibdiagnet to identify ports with high error counters (such as SymbolErrorCounter or LinkErrorRecoveryCounter).
Incorrect:
A. Different versions of the Python interpreter. NCCL is a C++/CUDA library. While Python-based frameworks (like PyTorch or TensorFlow) call NCCL, the performance of the fabric itself is handled by the NCCL kernels and the InfiniBand verbs layer. A mismatch in Python versions might cause code to crash, but it would not result in a consistent, clean reduction in raw network bandwidth.
B. Forgot to install the NGC CLI. The NGC CLI is a tool used to download container images and datasets from the NVIDIA GPU Cloud. It is a management-plane utility and is not involved in the data-plane communication between network switches. Switches communicate using low-level protocols (like Subnet Management packets in InfiniBand), which do not require host-side CLI tools to function.
D. Wallpaper on the head node's desktop. This is a "humorous" distractor. The head node typically manages the cluster but does not participate in the high-speed compute traffic of a training job. Furthermore, professional AI infrastructure is almost exclusively run headless (no GUI/desktop environment). Even if a GUI existed, 2D wallpaper would have zero impact on a 400 Gbps InfiniBand data fabric.
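The physical-layer diagnosis described above can be run with the standard InfiniBand tools named in the explanation. A hedged runbook fragment (output formats vary by OFED release, and the commands require a live fabric):

```shell
# Report ports across the fabric with nonzero error counters
# (SymbolErrorCounter, LinkErrorRecoveryCounter, etc.)
ibqueryerrors

# Run a full fabric diagnostic: link health, firmware, and cabling issues
ibdiagnet

# List every link with its negotiated width and speed; a link that has
# down-negotiated (e.g., 1X instead of 4X) stands out immediately
iblinkinfo
```

Reseating or replacing the cable on any port flagged here, then clearing counters and re-running the NCCL test, confirms whether the physical layer was the bottleneck.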
Question 25 of 60
25. Question
A storage bottleneck is suspected in an AI training pipeline. The GPUs are frequently idle while waiting for data. Which optimization should be performed to improve the throughput of the storage system for AI workloads?
Correct: D. Implement GPUDirect Storage (GDS) to enable a direct data path between the storage and GPU memory, bypassing the CPU buffer.
This is correct because the NCP-AII certification blueprint explicitly includes storage optimization as a core component of AI infrastructure, with GPUDirect Storage (GDS) as the key technology for eliminating data-movement bottlenecks.
GDS provides a direct data path between local or remote storage (such as NVMe or NVMe-oF) and GPU memory, enabling direct communication between storage and GPUs.
This is especially important when dataset sizes no longer fit into system memory and data I/O to the GPUs becomes the runtime bottleneck.
By bypassing the CPU bounce buffers, GDS:
- Provides increased bandwidth and lower latency between storage and GPUs
- Frees up CPU resources that would otherwise be consumed by data movement
- Enables more efficient data flow through the entire AI pipeline, from storage to GPU
Enabling a direct data path alleviates the storage bottleneck for scale-out AI and data science workloads.
The VAST Data NCP reference architecture specifically lists NVIDIA Magnum IO GPUDirect Storage (GDS) as a key technology for achieving high-performance storage access.
Incorrect: A. Move the training datasets to a tape-based archive system to take advantage of the high sequential read speeds of magnetic tape.
This is incorrect because tape-based storage has extremely high latency compared to modern storage systems and is unsuitable for active training data. Tape is designed for long-term archival, not for feeding data to GPUs in real-time during training. Magnetic tape systems cannot provide the throughput required to keep modern GPUs utilized.
B. Decrease the MTU on the storage network to 1500 to ensure compatibility with older office switches in the building.
This is incorrect because decreasing MTU (Maximum Transmission Unit) to 1500 would increase packet processing overhead and reduce network efficiency. High-performance storage networks typically use jumbo frames (9000+ MTU) to maximize throughput and reduce CPU overhead. This change would worsen the bottleneck, not improve it.
C. Switch from an RDMA-based storage protocol to a standard NFS over TCP protocol to simplify the network stack.
This is incorrect because RDMA-based protocols are specifically designed to provide low-latency, high-throughput data transfers with minimal CPU involvement. Switching to standard NFS over TCP would increase CPU overhead and latency, making the storage bottleneck worse. The goal is to use RDMA-enabled protocols such as NVMe-oF or NFS over RDMA to maximize performance.
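A hedged verification sketch for a GDS-enabled node follows. The tool paths assume the default CUDA install layout, the target file path is illustrative, and the gdsio flag meanings should be checked against the tool's own help output:

```shell
# Confirm the nvidia-fs kernel module (the GDS "glue" driver) is loaded
lsmod | grep nvidia_fs

# Query platform support: cuFile configuration, supported filesystems,
# and PCIe topology relative to each GPU
/usr/local/cuda/gds/tools/gdscheck -p

# Benchmark a direct storage-to-GPU read with gdsio (GPU 0, 1 GiB file,
# 1 MiB I/Os); compare against a bounce-buffer run to quantify the gain
/usr/local/cuda/gds/tools/gdsio -f /mnt/nvme/testfile -d 0 -s 1G -i 1M -w 4
```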
Question 26 of 60
26. Question
An administrator identifies a faulty BlueField-3 DPU that is causing intermittent network drops. When replacing the card, which of the following is a critical post-replacement step to ensure the new DPU is correctly integrated into the AI cluster's automated management framework?
Correct: C. Updating the DPU firmware and re-provisioning the DOCA runtime image.
The Technical Reason: A replacement BlueField-3 DPU typically ships with generic manufacturing firmware. To integrate it into a production cluster:
- Firmware Alignment: The DPU firmware must match the specific version validated for the rest of the cluster to ensure compatibility with the host's NVIDIA drivers and the InfiniBand/Ethernet switch OS.
- DOCA Runtime (BFB): In "DPU Mode," the card runs its own OS on internal Arm cores. The BlueField Bootstream (BFB) file contains the DOCA (Data Center Infrastructure-on-a-Chip Architecture) runtime, drivers, and libraries. Re-provisioning this image ensures the DPU has the correct identity, security certificates, and offload capabilities (such as NVMe-over-Fabrics or firewall offloads) required by the management framework (e.g., NVIDIA Base Command Manager).
The NCP-AII Context: The exam tests your understanding of the "DOCA stack." Replacing a DPU is not a plug-and-play operation; it requires a bring-up sequence involving firmware synchronization and OS deployment to the DPU itself.
Incorrect Options: A. Manually assigning a public IPv4 address. In an AI cluster, DPUs are typically managed via a private out-of-band (OOB) management network or an internal RShim interface. Assigning a public IP address to an internal port is a security risk and is not part of the standard integration workflow. Automation frameworks like Base Command Manager use DHCP or private static pools for DPU management.
B. Painting the card's bracket. This is a purely aesthetic action with no technical impact on the hardware's function or its integration into the management software. Modifying hardware components (such as painting) can also void the manufacturer's warranty.
D. Disabling the NVSwitch fabric. Disabling the NVSwitch fabric would cripple the cluster's performance by preventing GPUs from communicating at high speeds. The DPU and the NVSwitch fabric are independent components of the data plane; the DPU manages networking and storage, while NVSwitch manages GPU-to-GPU memory copies. There is no technical reason to hide the GPUs from the DPU.
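The re-provisioning step described above is typically driven from the host over the RShim interface. A hedged runbook fragment, where the BFB filename and MST device path are placeholders for your cluster's validated artifacts:

```shell
# Confirm the rshim service sees the replacement DPU
sudo systemctl status rshim

# Push the cluster-validated DOCA runtime image (BFB) to the DPU; this
# reimages the Arm OS and applies the NIC firmware bundled in the BFB
sudo bfb-install --bfb /path/to/validated-doca-image.bfb --rshim rshim0

# After the DPU reboots, verify the running firmware matches the fleet
sudo flint -d /dev/mst/mt41692_pciconf0 q
```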
Question 27 of 60
27. Question
An administrator is performing a High-Performance Linpack (HPL) test on a newly installed 64-node cluster. The test completes, but the GFLOPS achieved are significantly lower than the theoretical peak for the H100 GPUs. What is the most likely cause related to the physical layer or configuration?
Correct
Correct: D. The NVLink Switch is disabled, forcing all GPU-to-GPU communication over the slower PCIe bus or the North-South network fabric.
The NCP-AII certification blueprint explicitly includes “Perform single-node NCCL (including verifying NVLink Switch)” and “Run NCCL to verify E/W fabric bandwidth” as core tasks within the Cluster Test and Verification domain. This indicates that NVLink Switch functionality is a critical component that must be verified during cluster validation.
NVLink technology provides substantially greater bandwidth than PCIe: fourth-generation NVLink (used in H100 systems) runs at 100 Gbps per lane, more than triple the 32 Gbps of a PCIe Gen5 lane, and provides 900 GB/s of bidirectional bandwidth per GPU.
The NVLink Switch enables high-bandwidth, any-to-any connectivity between all GPUs in a server. In an 8-GPU H100 system, the NVLink Switch provides 3.6 TB/s of bisection bandwidth and 450 GB/s of bandwidth for reduction operations.
If the NVLink Switch is disabled, GPU-to-GPU communication falls back to the PCIe bus, which offers significantly lower bandwidth (approximately 5-7x less per GPU). This creates a severe bottleneck for collective operations like all-reduce that are essential for HPL performance.
The NVIDIA technical blog confirms that the performance needs of AI and HPC workloads “require high-bandwidth communication between every GPU,” and NVLink is specifically designed to enable this. Disabling this interconnect would directly cause the significant performance drop observed in the HPL test.
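The bandwidth gap described above can be sanity-checked with simple arithmetic, using the nominal figures quoted in this explanation (the link-rate constants below are those figures, not measured values):

```python
# Back-of-the-envelope comparison of per-GPU interconnect bandwidth,
# using the nominal figures quoted above.

NVLINK4_BIDIR_GBPS = 900          # GB/s bidirectional per H100 GPU over NVLink
PCIE_GEN5_LANE_GBITS = 32         # Gbit/s per PCIe Gen5 lane, per direction
PCIE_LANES = 16                   # a GPU typically sits in an x16 slot

# PCIe Gen5 x16: 16 lanes * 32 Gbit/s = 512 Gbit/s = 64 GB/s per direction,
# i.e. ~128 GB/s bidirectional (encoding/protocol overhead ignored).
pcie_bidir_gbps = PCIE_GEN5_LANE_GBITS * PCIE_LANES * 2 / 8

ratio = NVLINK4_BIDIR_GBPS / pcie_bidir_gbps
print(f"PCIe Gen5 x16 bidirectional: {pcie_bidir_gbps:.0f} GB/s")
print(f"NVLink advantage: ~{ratio:.1f}x")  # ~7x, consistent with the 5-7x range above
```

The ~7x figure explains why an HPL run that silently falls back to PCIe lands far below the theoretical peak.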
Incorrect: A. The administrator forgot to install the NGC CLI on the compute nodes before starting the HPL job.
This is incorrect because NGC CLI is a tool for downloading containers and managing NGC resources, not a component required for HPL execution. The NCP-AII blueprint lists “Install NGC CLI on hosts” under Control Plane Installation and Configuration, but missing this tool would not cause a 50% reduction in HPL GFLOPS; it would simply prevent container downloads.
B. The storage array is using SATA drives instead of NVMe, causing a bottleneck in the initial loading of the HPL binary.
This is incorrect because HPL (High-Performance Linpack) is a compute-intensive benchmark that primarily stresses GPUs, CPUs, and memory, not storage I/O. Once the HPL binary is loaded into memory, storage performance has negligible impact on the actual floating-point computation results. Storage testing is a separate verification task in the exam blueprint.
C. The Slurm scheduler is configured with a default wall-clock limit that is too short for the HPL execution.
This is incorrect because if the wall-clock limit were too short, the job would be terminated before completion, not complete with significantly reduced performance. Slurm is a workload manager for job scheduling, and wall-clock limits affect job duration, not the achieved GFLOPS during execution. The HPL test completed successfully, just at lower performance, ruling out scheduling-related premature termination.
Question 28 of 60
28. Question
During a performance audit of an AI factory, it is discovered that the InfiniBand fabric is experiencing high levels of ‘congestion discard‘ packets. Which optimization strategy should the network administrator apply at the switch level to resolve this and improve collective communication performance?
Correct
Correct: C Enable Adaptive Routing and configure Congestion Control (CC) parameters on the InfiniBand switches and DPUs.
The Technical Reason: “Congestion discard“ occurs when a switch‘s internal buffers overflow because a specific egress port is overwhelmed (often called a “hot spot“).
Adaptive Routing (AR): Instead of pinning a data flow to a single static path, AR allows the switch to dynamically select the least congested path for each packet. This balances the load across all available links in the fat-tree topology.
Congestion Control (CC): This mechanism allows the fabric to signal the source (the HCA/DPU) to “throttle back“ its injection rate when it detects downstream congestion. This prevents buffers from filling up and dropping packets, which is critical for lossless fabrics like InfiniBand.
The NCP-AII Context: The exam validates your ability to configure NVIDIA Quantum-2 switches. You are expected to know how to enable these features via the Subnet Manager (SM) configuration or the switch‘s command-line interface (MLNX-OS/NVIDIA Air).
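Diagnosing a “hot spot” usually starts by reading per-port transmit-discard counters from the fabric. The sketch below parses a simplified counter dump; the dump format is an illustrative stand-in for real `perfquery`/`ibdiagnet` output, though the `PortXmitDiscards` counter name itself is a standard InfiniBand port counter:

```python
# Minimal sketch: flag switch ports whose transmit-discard counters are
# non-zero in a captured counter dump. The line format here is a simplified
# stand-in for real fabric-diagnostic output (an assumption for illustration).

sample_dump = """\
port=1 PortXmitDiscards=0 PortXmitData=123456789
port=7 PortXmitDiscards=4821 PortXmitData=987654321
port=12 PortXmitDiscards=0 PortXmitData=555555555
"""

def congested_ports(dump: str) -> list[int]:
    """Return port numbers with a non-zero PortXmitDiscards counter."""
    hot = []
    for line in dump.splitlines():
        fields = dict(kv.split("=") for kv in line.split())
        if int(fields.get("PortXmitDiscards", "0")) > 0:
            hot.append(int(fields["port"]))
    return hot

print(congested_ports(sample_dump))  # port 7 is the congestion hot spot
```

Ports flagged this way are the candidates that Adaptive Routing and Congestion Control are meant to relieve.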
Incorrect Options: A. Disable the Subnet Manager The Subnet Manager (SM) is the “brain“ of the InfiniBand fabric. Disabling it would cause the entire fabric to stop functioning, as no routes would be calculated and no nodes would be discovered. The SM must be active and ideally redundant (Master/Standby) to maintain the network.
B. Physically disconnect half of the compute nodes While this would technically reduce traffic, it is not an “optimization strategy”; it is a reduction in computational capacity. An AI factory is designed to run at maximum scale; the goal of an administrator is to tune the fabric to handle the designed load, not to disable the hardware.
D. Reduce the MTU size to 1500 bytes In InfiniBand AI fabrics, a large MTU (typically 4096 bytes) is preferred to maximize throughput and reduce the CPU overhead of processing headers. Reducing the MTU to 1500 (the standard for Ethernet) would increase the packet-per-second load on the switch silicon and the DPUs, likely increasing congestion and overhead rather than solving it.
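The packet-rate penalty of a small MTU is easy to quantify. The sketch below assumes a 400 Gb/s (NDR-class) link for illustration and ignores header overhead:

```python
# Packet-rate arithmetic behind the MTU point above: smaller packets mean
# more packets per second at the same throughput, and therefore more
# per-packet work for the switch silicon and DPUs. The 400 Gb/s link rate
# is assumed for illustration.

LINK_GBITS = 400

def packets_per_sec(mtu_bytes: int, link_gbits: float = LINK_GBITS) -> float:
    """Packets/sec needed to saturate the link at a given packet size
    (header overhead ignored for simplicity)."""
    return link_gbits * 1e9 / 8 / mtu_bytes

pps_1500 = packets_per_sec(1500)
pps_4096 = packets_per_sec(4096)
print(f"MTU 1500: {pps_1500/1e6:.1f} Mpps, MTU 4096: {pps_4096/1e6:.1f} Mpps")
print(f"MTU 1500 requires ~{pps_1500/pps_4096:.1f}x the packet rate")
```

Roughly 2.7x more packets per second at MTU 1500 means proportionally more forwarding decisions and buffer operations, which is why shrinking the MTU tends to worsen congestion rather than relieve it.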
Question 29 of 60
29. Question
During the installation of the NVIDIA GPU drivers on a cluster of nodes with BlueField-3 DPUs, the administrator must also install the DOCA drivers. What is the primary purpose of the DOCA software stack in the context of the AI infrastructure control plane?
Correct
Correct: D. To provide the necessary drivers and libraries to program the DPU‘s hardware accelerators for networking, storage, and security offloading.
The NCP-AII certification blueprint includes configuring BlueField DPUs with DOCA as part of the Physical Layer Management domain.
The primary purpose of the DOCA (Data Center Infrastructure-on-a-Chip Architecture) software stack is to provide a comprehensive framework for programming NVIDIA BlueField DPUs.
DOCA is explicitly defined as “the software infrastructure for BlueField‘s main hardware entities,” containing “a runtime and development environment, including libraries and drivers for device management and programmability, for the host and as part of a BlueField Platform Software.”
The framework “enables rapidly creating and managing applications and services on top of the BlueField networking platform” and allows developers to “deliver breakthrough networking, security, and storage performance by harnessing the power of NVIDIA‘s BlueField data-processing units (DPUs).”
DOCA includes specialized libraries for each acceleration domain; for example, storage acceleration libraries (DOCA Compress, DOCA SNAP) handle storage offloading.
The DOCA-Host package, installed alongside NVIDIA drivers, provides “all the required libraries and drivers for hosts that include NVIDIA Networking platforms (i.e., BlueField and ConnectX).”
By providing these drivers and libraries, DOCA enables the DPU to offload infrastructure tasks from the host CPU, freeing valuable compute resources for AI workloads.
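Before installing the DOCA-Host package, a first sanity check is confirming the node actually exposes a BlueField device on the PCI bus. The sketch below scans lspci-style output; the sample device lines are illustrative stand-ins, not output captured from any specific system:

```python
# Quick pre-install sanity check sketch: does this node have a BlueField DPU?
# The sample lspci-style lines are illustrative stand-ins (an assumption),
# not real output from a specific system.

sample_lspci = """\
17:00.0 Ethernet controller: Mellanox Technologies BlueField-3 integrated ConnectX-7
a1:00.0 3D controller: NVIDIA Corporation GH100 [H100 SXM5 80GB]
"""

def has_bluefield(lspci_output: str) -> bool:
    """Return True if any PCI device line mentions a BlueField DPU."""
    return any("BlueField" in line for line in lspci_output.splitlines())

print(has_bluefield(sample_lspci))  # True
```

In practice the input would come from running `lspci` on the host; a node that fails this check has no DPU for DOCA-Host to drive.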
Incorrect: A. To act as a secondary compiler for CUDA kernels, allowing them to run more efficiently on the ARM cores of the BlueField-3 DPU.
This is incorrect because DOCA is not a CUDA compiler. CUDA kernels are compiled by the NVIDIA CUDA compiler (NVCC) for execution on GPUs, not for ARM cores. The ARM cores in BlueField DPUs run infrastructure services, not CUDA compute workloads. DOCA provides libraries for infrastructure acceleration, not GPU compute compilation.
B. To manage the thermal cooling policies of the HGX baseboard by adjusting the fan speeds based on the InfiniBand traffic volume.
This is incorrect because thermal management and fan speed control are handled by the Baseboard Management Controller (BMC) and system thermal firmware, not by DOCA. DOCA focuses on networking, storage, and security offloading for DPUs, not hardware thermal management. The BMC manages cooling independently of the DOCA software stack.
C. To provide a web-based GUI for the BMC that allows administrators to manually assign MIG profiles to individual Docker containers.
This is incorrect because DOCA does not provide a web-based GUI for BMC management or MIG profile assignment. MIG (Multi-Instance GPU) configuration is performed through NVIDIA drivers and tools like nvidia-smi, not through DOCA. BMC management has its own dedicated interfaces (IPMI/Redfish) separate from the DOCA framework.
Question 30 of 60
30. Question
Following the physical installation of an 8-node HGX H100 cluster, the team must run the High-Performance Linpack (HPL) benchmark. What is the primary purpose of executing HPL during the cluster verification phase of an AI infrastructure deployment?
Correct
Correct: D. To verify the maximum floating-point performance and thermal stability.
The NCP-AII certification blueprint explicitly lists “Execute HPL (High-Performance Linpack)” and “Perform HPL burn-in” as core tasks within the Cluster Test and Verification domain, which comprises 33% of the examination.
HPL is the industry-standard benchmark for measuring the floating-point compute performance of supercomputers and is the basis for the TOP500 list.
During cluster verification, HPL serves two primary purposes:
First, it verifies the maximum floating-point performance (measured in GFLOPS or TFLOPS) that the system can achieve under ideal conditions.
Second, running HPL as a burn-in test stresses the entire system (GPUs, CPUs, memory, and interconnect) under sustained maximum load, which validates thermal stability and ensures all components operate reliably without throttling or failure.
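The acceptance decision implied above reduces to comparing measured HPL throughput against the cluster's theoretical peak. In the sketch below, the per-GPU peak and the 70% threshold are illustrative parameters, not claims about a specific SKU or acceptance policy:

```python
# Sketch of an HPL acceptance check: compare measured throughput against
# theoretical peak and flag low efficiency. The per-GPU peak value and the
# 0.7 threshold are assumed parameters for illustration.

def hpl_efficiency(measured_tflops: float, gpus: int,
                   peak_per_gpu_tflops: float) -> float:
    """Fraction of theoretical peak achieved by the HPL run."""
    return measured_tflops / (gpus * peak_per_gpu_tflops)

# Hypothetical 8-node x 8-GPU cluster with an assumed 30 TFLOPS FP64 peak/GPU:
eff = hpl_efficiency(measured_tflops=1150.0, gpus=64, peak_per_gpu_tflops=30.0)
print(f"HPL efficiency: {eff:.1%}")
if eff < 0.7:  # well-tuned clusters typically achieve considerably more
    print("Efficiency low -- investigate interconnect and cooling before sign-off")
```

A result far below the chosen threshold points back at the physical layer (NVLink, fabric, thermals) rather than at the benchmark itself.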
Incorrect: A. To ensure that the NVIDIA NGC CLI is correctly authenticated.
This is incorrect because NGC CLI authentication is a separate task within the Control Plane Installation and Configuration domain. HPL is a compute performance benchmark, not a tool for validating command-line interface authentication.
B. To check the read/write speeds of the local SATA boot drives.
This is incorrect because storage performance testing is a separate verification task explicitly listed in the exam blueprint under “Test storage.” HPL focuses on floating-point computation and system thermal stability, not storage I/O performance. Local SATA boot drive speeds are irrelevant to HPL‘s purpose and would be validated through storage-specific benchmarks.
C. To test the latency of the management network‘s DHCP server.
This is incorrect because the management network and DHCP services are part of the out-of-band (OOB) infrastructure. HPL tests compute performance across the high-speed fabric, not management network services. Network latency testing would be performed using different tools.
Question 31 of 60
31. Question
An administrator is optimizing the performance of an AI cluster with a mix of Intel and AMD servers. They find that the storage throughput is lower than expected on the Intel nodes. Which optimization technique specifically targets the reduction of latency and CPU overhead for NVMe-based storage?
Correct
Correct: C. Implementing NVIDIA GPUDirect Storage to enable a direct DMA path between the storage and the GPU memory, bypassing the CPU.
The Technical Reason: NVIDIA GPUDirect Storage (GDS) is the specific technology designed to solve the I/O bottleneck in AI infrastructure.
Direct Data Path: Without GDS, data must be copied from NVMe storage into the CPU's system memory (a "bounce buffer") before being copied again to GPU memory. GDS enables a Direct Memory Access (DMA) engine to transfer data directly between the NVMe drive (or a remote storage target over NVMe-oF) and the GPU's High Bandwidth Memory (HBM).
Reduced CPU Overhead: By bypassing the CPU, GDS significantly reduces CPU utilization and context switches, which is particularly beneficial on nodes where the CPU is already heavily taxed by data preprocessing tasks.
Latency Reduction: Eliminating the extra "hop" through system RAM lowers end-to-end latency for training datasets.
The NCP-AII Context: The exam validates your understanding of the NVIDIA Magnum IO suite, of which GDS is a core component. You are expected to know that GDS requires the nvidia-fs kernel module and a compatible filesystem (e.g., Lustre, Weka, or local XFS/ext4 with specific configurations).
Incorrect Options:
A. Disabling PCIe Gen5 and using USB 2.0. This would be a massive performance regression. PCIe Gen5 provides the high-bandwidth lanes (64 GB/s for a x16 slot) required for NDR networking and H100 GPUs. USB 2.0 is limited to 480 Mbps, several orders of magnitude slower than even the oldest NVMe drives. Using an external hub would make AI training virtually impossible due to severe data starvation.
B. Replacing NVMe with tape drives. While tape drives are excellent for long-term archival storage due to their high capacity and low cost, they have extremely high latency (seek time) and are not suited to the random-access patterns or high-throughput requirements of active model training. NVMe (Non-Volatile Memory Express) is the industry standard for high-performance AI data lakes.
D. Increasing the Linux swap file size. The Linux swap file is used when the system runs out of physical RAM; it is much slower than actual memory. Increasing swap does not create "virtual GPU memory." While Unified Memory (UM) allows oversubscription of GPU memory, it relies on the GPU's page migration engine, not the Linux system swap file. Using swap on an NVMe drive for GPU tasks would drastically increase latency and decrease throughput.
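The benefit of removing the bounce-buffer hop can be seen with simple arithmetic. The model below is illustrative only, not a measurement; the 25 GB/s per-hop bandwidth and 100 GB shard size are assumptions chosen to make the comparison easy to read.

```python
# Illustrative model: time to move a dataset shard into GPU HBM with and
# without GPUDirect Storage. Bandwidth and size figures are assumptions.

def transfer_time_s(size_gb: float, hop_bandwidths_gbps: list[float]) -> float:
    """Sum per-hop copy times; each hop is a full copy of the data."""
    return sum(size_gb / bw for bw in hop_bandwidths_gbps)

size_gb = 100.0
# Traditional path: NVMe -> CPU bounce buffer -> GPU HBM (two full copies).
baseline = transfer_time_s(size_gb, [25.0, 25.0])
# GDS path: NVMe -> GPU HBM via direct DMA (one copy, no CPU bounce buffer).
gds = transfer_time_s(size_gb, [25.0])

print(f"bounce-buffer path: {baseline:.1f} s, GDS path: {gds:.1f} s")
```

The model ignores the CPU-cycle savings entirely, so it understates the real-world benefit on preprocessing-heavy nodes.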
Question 32 of 60
32. Question
The ClusterKit tool is being used for a multifaceted node assessment. One of the tests fails because the ‘NVLink Switch‘ cannot be verified. What does this failure imply for the AI workloads intended for that node, and which physical component should be inspected first?
Correct
Correct: D. The GPUs cannot communicate with each other at full speed; the physical HGX/DGX baseboard or NVSwitch modules should be inspected.
The NCP-AII certification blueprint includes verifying NVLink Switch functionality as part of the Cluster Test and Verification domain, which comprises 33% of the examination.
When ClusterKit‘s multifaceted node assessment reports that the ‘NVLink Switch‘ cannot be verified, this specifically indicates a failure in the high-speed GPU-to-GPU interconnect.
NVLink and NVSwitch are the technologies that enable GPUs within a node to communicate with each other at very high speeds (up to 900 GB/s on H100 systems).
A failure to verify the NVLink Switch directly implies that the GPUs cannot communicate with each other at full speed, which would severely impact distributed training workloads that rely on fast peer-to-peer GPU communication.
The physical components to inspect first are the HGX baseboard (which houses the NVSwitch chips) or the NVSwitch modules themselves, as these are the hardware elements responsible for the NVLink fabric within the node.
This aligns with standard troubleshooting methodology where physical hardware inspection precedes software-level diagnosis.
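In practice, the first software-side data point for this triage is the per-link status that `nvidia-smi nvlink --status` prints. A minimal sketch of scanning that output for dead links follows; the sample text is illustrative, not captured from a real system, and a real run would have many more links per GPU.

```python
# Hypothetical triage helper: count inactive NVLink lanes in
# `nvidia-smi nvlink --status`-style output. Sample text is illustrative.

sample = """\
GPU 0: NVIDIA H100 (UUID: GPU-aaaa)
\t Link 0: 26.562 GB/s
\t Link 1: <inactive>
GPU 1: NVIDIA H100 (UUID: GPU-bbbb)
\t Link 0: 26.562 GB/s
\t Link 1: 26.562 GB/s
"""

inactive = [line.strip() for line in sample.splitlines() if "<inactive>" in line]
print(f"{len(inactive)} inactive NVLink lane(s) found")
```

Any non-zero count on an HGX system points the inspection at the baseboard/NVSwitch fabric first, consistent with the reasoning above.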
Incorrect: A. The node cannot connect to the internet; the RJ45 management cable should be inspected.
This is incorrect because NVLink Switch verification tests the high-speed GPU interconnect fabric, not internet connectivity. The RJ45 management cable is part of the out-of-band (OOB) management network and has no relation to NVLink functionality.
B. The CPU cannot access the system memory; the DDR5 DIMM slots should be inspected for bent pins.
This is incorrect because NVLink Switch verification is specific to GPU-to-GPU communication, not CPU memory access. System memory issues would manifest as different failures in other tests, not an NVLink Switch verification failure.
C. The storage system is too slow; the NVMe drive‘s firmware should be updated immediately.
This is incorrect because NVLink Switch verification is unrelated to storage performance. Storage testing is a separate verification task explicitly listed in the exam blueprint under “Test storage“ . NVLink specifically handles GPU peer-to-peer communication, not storage I/O.
Question 33 of 60
33. Question
In a cluster utilizing BlueField-3 Data Processing Units (DPUs), the network team wants to offload the OVS (Open vSwitch) data path to the DPU hardware. Which NVIDIA platform must be configured to manage these DPU resources, and what is the primary benefit for AI workloads?
Correct
Correct: D. The DOCA (Data Center Infrastructure-on-a-Chip Architecture) framework; it offloads network and storage tasks from the host CPU to the DPU.
The Technical Reason: NVIDIA DOCA is the software development kit (SDK) and runtime environment that unlocks the potential of the BlueField DPU.
Hardware Offloading: DOCA provides the necessary drivers and APIs to offload the OVS (Open vSwitch) data path. Instead of the host CPU processing every network packet header (which consumes significant “expensive“ CPU cycles), the DPU‘s dedicated hardware accelerators handle the switching, routing, and security policies.
Primary Benefit: For AI workloads, this ensures that the host CPU remains focused on job scheduling, data preprocessing, and managing the GPU kernels. It significantly reduces “system noise“ and latency, which is critical for synchronous operations like NCCL collectives.
The NCP-AII Context: The exam validates your ability to “Confirm FW/SW on BlueField-3“ and understand how the DPU interacts with the rest of the stack. DOCA is the primary management and acceleration layer for all DPU-based services.
Incorrect Options:
A. The NVIDIA Container Toolkit. The NVIDIA Container Toolkit (including nvidia-docker2) is used to expose GPUs to containerized applications. It does not manage DPU network offloads or OVS data paths. Furthermore, DPU memory is not typically mapped to GPU memory via the NVLink Switch; NVLink is an internal GPU-to-GPU fabric, whereas the DPU communicates with the GPU over the PCIe bus using GPUDirect RDMA.
B. Base Command Manager (BCM). While Base Command Manager (formerly Bright Cluster Manager) can provision and monitor DPU-equipped nodes, it is a cluster management platform, not the framework used to offload network data paths. Additionally, you do not “replace the DPU firmware with a standard Linux kernel“ to run CUDA kernels on a DPU. The DPU runs its own optimized OS (often Ubuntu-based) and is intended for infrastructure acceleration, not for running CUDA-based training.
C. The NGC CLI. The NGC CLI is a tool for interacting with the NVIDIA GPU Cloud (NGC) to pull containers, models, and datasets. It is an administrative utility for software distribution and has no capability to virtualize DPU hardware or manage low-level network offloads like OVS.
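After OVS offload is configured, a common verification step is reading `other_config:hw-offload` from the OVS database (`ovs-vsctl get Open_vSwitch . other_config:hw-offload`). The sketch below only models parsing that value; the stand-in string replaces an actual command invocation, which would not run outside a BlueField node.

```python
# Hedged sketch: interpret the OVS hw-offload flag. On a real node the value
# would come from `ovs-vsctl get Open_vSwitch . other_config:hw-offload`;
# here a stand-in string is used so the example is self-contained.

raw = '"true"'  # ovs-vsctl returns the stored value wrapped in quotes

def hw_offload_enabled(value: str) -> bool:
    """True when OVS is configured to push the data path into DPU hardware."""
    return value.strip().strip('"') == "true"

if hw_offload_enabled(raw):
    print("OVS hardware offload enabled: data path handled by the DPU")
else:
    print("OVS hardware offload disabled: every packet burns host CPU cycles")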

Question 34 of 60
34. Question
To enable GPU-accelerated containers, an administrator must install the NVIDIA Container Toolkit. Which component of the toolkit is responsible for modifying the container runtime‘s configuration so that it can automatically discover and mount the NVIDIA GPU device nodes and libraries into the container at startup?
Correct
Correct: D. The nvidia-container-runtime-hook (or the integrated CDI implementation), which acts as a pre-start hook for the container engine.
This is correct because the NVIDIA Container Runtime Hook (included in the nvidia-container-toolkit package) is specifically designed to implement the interface required by an OCI prestart hook.
The hook is invoked by the low-level runtime (like runc) after a container has been created but before it has been started.
When invoked, the hook is given access to the config.json associated with the container and uses this information to invoke the nvidia-container-cli utility with an appropriate set of flags, most importantly determining which specific GPU devices should be injected into the container.
In the context of Docker, when a user specifies the --gpus flag, the NVIDIA Container Runtime (nvidia-container-runtime) injects this hook as a prestart hook into the OCI spec, which then performs the automatic discovery and mounting of GPU device nodes and libraries.
More recent versions of the toolkit also include an integrated Container Device Interface (CDI) implementation, which serves a similar purpose of abstracting device access and can be used as an alternative to the traditional hook mechanism.
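Conceptually, the runtime's edit to the OCI spec is just adding an entry to the `hooks.prestart` array in config.json. The sketch below illustrates that shape; it mirrors the structure of a real OCI spec but is a simplified illustration, not the toolkit's actual code.

```python
# Illustration (not toolkit source): how a prestart hook entry is injected
# into an OCI runtime spec so it runs after "create" and before "start".
import json

# A pared-down stand-in for a container's config.json:
spec = {"ociVersion": "1.0.2", "process": {"args": ["python", "train.py"]}}

# Inject the hook entry the way the NVIDIA runtime conceptually does:
spec.setdefault("hooks", {}).setdefault("prestart", []).append({
    "path": "/usr/bin/nvidia-container-runtime-hook",
    "args": ["nvidia-container-runtime-hook", "prestart"],
})

print(json.dumps(spec["hooks"], indent=2))
```

When runc reaches the prestart phase it executes each listed hook with the container's state on stdin, which is how the GPU device nodes and driver libraries appear inside the container before the workload starts.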
Incorrect: A. The NVIDIA GPU Firmware, which contains the logic to inject files into the Linux kernel‘s namespace.
This is incorrect because GPU firmware operates at the hardware level and is not responsible for container runtime configuration. The NVIDIA Container Toolkit is a userspace software component that works with the container runtime, not firmware that interacts with kernel namespaces.
B. The Slurm database, which tracks the location of all GPU libraries and copies them into the container‘s root filesystem.
This is incorrect because Slurm is a workload manager for job scheduling, not a component of the NVIDIA Container Toolkit. Slurm does not handle runtime injection of GPU devices or libraries into containers.
C. The TPM 2.0 module, which provides the cryptographic keys needed to unlock the GPU for use by Docker or Podman.
This is incorrect because the Trusted Platform Module (TPM) is a security chip used for cryptographic operations and platform integrity, not for unlocking GPUs for container access. GPUs do not require cryptographic unlocking via TPM for container use.
Question 35 of 60
35. Question
A lead engineer is overseeing the initial deployment of an NVIDIA HGX H100 cluster within a high-density data center. During the bring-up phase, the BMC reports that the power supply units are functioning but the GPUs are not reaching their maximum clock speeds during validation. Which specific sequence of actions is most appropriate to validate that the physical infrastructure supports the peak power requirements of the AI factory?
Correct
Correct: D Check the BMC power capping settings to ensure they are not restricted and verify that the PDU branch circuits provide the required 200V-240V input to the PSUs.
The Technical Reason: If GPUs are not reaching maximum clock speeds (throttling), it is often due to power limitations rather than thermal issues.
Power Capping: The Baseboard Management Controller (BMC) includes settings to cap total system power. If the "Power Limit" is set below the maximum TDP of the HGX baseboard (often to protect a data center with limited cooling/power), the GPUs will be prevented from entering their "Boost" clock states.
Input Voltage: NVIDIA HGX/DGX power supply units (PSUs) require high-voltage AC input (200V-240V) to provide their full rated output (e.g., 3.3 kW per PSU). If the rack PDUs are mistakenly providing lower voltage (110V-120V), the PSUs will either fail to start or operate in a severely derated capacity, causing the system to throttle the GPUs to prevent a crash.
The NCP-AII Context: The exam validates your ability to ensure "power, thermals, and airflow are within spec." This involves verifying the physical electrical chain from the PDU branch circuit to the BMC firmware policy.
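The power checks described in the explanation above can be sketched from the command line as follows. The nvidia-smi and ipmitool invocations are standard tools, but sensor names and output fields vary by platform, so treat this as an illustrative outline rather than an exact procedure:

```shell
#!/bin/sh
# Sketch of a power-delivery sanity check for a GPU node.

# 1. On the host: compare the enforced power limit to the board maximum.
#    nvidia-smi -q -d POWER | grep -E 'Power Limit'

# 2. Via the BMC: read chassis power draw and PSU input voltage sensors.
#    ipmitool dcmi power reading
#    ipmitool sdr type 'Voltage'

# Helper: PSUs need 200-240V AC input to deliver full rated output.
check_input_voltage() {
    v="$1"
    if [ "$v" -ge 200 ] && [ "$v" -le 240 ]; then
        echo "OK: ${v}V is within the 200-240V range"
        return 0
    else
        echo "WARN: ${v}V input will derate the PSUs"
        return 1
    fi
}

check_input_voltage 208
check_input_voltage 120 || true
```

If the voltage check passes but clocks are still capped, the next place to look is the BMC's power-limit policy itself.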
Incorrect: A. Replace OSFP transceivers with active optical cables While Active Optical Cables (AOC) are used for long-distance networking, they have no impact on the power distribution within a server rack. Electromagnetic interference (EMI) from network cables is not a factor that causes GPU power throttling; throttling is a logical or electrical capacity issue.
B. Reinstall the NVIDIA Container Toolkit The NVIDIA Container Toolkit allows containers to access GPU hardware. While nvidia-smi (part of the driver) communicates with the GPU, the actual power-delivery logic is handled by the hardware midplane and BMC firmware, independent of the container runtime or toolkit. Reinstalling it would not resolve a low-level electrical or BMC-imposed power cap.
C. Downgrade HGX baseboard firmware Downgrading firmware is rarely a valid troubleshooting step in the NCP-AII curriculum. Legacy firmware may lack support for newer H100 steppings or power management features. Furthermore, "bypassing power-sensing logic" is extremely dangerous and could lead to hardware damage, fire, or "tripping" the data center breakers by exceeding the physical circuit capacity.
Question 36 of 60
36. Question
A network card (NIC) in an AI server is identified as faulty after showing intermittent link drops. The administrator needs to replace the card. Which step is critical to ensure that the new card is recognized and functions with the same performance characteristics as the rest of the cluster?
Correct
Correct: B The administrator must verify and update the firmware of the new NIC to match the specific version used across the cluster as defined in the BCM category.
The Technical Reason: In an NVIDIA-certified environment managed by NVIDIA Base Command Manager (BCM), nodes are organized into Categories.
The Golden Recipe: A category defines the exact "Golden Stack" (OS, driver, and firmware versions) that all nodes in that group must run.
Firmware Synchronization: When a NIC is replaced, it often ships with the "factory latest" firmware, which may be newer or older than the cluster's validated baseline. If the firmware version does not match, the NIC might behave differently under load (e.g., different congestion control or NCCL performance), leading to distributed training stalls.
Verification: The administrator should use tools like mlxfwmanager (for ConnectX/BlueField) or the nvfwupd tool to ensure the new card is aligned with the cluster-wide standard.
The NCP-AII Context: The exam validates your ability to "Replace faulty cards" and "Configure categories" in BCM. Correct lifecycle management involves ensuring that any new hardware is instantly "normalized" to the cluster's standard.
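A minimal sketch of that normalization step, assuming MFT (which provides mlxfwmanager) is installed on the node; the baseline version string below is a made-up example, not a real release:

```shell
#!/bin/sh
# Sketch: align a replacement ConnectX NIC with the cluster baseline.

BASELINE_FW="28.39.1002"   # hypothetical cluster-wide validated version

# Query the firmware currently flashed on the card:
#   mlxfwmanager --query
# Update against the vendor repository, typically followed by a reboot
# or a mlxfwreset before returning the node to service:
#   mlxfwmanager --online -u

fw_matches_baseline() {
    [ "$1" = "$BASELINE_FW" ]
}

if fw_matches_baseline "28.39.1002"; then
    echo "NIC firmware matches cluster baseline"
else
    echo "Firmware drift detected: update before returning node to service"
fi
```

In a BCM-managed cluster the same comparison can be automated against the version recorded for the node's category, so drift is caught before the node rejoins the scheduler.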
Incorrect Options: A. Swap the card while the server is running Replacing a PCIe card (NIC or GPU) while a training job is running is not supported in standard AI infrastructure and can cause a system crash or electrical damage. Distributed AI jobs are highly fragile; if a single node's network link vanishes, the entire multi-node job will fail immediately with an NCCL Timeout or Connection Refused error. Hardware replacement should always be performed during a scheduled maintenance window after the node has been drained of jobs.
C. Change the MAC address with a Sharpie A MAC address is a unique hardware identifier burned into the NIC's EEPROM. Physically writing on the PCB with a marker has no effect on the digital identity of the card. While MAC addresses can sometimes be "spoofed" in software, in an AI cluster, the Base Command Manager handles new hardware by detecting the new MAC and updating the cluster's internal DHCP/PXE database accordingly.
D. No steps are needed This is a common but dangerous misconception. Even cards with the same OPN (Ordering Part Number) can have different hardware revisions or ship with different firmware "PSIDs" (Parameter-Set Identifications). In a high-performance fabric (NDR 400G), even a minor firmware revision difference can lead to inconsistent packet handling or different thermal profiles, which disrupts the synchronous nature of AI workloads.
Question 37 of 60
37. Question
To facilitate the use of various AI models and tools, an administrator needs to install the NGC CLI on the cluster hosts. What is the main benefit of using the NGC CLI in a professional AI infrastructure, and how does it integrate with the control plane workflow?
Correct
Correct: C It allows users to download and manage optimized AI containers, pre-trained models, and scripts directly from the NVIDIA GPU Cloud repository.
The Technical Reason: The NGC CLI is the command-line interface for the NVIDIA NGC Catalog, which serves as the central hub for GPU-accelerated software.
Optimized Stacks: NGC provides "Golden Stack" containers (e.g., PyTorch, TensorFlow, NeMo) that are pre-configured with the correct versions of CUDA, cuDNN, and NCCL for peak performance on HGX/DGX systems.
Pre-trained Models: It provides access to a vast library of pre-trained models (like BERT, ResNet, or Llama variants) and "Resources" (deployment scripts, Helm charts, and Jupyter notebooks).
Automation: For professional infrastructure, the CLI is essential for automation. It allows administrators to write scripts that authenticate once (ngc config set) and then programmatically pull the latest images or models across hundreds of nodes.
The NCP-AII Context: The exam validates your ability to "Configure the NVIDIA Container Toolkit and authenticate with the NGC CLI." In a professional workflow, you use the NGC CLI to fetch the artifacts that the control plane (like Slurm + Enroot/Pyxis) will then distribute across the cluster.
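A sketch of that automation pattern is shown below. The ngc subcommands are the public CLI verbs, but the container tag is illustrative and the wrapper function is a hypothetical helper, not part of the tool:

```shell
#!/bin/sh
# Sketch of NGC CLI automation on cluster hosts.

# One-time, interactive: stores the API key and org in ~/.ngc/config
#   ngc config set

# Scripted pulls thereafter (tag shown is an example):
#   ngc registry image pull nvcr.io/nvidia/pytorch:24.05-py3
#   ngc registry model list 'nvidia/*'

# Hypothetical wrapper that fails loudly if the CLI is missing, so
# cron/Ansible runs surface the problem instead of silently skipping.
pull_image() {
    if ! command -v ngc >/dev/null 2>&1; then
        echo "ERROR: ngc CLI not installed" >&2
        return 127
    fi
    ngc registry image pull "$1"
}

pull_image "nvcr.io/nvidia/pytorch:24.05-py3" || echo "pull skipped (no ngc CLI here)"
```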
Incorrect Options: A. Replaces the standard Linux shell The NGC CLI is a standalone application (binary) that runs inside your standard Linux shell (Bash, Zsh). It does not replace the shell or limit the terminal to Python-only commands. You still use standard Linux commands for system management, and the NGC CLI simply adds the ngc command prefix for interacting with NVIDIA's cloud services.
B. Physically format hard drives Hard drive formatting and low-level disk layout are tasks handled by the Base Command Manager (BCM) or the OS installer (e.g., via Kickstart or Preseed) during the node provisioning phase. The NGC CLI operates at the application and content layer, long after the OS and file systems have been established.
D. Graphical user interface for GPU temperatures The NGC CLI is strictly a command-line tool (CLI), not a graphical user interface (GUI). Real-time monitoring of GPU temperatures across a cluster is performed using nvidia-smi (for single-node CLI), NVIDIA DCGM Exporter (with Prometheus/Grafana for cluster-wide GUI), or the monitoring dashboard within Base Command Manager.
Question 38 of 60
38. Question
After the physical installation and software configuration, an engineer must perform a multifaceted assessment of the cluster using ClusterKit. Which combination of tests within a standard validation workflow would best verify the end-to-end performance and stability of the GPU-to-GPU communication across multiple nodes?
Correct
Correct: A. Running NCCL tests to measure Inter-node and Intra-node bandwidth, followed by an HPL burn-in to verify the thermal stability and compute consistency of the entire fabric.
The NCP-AII certification blueprint explicitly lists "Run NCCL to verify E/W fabric bandwidth" and "Perform HPL burn-in" as core tasks within the Cluster Test and Verification domain, which comprises 33% of the examination.
NCCL tests are specifically designed to measure GPU-to-GPU communication performance both within a single node (intra-node) and across multiple nodes (inter-node), which directly validates the high-speed fabric (InfiniBand or RoCE) that connects the cluster.
Running NCCL tests first verifies that the East/West fabric bandwidth meets specifications and that GPUs can communicate efficiently for collective operations like all-reduce during distributed training.
Following NCCL tests with an HPL (High-Performance Linpack) burn-in validates the system under sustained maximum computational load, confirming thermal stability and consistent floating-point performance across the entire cluster.
This combination provides comprehensive validation of both communication performance (NCCL) and compute/thermal stability (HPL), covering the critical aspects required for production AI workloads.
The certification documentation confirms that ClusterKit is used to perform a "multifaceted node assessment," and the combination of NCCL and HPL tests directly addresses the verification of both fabric bandwidth and system stability.
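The two-phase workflow above might look like this in practice. The hostfile, process counts, binary paths, and bandwidth floor are illustrative; all_reduce_perf comes from the open-source nccl-tests suite and xhpl from an HPL build:

```shell
#!/bin/sh
# Sketch of a two-phase fabric validation pass.

# Phase 1 -- NCCL all-reduce across 2 nodes x 8 GPUs, sweeping message
# sizes from 8 bytes to 8 GB (doubling each step, 1 GPU per rank):
#   mpirun -np 16 --hostfile hosts.txt \
#       ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

# Phase 2 -- HPL burn-in to hold the cluster at sustained load:
#   mpirun -np 16 --hostfile hosts.txt ./xhpl

# Helper: flag results whose measured bus bandwidth falls below an
# expected floor (values in GB/s, thresholds are site-specific).
meets_bandwidth_floor() {
    measured="$1"; floor="$2"
    [ "$measured" -ge "$floor" ]
}

if meets_bandwidth_floor 370 360; then
    echo "fabric bandwidth within expected range"
else
    echo "bandwidth below floor: inspect cables, transceivers, firmware"
fi
```

Comparing every node pair against the same floor is what turns a raw benchmark into a pass/fail validation gate.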
Incorrect: B. Validating the storage throughput using a standard dd command to the local boot drive while the GPUs are in a MIG-enabled 'sleep' state.
This is incorrect because the dd command to a local boot drive is not a valid test for storage performance in an AI cluster context. The exam blueprint separately lists "Test storage" as a distinct verification task using appropriate storage benchmarking tools, not simple dd commands to boot drives. Additionally, a MIG-enabled 'sleep' state is not a valid concept for testing GPU-to-GPU communication across multiple nodes.
C. Performing a NeMo burn-in test on the BlueField-3 DPU's ARM cores while simultaneously upgrading the firmware on the HGX baseboard via the SMI tool.
This is incorrect because NeMo burn-in is an AI framework-specific test designed to run on GPUs, not on BlueField-3 DPU ARM cores. Additionally, performing firmware upgrades simultaneously with burn-in tests would violate standard maintenance procedures and could cause system instability. The SMI tool (System Management Interface) is for GPU management, not HGX baseboard firmware upgrades.
D. Executing a single-node stress test using the NGC CLI and then verifying the signal quality of the BMC's management network using a Fluke cable tester.
This is incorrect because the NGC CLI is a command-line interface for downloading containers and managing NGC resources, not a tool for executing stress tests. Single-node stress tests are performed using tools like HPL or NCCL, not the NGC CLI. Additionally, verifying BMC management network signal quality with a cable tester does not validate GPU-to-GPU communication across multiple nodes, which requires NCCL tests on the high-speed fabric.
Incorrect
Correct: A. Running NCCL tests to measure Inter-node and Intra-node bandwidth, followed by an HPL burn-in to verify the thermal stability and compute consistency of the entire fabric.
The NCP-AII certification blueprint explicitly lists “Run NCCL to verify E/W fabric bandwidth“ and “Perform HPL burn-in“ as core tasks within the Cluster Test and Verification domain, which comprises 33% of the examination .
NCCL tests are specifically designed to measure GPU-to-GPU communication performance both within a single node (intra-node) and across multiple nodes (inter-node), which directly validates the high-speed fabric (InfiniBand or RoCE) that connects the cluster .
Running NCCL tests first verifies that the East/West fabric bandwidth meets specifications and that GPUs can communicate efficiently for collective operations like all-reduce during distributed training .
Following NCCL tests with an HPL (High-Performance Linpack) burn-in validates the system under sustained maximum computational load, confirming thermal stability and consistent floating-point performance across the entire cluster .
This combination provides comprehensive validation of both communication performance (NCCL) and compute/thermal stability (HPL), covering the critical aspects required for production AI workloads .
The certification documentation confirms that ClusterKit is used to perform a “multifaceted node assessment,“ and the combination of NCCL and HPL tests directly addresses the verification of both fabric bandwidth and system stability .
Incorrect: B. Validating the storage throughput using a standard dd command to the local boot drive while the GPUs are in a MIG-enabled ‘sleep‘ state.
This is incorrect because the dd command to a local boot drive is not a valid test for storage performance in an AI cluster context. The exam blueprint separately lists “Test storage“ as a distinct verification task using appropriate storage benchmarking tools, not simple dd commands to boot drives . Additionally, MIG-enabled ‘sleep‘ state is not a valid concept for testing GPU-to-GPU communication across multiple nodes.
C. Performing a NeMo burn-in test on the BlueField-3 DPU‘s ARM cores while simultaneously upgrading the firmware on the HGX baseboard via the SMI tool.
This is incorrect because NeMo burn-in is an AI framework-specific test designed to run on GPUs, not on BlueField-3 DPU ARM cores . Additionally, performing firmware upgrades simultaneously with burn-in tests would violate standard maintenance procedures and could cause system instability. The SMI tool (System Management Interface) is for GPU management, not HGX baseboard firmware upgrades.
D. Executing a single-node stress test using the NGC CLI and then verifying the signal quality of the BMC‘s management network using a Fluke cable tester.
This is incorrect because NGC CLI is a command-line interface for downloading containers and managing NGC resources, not a tool for executing stress tests . Single-node stress tests are performed using tools like HPL or NCCL, not NGC CLI. Additionally, verifying BMC management network signal quality with a cable tester does not validate GPU-to-GPU communication across multiple nodes, which requires NCCL tests on the high-speed fabric.
Unattempted
Correct: A. Running NCCL tests to measure Inter-node and Intra-node bandwidth, followed by an HPL burn-in to verify the thermal stability and compute consistency of the entire fabric.
The NCP-AII certification blueprint explicitly lists “Run NCCL to verify E/W fabric bandwidth“ and “Perform HPL burn-in“ as core tasks within the Cluster Test and Verification domain, which comprises 33% of the examination .
NCCL tests are specifically designed to measure GPU-to-GPU communication performance both within a single node (intra-node) and across multiple nodes (inter-node), which directly validates the high-speed fabric (InfiniBand or RoCE) that connects the cluster .
Running NCCL tests first verifies that the East/West fabric bandwidth meets specifications and that GPUs can communicate efficiently for collective operations like all-reduce during distributed training .
Following NCCL tests with an HPL (High-Performance Linpack) burn-in validates the system under sustained maximum computational load, confirming thermal stability and consistent floating-point performance across the entire cluster .
This combination provides comprehensive validation of both communication performance (NCCL) and compute/thermal stability (HPL), covering the critical aspects required for production AI workloads .
The certification documentation confirms that ClusterKit is used to perform a “multifaceted node assessment,“ and the combination of NCCL and HPL tests directly addresses the verification of both fabric bandwidth and system stability .
Incorrect: B. Validating the storage throughput using a standard dd command to the local boot drive while the GPUs are in a MIG-enabled ‘sleep‘ state.
This is incorrect because the dd command to a local boot drive is not a valid test for storage performance in an AI cluster context. The exam blueprint separately lists “Test storage“ as a distinct verification task using appropriate storage benchmarking tools, not simple dd commands to boot drives . Additionally, MIG-enabled ‘sleep‘ state is not a valid concept for testing GPU-to-GPU communication across multiple nodes.
C. Performing a NeMo burn-in test on the BlueField-3 DPU's ARM cores while simultaneously upgrading the firmware on the HGX baseboard via the SMI tool.
This is incorrect because a NeMo burn-in is an AI framework-specific test designed to run on GPUs, not on BlueField-3 DPU ARM cores. Additionally, performing firmware upgrades simultaneously with burn-in tests would violate standard maintenance procedures and could cause system instability. The SMI tool (System Management Interface) is for GPU management, not for HGX baseboard firmware upgrades.
D. Executing a single-node stress test using the NGC CLI and then verifying the signal quality of the BMC's management network using a Fluke cable tester.
This is incorrect because the NGC CLI is a command-line interface for downloading containers and managing NGC resources, not a tool for executing stress tests. Single-node stress tests are performed using tools like HPL or NCCL, not the NGC CLI. Additionally, verifying BMC management network signal quality with a cable tester does not validate GPU-to-GPU communication across multiple nodes, which requires NCCL tests on the high-speed fabric.
Question 39 of 60
39. Question
An IT professional is setting up the control plane for a new NVIDIA-certified cluster using Base Command Manager (BCM). During the installation, the administrator needs to configure High Availability (HA) for the head nodes. Which of the following is a requirement for a successful BCM HA configuration to ensure the cluster remains operational if the primary head node fails?
Correct: B. A dedicated heartbeat network between the head nodes and a shared storage mechanism or synchronized database for the cluster configuration metadata.
The Technical Reason: To maintain a "single source of truth" while providing redundancy, BCM requires:
Heartbeat Network: A reliable communication link (often a dedicated RJ45 connection or a specific VLAN) used by the cmdaemon on both nodes to monitor each other's status. If the secondary node detects a loss of heartbeats from the primary, it initiates a failover.
Synchronized State: The BCM configuration (MySQL database, LDAP, and workload manager state such as Slurm) must be replicated. During the cmha-setup process, the secondary node's database is recloned from the primary to ensure metadata consistency.
Shared/Replicated Storage: A shared /home or /cm/shared directory is typically configured (often via high-availability NFS or local synchronization) so that users and jobs see the same environment regardless of which head node is active.
The NCP-AII Context: The exam expects familiarity with the cmha-setup wizard, which specifically prompts for the selection of a "failover interface" (internal or dedicated) and a Virtual IP (VIP) that migrates between nodes.
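The heartbeat-then-failover behavior described above can be modeled in a few lines. This is a toy sketch, not cmdaemon's actual implementation; the class name and missed-beat threshold are illustrative:

```python
# Toy model of heartbeat-driven head-node failover: the secondary counts
# missed heartbeats from the primary and, past a threshold, promotes
# itself (in real BCM, this is when the VIP migrates to the secondary).

class HeartbeatMonitor:
    def __init__(self, max_missed: int = 3):
        self.max_missed = max_missed
        self.missed = 0
        self.active_node = "primary"

    def beat_received(self) -> None:
        # A heartbeat from the primary resets the failure counter.
        self.missed = 0

    def tick(self) -> str:
        # Called once per heartbeat interval on the secondary; counts
        # missed beats and fails over once the threshold is reached.
        self.missed += 1
        if self.missed >= self.max_missed and self.active_node == "primary":
            self.active_node = "secondary"  # VIP would migrate here
        return self.active_node
```

The reason the database must be pre-synchronized is visible in this model: at the moment `active_node` flips, the secondary must already hold consistent cluster metadata, because the primary can no longer be consulted.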
Incorrect Options: A. Using the NGC CLI to replicate to the cloud: The NGC CLI is used for downloading AI containers and pre-trained models. It is not a system backup or real-time replication tool for BCM head nodes. High Availability requires local, low-latency failover; a cloud-based recovery would result in significant downtime and is not part of the BCM HA hardware specification.
C. BlueField-3 DPUs as primary head nodes: While BlueField-3 DPUs can offload networking and security, they do not act as the "Primary Master" for the BCM control plane. BCM head nodes are typically x86 or Arm-based servers that manage the DPUs. In an HA setup, both nodes are full servers, not DPUs.
D. NVIDIA Container Toolkit on the BMC: The BMC (Baseboard Management Controller) is a small management processor for hardware-level monitoring (fans, power). It cannot run the NVIDIA Container Toolkit or the Slurm daemon. Failover logic is handled by the BCM software running on the host OS, not through the Out-of-Band (OOB) firmware.
Question 40 of 60
40. Question
An administrator is installing NVIDIA Base Command Manager (BCM) to manage a new AI cluster. After the initial installation, they need to configure High Availability (HA) for the head node. What is the correct procedure to ensure the cluster remains operational if the primary head node fails?
Correct: D. Install BCM on a second head node, configure a shared heartbeat network, and synchronize the cluster database and configuration files.
This is correct because the NCP-AII certification blueprint explicitly includes "Install the Base Command Manager (BCM), configure and verify the HA" as a core task within the Control Plane Installation and Configuration domain, which comprises 19% of the examination.
The official BCM documentation confirms that High Availability (HA) configuration is a major step in the cluster bring-up process, listed alongside head node installation and category creation.
During BCM installation, the license request specifically prompts for an HA parameter; if HA is set to Y, a secondary head node must be designated during the licensing phase.
The license request process also requires specifying the MAC address of the first NIC of the secondary head node so that it can also serve BCM licenses in the event of a failover, confirming that a second head node must be prepared with shared access to licensing.
The heartbeat network is implied by the requirement that both head nodes have connectivity to serve licenses during failover, and the cluster database and configuration synchronization is managed through BCM's internal mechanisms when HA is properly configured.
Incorrect: A. Deploy a load balancer in front of the GPUs to redistribute CUDA kernels if the BCM service becomes unresponsive.
This is incorrect because load balancers distribute network traffic, not CUDA kernels or GPU compute workloads. GPU kernel execution is managed by the workload scheduler (like Slurm), not by load balancers in front of GPUs. Additionally, this approach does not address head node redundancy, which is the core requirement for HA configuration.
B. Manually copy the /etc/shadow file to all compute nodes every hour to synchronize user credentials.
This is incorrect because BCM already manages user credentials centrally through its integrated LDAP service that runs on the head nodes. Manual file copying is not a valid HA solution and does not address head node failover. User and group management in BCM is handled through the single-system model, automatically propagating changes across the cluster.
C. Enable 'HA-Mode' in the BIOS of all compute nodes so they can elect a new leader using a Paxos-based consensus algorithm.
This is incorrect because High Availability for the BCM head node is configured at the software level, not in the compute node BIOS. Compute nodes do not elect head node leaders using consensus algorithms; they are provisioned and managed by the head node. BIOS settings have no role in BCM HA configuration.
Question 41 of 60
41. Question
During the initial physical bring-up of an NVIDIA HGX H100 system, an administrator observes that the Baseboard Management Controller (BMC) reports a power-capping event shortly after the system begins a validation stress test. Although the rack PDUs are within their specified limits, the system performance is being throttled. Which specific action should be prioritized to ensure the server meets the high-power demands of AI factory workloads?
Correct: B. Verify the Power Supply Unit (PSU) redundancy policy in the BMC and ensure all power cables are seated and connected to independent circuits.
This is the correct priority action because a BMC-reported power-capping event despite rack PDUs operating within limits strongly indicates a power delivery configuration issue at the server level, not a cluster-wide power shortage.
The NVIDIA documentation for HGX systems specifies that power-capping features must be correctly enabled for N+N redundant configurations to ensure safe and high-performance operation.
To enable PSU redundancy support, the power budget limit must be set appropriately (e.g., 12 kW) using specific ipmitool raw commands (e.g., ipmitool raw 0x3c 0x81 0x05 0xE0 0x2E).
If power cables are not fully seated or are connected to the same circuit rather than independent circuits, the system's power-sensing logic will detect a fault and invoke a power cap to protect the hardware, causing performance throttling.
By default, a system may boot with three power supplies, but to achieve safe operation of an N+N configuration, the power-capping feature must be enabled to limit the power consumed by the system.
Addressing the physical power delivery and redundancy configuration directly resolves the root cause, ensuring the server can draw the full power required to meet high-performance AI workload demands without hardware-induced throttling.
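The N+N budget arithmetic behind this can be sketched as a toy model (function names and PSU wattages are illustrative, not NVIDIA specifications): with N supplies per independent feed, the safe budget is what one feed alone can deliver, so a hypothetical 4+4 configuration of 3 kW supplies yields the 12 kW budget mentioned above.

```python
# Toy model of an N+N redundant PSU power budget. In N+N redundancy the
# system must keep running if one entire feed (N supplies) is lost, so
# the usable budget is what the surviving feed can deliver: N * psu_watts.

def nn_power_budget(psu_watts: int, n_per_feed: int) -> int:
    """Safe sustained power budget (watts) for an N+N configuration."""
    return n_per_feed * psu_watts

def cap_engaged(draw_watts: int, budget_watts: int) -> bool:
    # The BMC invokes a power cap (throttling) when demand exceeds the
    # budget that redundancy can safely cover.
    return draw_watts > budget_watts

# Example: four hypothetical 3 kW supplies per feed -> 12 kW budget;
# a 12.5 kW draw would trigger capping, an 11 kW draw would not.
```

This is also why cabling both PSU banks to the same circuit defeats the design: losing that one circuit drops both "independent" feeds at once, so the sensing logic caps power preemptively.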
Incorrect: A. Reinstall the NVIDIA Container Toolkit to recalibrate the power sensing logic of the underlying operating system.
This is incorrect because the NVIDIA Container Toolkit is a software component for enabling GPU access within containers, not for power sensing or power capping functionality. Reinstalling it would not affect BMC-reported power events.
C. Update the TPM firmware to version 2.0 to allow for higher power draw authorization from the motherboard components.
This is incorrect because the Trusted Platform Module (TPM) is a security chip used for cryptographic operations and platform integrity (secure boot, encryption), not for managing or authorizing power draw.
D. Decrease the GPU clock frequency via nvidia-smi to manually stay under the current power threshold.
This is incorrect because manually throttling GPU clocks via nvidia-smi would reduce performance, which is contrary to ensuring the server meets high-performance demands. This is a temporary workaround that accepts the power cap, rather than resolving the underlying configuration or hardware issue causing it.
Question 42 of 60
42. Question
When optimizing an AI cluster that uses both NVIDIA GPUs and AMD EPYC CPUs, an engineer notices that the NUMA affinity settings are misconfigured, leading to increased latency in data transfers between the NIC and the GPU. Which tool or configuration should be adjusted to ensure that the GPU and the network card are communicating over the same NUMA node?
Correct: C. Use lscpu to identify the affinity and then adjust the Slurm configuration or use numactl to bind the processes to the specific CPU cores physically closest to the PCIe root complex of the GPU and NIC.
The NCP-AII certification blueprint explicitly includes "Execute performance optimization for AMD and Intel servers" as a core task within the Troubleshoot and Optimize domain, which comprises 12% of the examination.
NUMA (Non-Uniform Memory Access) optimization is critical for AMD EPYC-based systems because these processors are composed of multiple dies (chiplets) connected by a cache-coherent interconnect, with each chiplet having its own PCIe root complex and memory controller.
For optimal performance between a GPU and NIC, the benchmark processes should run on the chiplet whose PCIe root complex both devices are directly connected to, and use the DRAM controller on that same chiplet.
The lscpu command can be used to identify the NUMA topology and device affinities. numactl is then the appropriate tool to bind processes to specific CPU cores and memory nodes, ensuring that communication stays within the same NUMA domain and avoids the overhead of cross-chiplet hops.
Slurm, as the workload manager, can also be configured to enforce GPU-CPU binding using options like --gres-flags=enforce_binding or by configuring Cores= in gres.conf to ensure jobs are scheduled with proper NUMA affinity.
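The affinity check itself can be scripted: on Linux every PCI device exposes its NUMA node via the sysfs `numa_node` attribute, so a GPU and NIC can be compared directly and the matching numactl binding constructed. A small sketch (the PCI addresses implied and the helper names are illustrative):

```python
# Sketch: verify that a GPU and a NIC share a NUMA node by reading the
# standard Linux sysfs attribute, then build the numactl binding that
# pins a process's CPUs and memory to that node.

from pathlib import Path

def pci_numa_node(bdf: str, sysfs: str = "/sys/bus/pci/devices") -> int:
    # Each PCI device (e.g. bdf="0000:41:00.0") reports its NUMA node
    # here; -1 means the platform did not report one.
    return int(Path(sysfs, bdf, "numa_node").read_text().strip())

def numactl_prefix(gpu_node: int, nic_node: int) -> list[str]:
    """Return a numactl argv prefix binding CPU and memory to the shared
    NUMA node, or raise if the devices sit on different nodes."""
    if gpu_node != nic_node:
        raise ValueError(f"GPU on node {gpu_node}, NIC on node {nic_node}: "
                         "devices are not NUMA-local to each other")
    return ["numactl", f"--cpunodebind={gpu_node}", f"--membind={gpu_node}"]
```

When the two nodes differ, the fix is physical (rebalance PCIe slots) or scheduling-level (Slurm binding), not something numactl alone can paper over.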
Incorrect: A. Disable NUMA entirely in the BIOS so that the CPU behaves as a single large core, which automatically eliminates all latency issues associated with memory bank access.
This is incorrect because disabling NUMA does not eliminate the physical architecture of multiple memory controllers and PCIe root complexes. Modern AMD EPYC processors are fundamentally NUMA-based with multiple chiplets, and disabling NUMA in BIOS would not change the underlying hardware topology or eliminate latency penalties for cross-chiplet communication.
B. Use the nvidia-smi -ac command to manually set the GPU clock speeds to match the CPU clock speeds, which synchronizes the data transfer timing across the PCIe bus.
This is incorrect because nvidia-smi -ac is used to set GPU application clocks and memory transfer rates, not to synchronize data transfer timing with CPU clocks. GPU clock settings have no impact on NUMA affinity or the physical path data takes across the PCIe bus relative to CPU cores.
D. Update the DOCA drivers on the BlueField DPU, as the DPU is responsible for virtually remapping the CPU NUMA domains to match the GPU memory layout.
This is incorrect because the BlueField DPU does not remap CPU NUMA domains. NUMA topology is determined by the physical CPU architecture and is reported by the system BIOS. The DPU operates as a separate device on the PCIe bus and cannot alter the host CPU's memory hierarchy or NUMA node configuration.
Incorrect
Correct: C. Use lscpu to identify the affinity and then adjust the Slurm configuration or use numactl to bind the processes to the specific CPU cores physically closest to the PCIe root complex of the GPU and NIC.
The NCP-AII certification blueprint explicitly includes “Execute performance optimization for AMD and Intel servers“ as a core task within the Troubleshoot and Optimize domain, which comprises 12% of the examination .
NUMA (Non-Uniform Memory Access) optimization is critical for AMD EPYC-based systems because these processors are composed of multiple dies (chiplets) connected by a cache-coherent interconnect, with each chiplet having its own PCIe root complex and memory controller .
For optimal performance between a GPU and NIC, the benchmark processes should run on the chiplet with the PCIe root complex to which both devices are directly connected, and use the DRAM controller on that same chiplet .
The lscpu command can be used to identify the NUMA topology and device affinities. numactl is then the appropriate tool to bind processes to specific CPU cores and memory nodes, ensuring that communication stays within the same NUMA domain and avoids the overhead of cross-chiplet hops .
Slurm, as the workload manager, can also be configured to enforce GPU-CPU binding using options like –gres-flags=enforce_binding or by configuring Cores= in gres.conf to ensure jobs are scheduled with proper NUMA affinity .
Incorrect: A. Disable NUMA entirely in the BIOS so that the CPU behaves as a single large core, which automatically eliminates all latency issues associated with memory bank access.
This is incorrect because disabling NUMA does not eliminate the physical architecture of multiple memory controllers and PCIe root complexes. Modern AMD EPYC processors are fundamentally NUMA-based with multiple chiplets, and disabling NUMA in BIOS would not change the underlying hardware topology or eliminate latency penalties for cross-chiplet communication .
B. Use the nvidia-smi -ac command to manually set the GPU clock speeds to match the CPU clock speeds, which synchronizes the data transfer timing across the PCIe bus.
This is incorrect because nvidia-smi -ac is used to set GPU application clocks and memory transfer rates, not to synchronize data transfer timing with CPU clocks. GPU clock settings have no impact on NUMA affinity or the physical path data takes across the PCIe bus relative to CPU cores.
D. Update the DOCA drivers on the BlueField DPU, as the DPU is responsible for virtually remapping the CPU NUMA domains to match the GPU memory layout.
This is incorrect because the BlueField DPU does not remap CPU NUMA domains. NUMA topology is determined by the physical CPU architecture and is reported by the system BIOS. The DPU operates as a separate device on the PCIe bus and cannot alter the host CPU‘s memory hierarchy or NUMA node configuration.
Unattempted
Correct: C. Use lscpu to identify the affinity and then adjust the Slurm configuration or use numactl to bind the processes to the specific CPU cores physically closest to the PCIe root complex of the GPU and NIC.
The NCP-AII certification blueprint explicitly includes "Execute performance optimization for AMD and Intel servers" as a core task within the Troubleshoot and Optimize domain, which comprises 12% of the examination.
NUMA (Non-Uniform Memory Access) optimization is critical for AMD EPYC-based systems because these processors are composed of multiple dies (chiplets) connected by a cache-coherent interconnect, with each chiplet having its own PCIe root complex and memory controller.
For optimal performance between a GPU and NIC, the benchmark processes should run on the chiplet with the PCIe root complex to which both devices are directly connected, and use the DRAM controller on that same chiplet.
The lscpu command can be used to identify the NUMA topology and device affinities. numactl is then the appropriate tool to bind processes to specific CPU cores and memory nodes, ensuring that communication stays within the same NUMA domain and avoids the overhead of cross-chiplet hops.
Slurm, as the workload manager, can also be configured to enforce GPU-CPU binding using options like --gres-flags=enforce_binding or by configuring Cores= in gres.conf to ensure jobs are scheduled with proper NUMA affinity.
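As an illustrative sketch of this workflow (the PCI address, NUMA node number, and ./train.sh launcher below are placeholders, not values from this question):

```shell
# 1. Show which CPU ranges belong to each NUMA node.
lscpu | grep -i 'numa node'

# 2. Ask the kernel which NUMA node a GPU's PCI function lives on.
#    (0000:17:00.0 is a placeholder; list real addresses with nvidia-smi.)
cat /sys/bus/pci/devices/0000:17:00.0/numa_node

# 3. Pin the workload's CPU and memory allocations to that node.
#    (./train.sh is a hypothetical launcher; node 1 is an example.)
numactl --cpunodebind=1 --membind=1 ./train.sh

# Offline illustration: extracting node 1's CPU range from lscpu-style output.
printf 'NUMA node0 CPU(s): 0-31\nNUMA node1 CPU(s): 32-63\n' | awk '/node1/ {print $NF}'
```

The key idea is that the node number reported in step 2 is the argument you pass to numactl in step 3, so GPU, NIC, and process all share one PCIe root complex.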
Question 43 of 60
An administrator notices that some nodes in the cluster are performing significantly slower during HPL tests. Upon investigation, they find that the nvidia-smi -q command shows the Clocks Throttle Reason as SW Thermal Slowdown. What is the most appropriate action?
Correct: D. Inspect the physical airflow path for obstructions, verify that the fans are operating at the correct RPM, and check for failed cooling components in the BMC.
This is correct because the "SW Thermal Slowdown" throttle reason indicates that the GPU's clocks are being reduced due to thermal constraints. According to NVIDIA's documentation, SW Thermal Slowdown means "the current clocks have been optimized to ensure…current GPU temperature does not exceed GPU Max Operating Temperature".
The NCP-AII certification blueprint explicitly includes "Identify faulty cards, GPUs, and power supplies" and "Validate power and cooling parameters" as core tasks within the System and Server Bring-up and Troubleshoot domains.
Thermal throttling in H100 systems is most commonly caused by physical cooling failures including:
Fan failures or incorrect RPM operation
Airflow obstruction due to dust accumulation or improper cabling
Failed cooling components that can be identified through BMC monitoring
The systematic approach of inspecting physical airflow, verifying fan operation, and checking BMC logs aligns with proper hardware troubleshooting methodology before considering software or configuration changes.
Incorrect: A. Replace the BlueField-3 DPU, as it is likely failing to route the cooling commands from the Base Command Manager to the GPU's firmware controller.
This is incorrect because the BlueField-3 DPU is not involved in GPU thermal management. The DPU handles networking, storage offload, and security functions. Cooling commands and fan control are managed by the Baseboard Management Controller (BMC) and system thermal firmware, not the DPU.
B. Use the NGC CLI to update the PyTorch container to a version that uses less GPU memory, thereby reducing the heat generated by the H100 cores during the test.
This is incorrect because changing container versions does not address the underlying thermal issue. NGC CLI is used for downloading containers and managing NGC resources. While lower memory usage might slightly reduce heat, it is a workaround that accepts throttling rather than fixing the root cause, and would not resolve a "SW Thermal Slowdown" condition caused by actual cooling failures.
C. Modify the Slurm configuration to automatically reboot any node that reports a thermal slowdown to allow the GPU to cool down during the POST process.
This is incorrect because Slurm is a workload manager for job scheduling and has no role in GPU thermal management. Rebooting nodes experiencing thermal throttling is a temporary, reactive measure that does not address the physical cooling failure. Proper thermal management requires identifying and fixing the root cause of inadequate cooling, not automated reboots.
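A hedged sketch of the diagnostic commands (field names can vary slightly across driver versions; newer drivers label the section "Clocks Event Reasons" rather than "Clocks Throttle Reasons"):

```shell
# Dump only the performance section, which contains the throttle reasons.
nvidia-smi -q -d PERFORMANCE

# Cross-check temperature and SM clock per GPU, one CSV line each.
nvidia-smi --query-gpu=index,temperature.gpu,clocks.sm --format=csv

# Offline illustration: counting active slowdown flags in a captured dump.
printf 'SW Thermal Slowdown : Active\n' | grep -c 'SW Thermal Slowdown'
```

If the slowdown flag is Active while reported temperatures sit near the slowdown threshold, that corroborates a physical cooling problem rather than a software misconfiguration.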
Question 44 of 60
A cluster node is reporting a hardware fault where one of the six fans is spinning at 0 RPM, and the corresponding GPU is reporting Thermal Violation in nvidia-smi. What is the correct troubleshooting and remediation path for an NVIDIA-Certified system in a production environment?
Correct: B. Identify the faulty fan module, verify the fault in the BMC logs, and replace the fan unit following the server's hot-swap or FRU procedures.
The Technical Reason: NVIDIA-Certified systems are designed with high availability in mind.
Diagnostic Tools: The primary method to confirm a hardware failure is through the Baseboard Management Controller (BMC) or the NVIDIA System Management (NVSM) tool (e.g., sudo nvsm show fans). These tools identify exactly which fan ID has failed.
FRU (Field Replaceable Unit): Most modern NVIDIA systems use hot-swappable fan modules. Replacing a failed fan typically involves pulling the handle of the identified module and inserting a new one.
Thermal Protection: Once the fan is replaced and airflow is restored, the GPU's onboard sensors will detect the drop in temperature, and the "Thermal Violation" flag will clear, allowing the GPU to return to its peak performance state.
The NCP-AII Context: The exam validates your ability to follow official Service Manual procedures. A critical detail often tested is the "30-second rule": on systems like the DGX H100, a fan module must be replaced within 30-60 seconds to prevent the other components from overheating due to the loss of static pressure in the chassis.
Incorrect: A. Remove the thermal paste. Thermal paste (TIM) is essential for conducting heat from the GPU die to the heatsink. Removing it would cause the GPU to reach its critical thermal shutdown temperature (100°C+) almost instantly, potentially causing permanent hardware damage. It does not help the GPU "breathe" better; it effectively suffocates its heat dissipation capability.
C. Overclock the other five fans to 150 percent. While enterprise fans can sometimes be set to a "Full Speed" or "Max Performance" profile in the BIOS/BMC, you cannot "overclock" them to 150% of their rated physical speed. Furthermore, AI server cooling is carefully balanced for specific static pressure. A dead fan creates a "path of least resistance" where air might bypass the GPUs entirely; simply spinning the other fans faster cannot reliably fix the localized hot spot created by a dead module.
D. Switch to a different Linux distribution. Software changes do not fix physical hardware failures. While a "lighter" OS might reduce idle CPU load, the thermal requirements of a training job on a 700W H100 GPU remain identical regardless of the Linux distro. An NVIDIA-Certified system requires a validated software stack (like DGX OS or NVIDIA AI Enterprise) to ensure proper driver and hardware management.
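A hedged sketch of the fan-verification step (sensor names vary by vendor BMC, and nvsm is available on DGX systems only):

```shell
# Read all fan sensors from the BMC's sensor data repository.
ipmitool sdr type Fan

# On DGX systems, NVSM reports per-fan health directly.
sudo nvsm show fans

# Offline illustration: flagging a 0 RPM reading in a captured sensor dump
# (the FAN names here are invented sample data).
printf 'FAN1_F | 8000 RPM | ok\nFAN4_F | 0 RPM | cr\n' \
  | awk -F'|' '$2+0 == 0 {gsub(/ /, "", $1); print $1}'
```

Confirming the same failed fan ID in both the OS-side tool and the BMC rules out a sensor glitch before you pull the module.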
Question 45 of 60
A system administrator identifies a faulty H100 GPU in an HGX baseboard that is causing consistent Bus Errors. After confirming the hardware failure, what is the correct high-level procedure for replacing the GPU while maintaining the integrity of the remaining system components?
Correct: D. Power down the system, follow anti-static procedures, remove the HGX heat sink assembly, replace the specific GPU module applying correct torque to the fasteners, and then re-validate with a burn-in test.
This is correct because the NCP-AII certification blueprint explicitly includes "Replace faulty cards, GPUs, and power supplies" as a core task within the Troubleshoot and Optimize domain, which comprises 12% of the examination.
The official NVIDIA DGX H100/H200 Service Manual provides detailed procedures for component replacement that align with this sequence:
Power down the system: The service manual specifies shutting down the system before performing any hardware replacement procedures.
Follow anti-static procedures: The documentation includes "Electrostatic Discharge" warnings and emphasizes that "Static Sensitive Devices" require best practices for ESD protection, including wrist straps connected to chassis ground.
Remove the HGX heat sink assembly: While the service manual doesn't show GPU-specific steps, the DIMM replacement procedure demonstrates the methodology of removing components to access the target module.
Replace the specific GPU module applying correct torque: Hardware replacement procedures require proper torque specifications for fasteners to ensure thermal performance and mechanical integrity.
Re-validate with a burn-in test: The service manual explicitly recommends "Running the Pre-flight Test" after servicing, stating "NVIDIA recommends running the pre-flight stress test before putting a system into a production environment or after servicing".
Incorrect: A. The H100 GPUs are hot-swappable; the administrator should use a specialized extraction tool to pull the GPU while the system is running and insert a new one immediately.
This is incorrect because H100 GPUs in HGX baseboards are not hot-swappable components. The service documentation does not list GPUs as customer-replaceable units that can be replaced while the system is running. PCIe devices in server environments require system power-down for safe replacement.
B. Use the nvidia-smi -r command to logically reset the GPU, which physically ejects the faulty silicon from the socket so it can be picked up from the bottom of the chassis.
This is incorrect because nvidia-smi -r (--gpu-reset) performs a software-level reset of GPU state; no nvidia-smi command can physically eject hardware. The nvidia-smi tool is for GPU management and monitoring, and the notion of silicon being ejected from the socket is fabricated with no basis in actual NVIDIA tooling.
C. Replace the entire motherboard because individual GPUs on an HGX baseboard are permanently soldered and cannot be replaced without specialized factory wave-soldering equipment.
This is incorrect because while GPU replacement requires technical expertise, the certification blueprint specifically includes "Replace faulty…GPUs" as a required skill. Individual GPU modules on HGX baseboards are designed to be replaceable, though the procedure involves proper disassembly of the cooling solution and careful handling.
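The re-validation step can be sketched with DCGM's diagnostic suite; this is an illustrative fragment (run levels and test coverage depend on the installed DCGM version):

```shell
# Run the long DCGM diagnostic after reassembly; every test should report PASS.
dcgmi diag -r 3

# Confirm the replaced GPU is back on the PCIe bus and NVLink is healthy.
nvidia-smi
nvidia-smi nvlink -s

# Offline illustration: tallying PASS lines in a captured diag report
# (the two report lines are invented sample data).
printf 'Deployment  | PASS\nPCIe        | PASS\n' | grep -c PASS
```

A clean diagnostic run plus a full GPU count in nvidia-smi is the evidence that the node can safely re-enter the production scheduler.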
Question 46 of 60
A validation engineer is performing a cluster-wide High-Performance Linpack (HPL) test to verify the computational performance and thermal stability of a new AI factory. During the test, one node consistently reports significantly lower GFLOPS than the others. Which diagnostic step should be taken first to identify if the issue is related to the physical layer or the software configuration?
Correct: A. Run a single-node NCCL test on the slow node to isolate whether the performance drop is due to internal GPU communication or external fabric issues.
The Technical Reason: HPL performance is heavily dependent on the efficiency of data movement.
Isolation Strategy: By running a single-node NCCL test (e.g., all_reduce_perf with -b 8G -e 8G -f 2 -g 8), the engineer can see if the eight GPUs inside that specific node can communicate with each other at the expected NVLink speeds (typically 900GB/s aggregate for H100).
Data Plane vs. Compute: If the single-node test passes with high bandwidth, the issue likely resides in the External Fabric (InfiniBand/DPU) or the HPL Software Configuration (e.g., incorrect MPI mapping). If the single-node test fails or shows low bandwidth, the issue is internal to that node (e.g., a faulty NVSwitch or a GPU that has "fallen off the bus").
The NCP-AII Context: The exam validates your ability to use NVIDIA Collective Communications Library (NCCL) tests as a diagnostic tool. It is the fastest way to “divide and conquer“ a performance problem between the internal node topology and the external cluster fabric.
Incorrect Options: B. Update the BIOS on all other nodes to match the slow node. In a production AI cluster, you never "downgrade" or "equalize" the entire cluster to match a failing or slow node. This would reduce the overall ROI of the infrastructure. The goal is to bring the slow node up to the "Validated Golden Recipe" standard, not to propagate a potential configuration error across the entire fabric.
C. Check the NVLink Switch status and signal quality. While this is a valid troubleshooting step for internal communication, it is too specific to be the first step. An HPL slowdown could be caused by many things: CPU throttling, slow system RAM, a bad InfiniBand cable, or a software environment variable. Running an NCCL test first (Option A) tells you whether you need to check the NVLink switches (Option C).
D. Immediately replace the motherboard. A lower HPL score is a performance metric, not a hardware death certificate. It could be caused by something as simple as the power profile being set to "Quiet" instead of "Max Performance" in the BIOS, or a single DIMM being unseated. Replacing a motherboard is a "Tier 3" destructive repair and should only be done after all software and simple physical (cabling/seating) checks have been exhausted.
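The isolation run can be sketched as follows (the binary path follows the nccl-tests build layout, and the bandwidth figures in the comment are illustrative, not measured values):

```shell
# Single-node all-reduce across all 8 GPUs with a fixed 8 GiB message size.
./build/all_reduce_perf -b 8G -e 8G -f 2 -g 8

# Offline illustration: nccl-tests reports algbw and busbw; for all_reduce,
# busbw = algbw * 2*(n-1)/n, so 8 GPUs at 400 GB/s algbw imply 700 GB/s busbw.
awk 'BEGIN { n = 8; algbw = 400; printf "%.0f\n", algbw * 2 * (n - 1) / n }'
```

Comparing the reported busbw against the node's expected NVLink bandwidth is what separates an internal-fabric fault from an external-fabric or software problem.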
Unattempted
Correct: A Run a single-node NCCL test on the slow node to isolate whether the performance drop is due to internal GPU communication or external fabric issues.
The Technical Reason: HPL performance is heavily dependent on the efficiency of data movement.
Isolation Strategy: By running a single-node NCCL test (e.g., all_reduce_perf with -b 8G -e 8G -f 2 -g 8), the engineer can see if the eight GPUs inside that specific node can communicate with each other at the expected NVLink speeds (typically 900GB/s aggregate for H100).
Data Plane vs. Compute: If the single-node test passes with high bandwidth, the issue likely resides in the External Fabric (InfiniBand/DPU) or the HPL Software Configuration (e.g., incorrect MPI mapping). If the single-node test fails or shows low bandwidth, the issue is internal to that node (e.g., a faulty NVSwitch or a GPU that has “fallen off the bus“).
The NCP-AII Context: The exam validates your ability to use NVIDIA Collective Communications Library (NCCL) tests as a diagnostic tool. It is the fastest way to “divide and conquer“ a performance problem between the internal node topology and the external cluster fabric.
Incorrect Options: B. Update the BIOS on all other nodes to match the slow node In a production AI cluster, you never “downgrade“ or “equalize“ the entire cluster to match a failing or slow node. This would reduce the overall ROI of the infrastructure. The goal is to bring the slow node up to the “Validated Golden Recipe“ standard, not to propagate a potential configuration error across the entire fabric.
C. Check the NVLink Switch status and signal quality While this is a valid troubleshooting step for internal communication, it is too specific to be the first step. An HPL slowdown could be caused by many things: CPU throttling, slow system RAM, a bad InfiniBand cable, or a software environment variable. Running a NCCL test first (Option A) tells you if you need to check the NVLink switches (Option C).
D. Immediately replace the motherboard A lower HPL score is a performance metric, not a hardware death certificate. It could be caused by something as simple as the power profile being set to “Quiet“ instead of “Max Performance“ in the BIOS, or a single DIMM being unseated. Replacing a motherboard is a “Tier 3“ destructive repair and should only be done after all software and simple physical (cabling/seating) checks have been exhausted.
Question 47 of 60
47. Question
During the initial configuration of a new AI server node, the administrator needs to ensure the platform is secure and the Out-of-Band (OOB) management is isolated. Which set of tasks correctly describes the initial setup of the BMC, OOB, and TPM for a production-ready NVIDIA-certified system?
Correct: C Enable the TPM in the BIOS/UEFI, configure a dedicated management subnet for the OOB interface, and update the BMC password from the factory default.
The Technical Reason:
TPM (Trusted Platform Module): Enabling TPM 2.0 is a requirement for “Measured Boot” and hardware-based security. It ensures that the firmware and OS bootloaders have not been tampered with, which is critical for the integrity of an AI Factory.
OOB Isolation: Best practices for AI infrastructure require the Out-of-Band (OOB) management interface (the BMC port) to be on a physically or logically isolated subnet. This prevents high-bandwidth data traffic from interfering with management and protects the BMC from unauthorized access via the data network.
Credential Security: NVIDIA-certified systems ship with a default factory password. Changing this immediately is the most basic but essential step in securing the control plane.
The NCP-AII Context: The exam validates your ability to perform initial hardware provisioning. This includes the “Standard Operating Procedure” for server hardening before any AI frameworks are installed.
Incorrect Options: A. Use nvidia-smi to reset BMC and map TPM to Slurm nvidia-smi is a tool for managing GPUs and the NVIDIA driver; it does not have the permissions or the architecture to reset BMC hardware credentials. Furthermore, TPM keys are used for system-level attestation and disk encryption, not for “mapping to the Slurm scheduler” to encrypt job execution. Slurm job security is handled through MUNGE and user-space authentication.
B. Flash TPM with DOCA and bridge the BMC The TPM is a secure cryptoprocessor, not a storage device; you cannot “flash it with a DOCA image.” DOCA runs on the BlueField DPU or the host OS. Additionally, “bridging the BMC with the primary Ethernet port” is a security risk (shared-NIC mode), as it exposes the management controller to the production data network. Disabling OOB management entirely would make it impossible to perform remote power cycles or monitor hardware health.
D. Disable TPM and place OOB on the data fabric VLAN Disabling the TPM reduces the security posture of the node and is against NVIDIA-certified deployment guidelines. Placing the OOB port on the same VLAN as the data fabric is a major configuration error; the data fabric (InfiniBand or 400G Ethernet) is for high-performance tensor movement and should never be cluttered or compromised by management traffic.
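The OOB isolation requirement in Option C can be spot-checked mechanically. This minimal sketch (the subnet values are illustrative assumptions, not from any real site plan) verifies that a BMC address sits inside a dedicated management subnet and outside the data network:

```python
import ipaddress

# Illustrative subnets: a dedicated management network for BMC/OOB traffic
# and a separate data network. Real values come from your site plan.
MGMT_NET = ipaddress.ip_network("10.10.0.0/24")
DATA_NET = ipaddress.ip_network("10.20.0.0/24")

def oob_is_isolated(bmc_ip: str) -> bool:
    """True if the BMC address lives on the management subnet only."""
    ip = ipaddress.ip_address(bmc_ip)
    return ip in MGMT_NET and ip not in DATA_NET

print(oob_is_isolated("10.10.0.42"))  # BMC correctly on the management subnet
print(oob_is_isolated("10.20.0.42"))  # misconfigured: BMC on the data network
```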
Question 48 of 60
48. Question
During a performance optimization phase for an AI cluster using AMD EPYC servers, an administrator notices that the GPU-to-GPU communication across different CPU sockets is slower than expected. Which optimization technique should be applied at the BIOS or OS level to improve the throughput of the PCIe subsystem for these nodes?
Correct: B Ensure ‘NPS’ (Nodes Per Socket) is configured to optimize NUMA topology for GPU affinity.
The Technical Reason: AMD EPYC processors use a Multi-Chip Module (MCM) design where the processor is divided into multiple quadrants.
NPS (Nodes Per Socket): This BIOS setting determines how many NUMA nodes the system reports per physical CPU socket. For AI workloads, setting NPS4 (on 8-channel memory systems) or NPS1 (depending on the specific topology) ensures that the PCIe lanes connected to a specific GPU are logically tied to the CPU cores and memory channels physically closest to them.
The “Hop” Penalty: If NPS is misconfigured, data traveling from a GPU connected to one quadrant might have to traverse the internal AMD Infinity Fabric to reach memory or another GPU managed by a different quadrant. This cross-quadrant traffic introduces latency and reduces throughput.
GPU Affinity: By aligning the NUMA topology with the physical hardware layout, the OS and NVIDIA drivers can ensure that “East-West” traffic stays on the local PCIe root complex as much as possible.
The NCP-AII Context: The exam validates your ability to “Execute performance optimization for AMD and Intel servers.” For AMD systems, managing NUMA boundaries via NPS settings is the standard “Best Practice” for NVIDIA-certified systems to achieve peak NCCL (NVIDIA Collective Communications Library) performance.
Incorrect Options: A. Set the storage array to use RAID 0 While RAID 0 can increase raw storage throughput by striping data across multiple disks, it has no impact on the PCIe subsystem efficiency or the GPU-to-GPU communication speeds inside the server. This is a storage-layer optimization, not a CPU-socket or interconnect optimization.
C. Enable ‘IOMMU’ and ‘AER’ for better error reporting IOMMU (Input-Output Memory Management Unit) and AER (Advanced Error Reporting) are important for virtualization and stability. However, enabling IOMMU can actually decrease performance slightly due to the overhead of address translation. It is a security and diagnostic feature, not a throughput optimization technique for multi-socket GPU communication.
D. Disable Hyper-Threading (SMT) While disabling Simultaneous Multi-Threading (SMT) can sometimes reduce “jitter” in certain high-frequency trading or HPC workloads, it does not address the fundamental issue of cross-socket PCIe latency. Modern AI frameworks are highly parallelized and typically benefit from having more logical cores available for data preprocessing and management.
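On Linux, the NUMA node of a PCIe device is exposed under /sys/bus/pci/devices/&lt;BDF&gt;/numa_node, which is how affinity alignment is usually verified. The sketch below uses a hardcoded illustrative GPU-to-NUMA mapping (assumed values, not from a real system) so it runs anywhere, and emits the kind of numactl binding that keeps a training process local to its GPU's node:

```python
# Toy GPU-to-NUMA affinity helper. On a real node the mapping would be read
# from /sys/bus/pci/devices/<BDF>/numa_node; here it is a hardcoded sample
# so the sketch is self-contained. PCI addresses and nodes are illustrative.
SAMPLE_GPU_NUMA = {
    "0000:01:00.0": 0,  # GPU attached to the first socket's root complex
    "0000:41:00.0": 1,  # GPU attached to the second socket's root complex
}

def pin_hint(bdf: str, gpu_numa: dict) -> str:
    """Suggest a numactl binding that keeps GPU traffic on its local node."""
    node = gpu_numa[bdf]
    return f"numactl --cpunodebind={node} --membind={node} <training_cmd>"

print(pin_hint("0000:41:00.0", SAMPLE_GPU_NUMA))
```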
Question 49 of 60
49. Question
An AI cluster is utilizing a high-performance parallel filesystem for training data. However, the throughput observed at the GPUs is only 50 percent of the storage's rated capacity. The network is verified to be healthy. Which optimization technique should be explored to reduce the CPU bottleneck and increase the data ingestion rate directly into the GPU memory?
Correct: D Enable and configure GPUDirect Storage (GDS) to allow the NIC to DMA data directly from the storage into the GPU memory, bypassing the host bounce buffers and CPU intervention.
The Technical Reason: In a standard I/O path, data must be copied from the storage/network into the CPU’s system memory (RAM) before being copied a second time to the GPU’s memory.
The Bottleneck: This “double-buffering” consumes CPU cycles and memory bandwidth, which often becomes the limiting factor in high-speed AI training (the “CPU Bottleneck”).
The GDS Solution: NVIDIA GPUDirect Storage (GDS) creates a direct Direct Memory Access (DMA) path between the storage (local NVMe or remote NVMe-oF/Parallel FS) and the GPU’s High Bandwidth Memory (HBM).
The Result: By bypassing the CPU and system RAM, GDS significantly reduces latency, lowers CPU utilization by up to 3x, and increases total aggregate throughput, allowing the GPUs to ingest data at the storage’s full rated capacity.
The NCP-AII Context: The exam validates your ability to deploy the NVIDIA Magnum IO stack. GDS is a pillar of this stack, particularly for “AI Factories” where training data is fed from high-performance storage like Lustre, Weka, or IBM Storage Scale.
Incorrect Options: A. Switch from RDMA to standard TCP/IP This would be a massive performance regression. RDMA (Remote Direct Memory Access) is the foundation of high-performance AI networking because it allows data transfer without involving the OS kernel or CPU. Switching to TCP/IP would increase CPU overhead and latency significantly, further starving the GPUs of data.
B. Enable Transparent Huge Pages (THP) While Transparent Huge Pages can help optimize memory management for some database workloads, it does not address the fundamental I/O bottleneck of moving data from the network/disk to the GPU. In fact, for many HPC and AI workloads, NVIDIA actually recommends disabling THP to prevent unpredictable latency spikes during memory allocation.
C. Install more system RAM While more RAM allows for a larger Linux page cache, the GPU still has to pull that data through the PCIe bus via the CPU. If the CPU is already the bottleneck (as stated in the prompt), adding more RAM does not solve the underlying processing overhead required to move data from RAM to GPU. GDS (Option D) is specifically designed to eliminate this middle-man entirely.
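As noted in the head of this test, the GDS data path depends on the nvidia-fs kernel module (loaded as nvidia_fs) being present. A quick way to verify that is to scan lsmod output; this minimal sketch uses a hardcoded illustrative lsmod sample (the module sizes are made up) so it runs without real hardware:

```python
# Minimal check that the GDS kernel driver is loaded, by scanning lsmod-style
# output for the nvidia_fs module. The sample text is illustrative; on a real
# node you would feed in the actual output of the lsmod command instead.
SAMPLE_LSMOD = """\
Module                  Size  Used by
nvidia_fs             323584  0
nvme                   49152  4
"""

def module_loaded(lsmod_text: str, name: str) -> bool:
    """True if a kernel module of the given name appears in lsmod output."""
    return any(line.split()[:1] == [name] for line in lsmod_text.splitlines())

print(module_loaded(SAMPLE_LSMOD, "nvidia_fs"))  # GDS driver present in sample
```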
Question 50 of 60
50. Question
A data center technician is connecting multiple NVIDIA DGX nodes to a leaf-and-spine InfiniBand fabric. To ensure optimal signal quality and avoid link errors, the technician must validate the cables and transceivers. What is the most effective way to identify a physical layer fault when a 400Gbps link fails to negotiate correctly during the bring-up phase?
Correct: D Inspect the optical fiber end-faces for contamination using a digital scope and check the transceiver RX/TX power levels in the BMC or OS metrics.
The Technical Reason: In a 400Gbps (NDR) environment using OSFP or QSFP-DD form factors, the signal margins are extremely tight.
Contamination: A single speck of dust on the fiber end-face can cause signal attenuation or back-reflection, leading to high Bit Error Rates (BER) or a complete failure to link. Using a digital inspection scope is the industry-standard first step.
Power Levels: By checking the DOM (Digital Optical Monitoring) data, whether through the switch CLI (show interfaces transceiver), the server’s BMC, or the OS via mlxlink, a technician can see if the light levels (RX/TX power) are within the operational window. A low RX power level typically confirms a physical path issue (bad cable, dirty connector, or failing transceiver).
The NCP-AII Context: The exam validates your ability to “Describe and validate cable types and transceivers.” For professional-level certification, you are expected to know that software-level resets (Option C) are ineffective if the physical “layer 0” is compromised.
Incorrect: A. Swap with SFP28 equivalents SFP28 is a 25Gbps form factor. It is physically incompatible with the OSFP/QSFP-DD ports used for 400Gbps NDR InfiniBand. Furthermore, high-speed signaling is limited by the physics of the link and the transceiver’s capability, not by the “HGX firmware power limits.”
B. Increase MTU size on the BMC port The BMC management port is an Out-of-Band (OOB) interface, typically 1GbE. Changing its MTU (Maximum Transmission Unit) has zero effect on the high-speed data fabric. MTU is a Layer 2/3 configuration and does not influence the Layer 1 physical clock speed negotiation.
C. Reinstall the NVIDIA GPU drivers While drivers are necessary for the OS to use the network, the initial link negotiation (turning the “link light” green) happens at the hardware/firmware level between the NIC/DPU and the Switch. If a physical link won’t establish, reinstalling the GPU driver is a “red herring” that does not address the underlying hardware connectivity issue.
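The DOM power check described in Option D reduces to comparing a reading against the transceiver's operational window. This toy sketch flags a suspect RX reading; the -8.0/+4.0 dBm bounds are assumed for illustration only, since the real limits come from the specific transceiver's datasheet:

```python
# Toy DOM (Digital Optical Monitoring) sanity check: flag a transceiver whose
# RX power has fallen out of an assumed operational window. The bounds below
# are illustrative placeholders, not values from any real optic's datasheet.
RX_MIN_DBM, RX_MAX_DBM = -8.0, 4.0

def rx_power_ok(rx_dbm: float) -> bool:
    """True if the received optical power sits inside the assumed window."""
    return RX_MIN_DBM <= rx_dbm <= RX_MAX_DBM

print(rx_power_ok(-2.5))   # healthy link within the window
print(rx_power_ok(-14.0))  # likely dirty connector, bad cable, or dying optic
```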
Incorrect
Correct: D Inspect the optical fiber end-faces for contamination using a digital scope and check the transceiver RX/TX power levels in the BMC or OS metrics. The Technical Reason: In a 400Gbps (NDR) environment using OSFP or QSFP-DD form factors, the signal margins are extremely tight. ? Contamination: A single speck of dust on the fiber end-face can cause signal attenuation or back-reflection, leading to high Bit Error Rates (BER) or a complete failure to link. Using a digital inspection scope is the industry-standard first step. ? Power Levels: By checking the DOM (Digital Optical Monitoring) dataeither through the switch CLI (show interfaces transceiver), the server‘s BMC, or the OS via mlxlinka technician can see if the light levels (RX/TX power) are within the operational window. A low RX power level typically confirms a physical path issue (bad cable, dirty connector, or failing transceiver). The NCP-AII Context: The exam validates your ability to “Describe and validate cable types and transceivers.“ For professional-level certification, you are expected to know that software-level resets (Option C) are ineffective if the physical “layer 0“ is compromised.
Incorrect: A. Swap with SFP28 equivalents SFP28 is a 25Gbps form factor. It is physically incompatible with the OSFP/QSFP-DD ports used for 400Gbps NDR InfiniBand. Furthermore, high-speed signaling is limited by the physical physics of the link and the transceiver‘s capability, not by the “HGX firmware power limits.“
B. Increase MTU size on the BMC port The BMC management port is an Out-of-Band (OOB) interface, typically 1GbE. Changing its MTU (Maximum Transmission Unit) has zero effect on the high-speed data fabric. MTU is a Layer 2/3 configuration and does not influence the Layer 1 physical clock speed negotiation.
C. Reinstall the NVIDIA GPU drivers While drivers are necessary for the OS to use the network, the initial link negotiation (turning the “link light“ green) happens at the hardware/firmware level between the NIC/DPU and the Switch. If a physical link won‘t establish, reinstalling the GPU driver is a “red herring“ that does not address the underlying hardware connectivity issue.
Question 51 of 60
51. Question
A system administrator needs to update the NVIDIA GPU drivers on a production cluster managed by Base Command Manager. What is the most efficient and recommended method to ensure that all compute nodes receive the update consistently while minimizing the risk of version mismatch across the cluster?
Correct
Correct: C Update the software image (category) on the BCM head node and then use the BCM imaging tools to push the update to the compute nodes.
The Technical Reason: NVIDIA Base Command Manager (BCM) manages compute nodes using Software Images ("blueprints" of the OS and drivers).
Centralized Management: Instead of touching nodes individually, an administrator modifies the image stored on the head node (often using chroot or apt-get within the image directory, e.g., /cm/images/default-image).
Consistency: Once the image is updated, BCM provides tools like imageupdate or a scheduled reboot to synchronize these changes across all nodes assigned to that Category. This ensures every node has the exact same driver version, down to the build number.
Efficiency: This method allows a single administrator to update hundreds of nodes simultaneously, drastically reducing the maintenance window and eliminating human error.
The NCP-AII Context: The exam validates your proficiency in the BCM lifecycle. You are expected to know that BCM uses Categories to group nodes with similar hardware and that the "Category-Image" relationship is the foundation of cluster-wide consistency.
Incorrect Options: A. Wait for users to complain In a production AI Factory, proactive maintenance is mandatory. Waiting for failures in a distributed training job (which could cost thousands of dollars in wasted compute time) is an "unprofessional" approach. Performance degradation or kernel panics caused by mismatched drivers must be avoided through scheduled, cluster-wide updates.
B. Use the NGC CLI via the OOB network The NGC CLI is designed to pull containers, models, and datasets from the NVIDIA GPU Cloud. It is not a configuration management or provisioning tool for the host operating system or kernel drivers. Furthermore, the OOB (Out-of-Band) network is typically reserved for low-level management (BMC/IPMI) and does not have the bandwidth or architecture for pushing system-wide driver binaries.
D. Log into each of the 100 nodes individually This is the most inefficient method and is highly discouraged in the NCP-AII curriculum. Manual installation on a node-by-node basis (known as "snowflake nodes") leads to version drift, where slight differences in environment variables or library paths across the 100 nodes cause inconsistent performance and difficulty in troubleshooting.
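The "version drift" failure mode above is easy to check for mechanically. A minimal sketch, assuming a hypothetical mapping of node names to the driver versions they report (e.g., collected via cluster-wide command execution); the node names and version strings are illustrative:

```python
# Detect driver "version drift": every node provisioned from the same BCM
# software image should report the identical driver version. The node->version
# data here is hypothetical sample input, not live cluster output.

def find_drift(node_versions, expected):
    """Return the sorted list of nodes whose driver version differs from expected."""
    return sorted(n for n, v in node_versions.items() if v != expected)

nodes = {
    "node001": "550.54.15",
    "node002": "550.54.15",
    "node003": "535.161.08",  # snowflake node: manually updated out of band
}
print(find_drift(nodes, "550.54.15"))  # -> ['node003']
```

Any node listed would be a candidate for re-imaging from the category's software image rather than a manual fix.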
Question 52 of 60
52. Question
An administrator notices that some nodes in the cluster are performing significantly slower during HPL tests. Upon investigation, they find that the 'nvidia-smi -q' command shows the 'Clocks Throttle Reason' as 'SW Thermal Slowdown'. What is the most appropriate troubleshooting action?
Correct
Correct: B. Inspect the physical airflow path for obstructions, verify that the server's fans are operating at the correct RPM, and check for any failed cooling components in the BMC logs.
This is the correct troubleshooting action because the "SW Thermal Slowdown" throttle reason indicates the GPU is reducing its clock speed due to thermal constraints. According to hardware diagnostic documentation, this is typically caused by physical cooling failures including:
Fan failures or incorrect RPM operation
Airflow obstruction from dust accumulation or improper cabling
Failed cooling components that can be identified through BMC monitoring
The systematic approach of inspecting physical airflow, verifying fan operation, and checking BMC logs aligns with proper hardware troubleshooting methodology in the NCP-AII certification's Troubleshoot and Optimize domain.
Incorrect: A. Modify the Slurm configuration to automatically reboot any node that reports a thermal slowdown to allow the GPU to cool down during the POST process.
This is incorrect because Slurm is a workload manager for job scheduling and has no role in GPU thermal management. Rebooting nodes experiencing thermal throttling is a temporary, reactive measure that does not address the physical cooling failure. Proper thermal management requires identifying and fixing the root cause of inadequate cooling, not automated reboots.
C. Replace the BlueField-3 DPU, as it is likely failing to route the cooling commands from the Base Command Manager to the GPU's firmware.
This is incorrect because the BlueField-3 DPU is not involved in GPU thermal management. The DPU handles networking, storage offload, and security functions. Cooling commands and fan control are managed by the Baseboard Management Controller (BMC) and system thermal firmware, not the DPU.
D. Use the NGC CLI to update the PyTorch container to a version that uses less GPU memory, thereby reducing the heat generated by the H100 cores.
This is incorrect because changing container versions does not address the underlying thermal issue. NGC CLI is used for downloading containers and managing NGC resources. While lower memory usage might slightly reduce heat, it is a workaround that accepts throttling rather than fixing the root cause, and would not resolve an "SW Thermal Slowdown" condition caused by actual cooling failures.
Question 53 of 60
53. Question
A cluster administrator is configuring a Slurm-based environment on an NVIDIA AI factory. To enable users to launch GPU-accelerated containers seamlessly, the administrator decides to use the Enroot and Pyxis plugins. Which statement correctly describes the role of these tools in the control plane configuration?
Correct
Correct: B Enroot acts as the container runtime that turns Docker images into unprivileged sandboxes, while Pyxis integrates Enroot with the Slurm scheduler.
The Technical Reason:
Enroot: Unlike Docker, which requires a root-level daemon, Enroot is a "chroot-based" runtime. It takes standard Docker/OCI container images (usually from NVIDIA NGC) and converts them into a simple directory structure or a squashfs file. This allows users to run containers as unprivileged users, which is a critical security requirement in multi-tenant AI clusters.
Pyxis: This is a Slurm SPANK (Slurm Plug-in Architecture for Node and job Kontrol) plugin. It allows users to use standard Slurm commands (e.g., srun --container-image=...) to launch those Enroot containers seamlessly across the cluster without needing to manually pull or manage images on every individual compute node.
The NCP-AII Context: The exam validates your ability to configure a scalable software stack. The Enroot + Pyxis workflow is preferred because it handles high-performance InfiniBand (RDMA) and GPU device mapping natively and securely, whereas standard Docker often struggles with the permissions required for "bare-metal" hardware access in an HPC environment.
Incorrect Options: A. Managing NVLink and MIG profiles This is a misidentification of the tools. NVLink Switch topology is managed by the NVIDIA Fabric Manager, and MIG (Multi-Instance GPU) profiles are configured using nvidia-smi or the NVIDIA Device Plugin for Kubernetes. Neither Enroot nor Pyxis has the capability to modify hardware-level GPU partitions or switch topologies.
C. Encrypting NGC CLI communication While security is a benefit of Enroot, these tools are not used to "encrypt communication" between the NGC CLI and a local Docker daemon. The NGC CLI uses standard HTTPS/TLS for secure downloads. Enroot's role begins after an image is downloaded, by preparing it for execution on the compute nodes.
D. Flashing GPU firmware and DPU power management Pyxis is a scheduler plugin, not a firmware utility. GPU firmware is updated via NVSM or mlxfwmanager. Power distribution to BlueField-3 DPUs is managed by the server's BMC (Baseboard Management Controller) and the physical power supply units. Neither Enroot nor Pyxis interacts with the physical power rails or the low-level firmware flashing process.
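To make the srun workflow concrete, here is a small sketch that assembles a Pyxis-style command line. Only the --container-image flag pattern comes from the explanation above; the image tag, resource counts, and helper name are illustrative assumptions:

```python
# Assemble an argv list for a Pyxis-style container launch under Slurm.
# pyxis_srun is a hypothetical helper; the image tag and node/GPU counts
# are illustrative, not a prescribed configuration.

def pyxis_srun(image, command, nodes=1, gpus_per_node=8):
    return [
        "srun",
        f"--nodes={nodes}",
        f"--gpus-per-node={gpus_per_node}",
        f"--container-image={image}",  # Pyxis flag: pulls/runs via Enroot
    ] + command

argv = pyxis_srun("nvcr.io/nvidia/pytorch:24.05-py3", ["python", "train.py"])
print(" ".join(argv))
```

The point of the pattern is that users never touch Enroot directly: the scheduler flag carries the image reference, and Pyxis handles import and launch on each allocated node.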
Question 54 of 60
54. Question
A cluster administrator is using NVIDIA Base Command Manager (BCM) to provision a large AI factory. The requirement is to ensure high availability (HA) for the management control plane. Which configuration step is essential for establishing a reliable HA head node pair within the BCM environment?
Correct
Correct: A Configure a dedicated heartbeat network between the primary and secondary head nodes and ensure that the cluster database is synchronized in real-time.
The Technical Reason: BCM HA relies on a secondary head node that acts as a "hot standby."
Heartbeat Network: This is a dedicated physical or logical link used by the cmdaemon services to monitor the health of the peer node. If the primary node stops sending heartbeats, the secondary node initiates a failover.
Database Synchronization: The "brain" of the cluster is the SQL database containing node configurations, job history, and monitoring data. BCM uses real-time replication to ensure the secondary node's database is an exact mirror of the primary's.
Virtual IP (VIP): A shared IP address is configured that "floats" between the nodes, ensuring compute nodes and administrators always connect to the currently active master.
The NCP-AII Context: The exam validates your ability to run the cmha-setup wizard. This tool automates the partitioning of the heartbeat network and the initial "recloning" of the database to the secondary node.
Incorrect Options: B. Use the NGC CLI to download an HA-License The NGC CLI is used for fetching AI containers, models, and datasets from the NVIDIA GPU Cloud. It is not used for hardware licensing or cluster failover logic. Furthermore, compute nodes in a BCM environment do not "elect a leader"; leadership is managed strictly between the two configured head nodes via the cmdaemon logic.
C. Install a different operating system This is a major configuration error. In an NVIDIA-certified cluster, both head nodes must be identical in their software stack, including the OS version, kernel, and BCM version, to ensure that failover is seamless. Using different operating systems would lead to library mismatches, path errors, and likely a failure of the database replication service.
D. Disable the OOB management interface The Out-of-Band (OOB) management (the BMC/IPMI port) is critical for HA. In a failover event, the active head node may need to use the OOB interface of the failing node to perform a "STONITH" (Shoot The Other Node In The Head) or power-cycle to prevent a "split-brain" scenario. Additionally, each BMC must have its own unique IP address; they should never conflict if configured correctly.
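The heartbeat logic described above reduces to a timeout decision. A toy sketch, not BCM's internals: the 2-second interval and the three-missed-beats grace factor are assumed values for illustration:

```python
# Toy failover decision for an HA head-node pair: the standby promotes itself
# only after the peer's heartbeat has been silent for a full grace period.
# Interval and grace factor are illustrative assumptions, not BCM defaults.

HEARTBEAT_INTERVAL_S = 2.0
GRACE_FACTOR = 3  # assumed: declare the peer dead after 3 missed intervals

def should_failover(now, last_heartbeat):
    """True when the peer has been silent longer than the grace period."""
    return (now - last_heartbeat) > HEARTBEAT_INTERVAL_S * GRACE_FACTOR

print(should_failover(now=100.0, last_heartbeat=99.0))   # -> False
print(should_failover(now=100.0, last_heartbeat=90.0))   # -> True
```

A real implementation would pair this decision with the STONITH step from option D's explanation, fencing the peer before taking over the virtual IP to avoid split-brain.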
Question 55 of 60
55. Question
An engineer is conducting a single-node stress test and an HPL (High-Performance Linpack) benchmark on a new AI server. During the test, the HPL performance (GFLOPS) starts high but gradually drops by 30% over a 15-minute period. Which hardware-related issue is most likely causing this performance degradation, and how can it be verified using the cluster tools?
Correct
Correct: D. Thermal throttling of the GPUs; it can be verified by checking the ‘Clocks Throttle Reasons‘ in the output of ‘nvidia-smi -q‘.
This is correct because a gradual performance drop of 30% over time during sustained HPL testing is a classic symptom of thermal throttling. As the GPUs heat up under sustained load, they reduce clock speeds to stay within temperature limits, causing performance degradation.
The NCP-AII certification blueprint explicitly includes “Execute HPL (High-Performance Linpack)“ and “Perform HPL burn-in“ as core tasks within the Cluster Test and Verification domain, with the purpose of validating thermal stability under sustained maximum load.
Thermal throttling can be directly verified using nvidia-smi -q to examine the “Clocks Throttle Reasons“ section, which will show “SW Thermal Slowdown“ as “Active“ when thermal constraints are causing clock reduction.
Dell support documentation describes this exact scenario: when users experience poor GPU performance, nvidia-smi -q output can indicate that thermal or power-brake slowdowns are active, and updating system firmware can resolve the issue.
When thermal throttling appears during validation, the immediate corrective action is to increase airflow and verify the cooling configuration.
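The verification step above can be automated: parse the “Clocks Throttle Reasons“ section of nvidia-smi -q output and report which reasons are marked Active. The sample text below is illustrative, not captured from real hardware.

```python
# Parse a "Clocks Throttle Reasons" section and return the Active reasons.
# SAMPLE is a hand-written example of the nvidia-smi -q output format.

SAMPLE = """\
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
        SW Thermal Slowdown               : Active
        HW Thermal Slowdown               : Not Active
"""

def active_throttle_reasons(text):
    """Return the throttle reasons whose state is exactly 'Active'."""
    reasons = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the section header
        name, _, state = line.partition(":")
        if state.strip() == "Active":
            reasons.append(name.strip())
    return reasons

print(active_throttle_reasons(SAMPLE))  # ['SW Thermal Slowdown']
```

In practice the input would come from running `nvidia-smi -q` and capturing its stdout; an "SW Thermal Slowdown : Active" entry during the HPL run confirms the diagnosis.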
Incorrect: A. The SSD storage is becoming fragmented; it can be verified by running a ‘defrag‘ command on the Linux filesystem.
This is incorrect because HPL (High-Performance Linpack) is a compute-intensive benchmark that primarily stresses GPUs, CPUs, and memory, not storage I/O. Storage fragmentation would not cause a 30% gradual performance drop in HPL over 15 minutes. Linux filesystems (especially those using ext4/XFS) do not require defragmentation like traditional Windows filesystems, and there is no standard “defrag“ command for Linux in this context.
B. The BIOS is in ‘Power Save‘ mode; it can be verified by checking the serial number of the motherboard in the BMC log.
This is incorrect because BIOS power settings cannot be verified by checking motherboard serial numbers in BMC logs. Serial numbers are for hardware identification, not power configuration validation. While incorrect BIOS power settings could impact performance, the symptom of gradual degradation over time points to thermal throttling, not static BIOS configuration. The verification method described is invalid.
C. The Slurm scheduler is losing its connection to the head node; it can be verified by checking the ping latency between the node and the switch.
This is incorrect because Slurm is a workload manager for job scheduling, not a runtime component that affects HPL performance once a job is running. If Slurm lost connection to the head node, the job would fail or be terminated, not gradually degrade in performance by 30%. Network latency between node and switch would affect distributed jobs across nodes, but this is a single-node HPL test.
Question 56 of 60
56. Question
A researcher needs to partition an NVIDIA A100 GPU using Multi-Instance GPU (MIG) technology to support seven distinct users, each requiring isolated compute and memory resources. Which configuration step and characteristic are essential for ensuring that these users do not interfere with each other‘s performance while running concurrent AI inference tasks?
Correct
Correct: B. The administrator must enable MIG mode using ‘nvidia-smi -i 0 -mig 1‘ and then create GPU instances based on the 1g.5gb profile to provide hardware-level isolation.
This is correct because the NCP-AII certification blueprint explicitly includes “MIG (Multi-Instance GPU) enablement and management“ as a core topic within the Physical Layer Management domain.
To support seven distinct users with isolated compute and memory resources on an A100 GPU, the 1g.5gb profile must be used, as it creates seven independent instances, each with dedicated memory and Streaming Multiprocessors (SMs).
The profile name “1g.5gb“ indicates that each GPU instance receives one GPU compute slice (one-seventh of the SMs) and 5GB of memory, partitioning the GPU into seven hardware-isolated instances.
The command sequence in option B follows the required steps:
First, enable MIG mode on the GPU using sudo nvidia-smi -i 0 -mig 1.
Then, create GPU instances based on the 1g.5gb profile to achieve the seven-instance configuration.
These instances provide hardware-level isolation with dedicated on-chip protections, ensuring users do not interfere with each other‘s performance.
MIG enables concurrent execution of up to seven different workloads on a single A100 GPU, each with guaranteed quality of service (QoS).
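The naming convention explained above can be captured in a small helper: a profile name like “1g.5gb“ decomposes into the number of compute slices and the dedicated memory. This is a sketch for reasoning about profiles, not an NVIDIA API.

```python
# Decode a MIG profile name into (compute_slices, memory_gb) and compute
# how many instances of that profile fit on an A100's 7 compute slices.

def parse_mig_profile(name):
    """Split a MIG profile name like '1g.5gb' into (compute_slices, memory_gb)."""
    compute, memory = name.split(".")
    return int(compute.rstrip("g")), int(memory.rstrip("gb"))

def max_instances(profile, total_slices=7):
    """Maximum concurrent instances of a profile, bounded by compute slices."""
    slices, _ = parse_mig_profile(profile)
    return total_slices // slices

print(parse_mig_profile("1g.5gb"))  # (1, 5)
print(max_instances("1g.5gb"))      # 7
print(max_instances("3g.20gb"))     # 2
```

The 1g.5gb profile, with the smallest slice count, is the only one that yields the full seven isolated instances the scenario requires.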
Incorrect: A. MIG instances must be configured to allow ‘memory oversubscription‘ so that any single user can use the full 80GB of memory if other users are idle.
This is incorrect because MIG instances are hardware-partitioned with dedicated memory allocations that cannot be oversubscribed. Each instance receives a fixed portion of GPU memory (e.g., 5GB for 1g.5gb) that is physically isolated from other instances. Memory oversubscription would violate the hardware isolation principle and is not supported in MIG.
C. The users must share the same CUDA context and use software-level time-slicing to rotate their workloads on the GPU every 10 milliseconds.
This is incorrect because MIG provides hardware-level isolation, which is fundamentally different from software time-slicing. Time-slicing shares GPU resources sequentially and does not provide the guaranteed QoS or fault isolation that MIG offers. MIG enables true parallel execution with dedicated resources for each instance.
D. The administrator should disable the NVIDIA drivers and use the Linux ‘cgroups‘ utility to manually limit the amount of VRAM visible to each user process.
This is incorrect because cgroups cannot partition GPU memory or SMs with hardware-level isolation. MIG requires NVIDIA drivers to be active and is configured through NVIDIA tools like nvidia-smi. Disabling drivers would make the GPU inaccessible entirely. MIG provides the necessary hardware partitioning that software-only solutions like cgroups cannot achieve.
Question 57 of 60
57. Question
A system administrator is deploying NVIDIA Base Command Manager (BCM) to manage a new AI cluster. After the initial installation, the administrator needs to configure High Availability (HA) for the head node to prevent a single point of failure. Which component is primarily responsible for maintaining the synchronized state of the cluster configuration across the primary and secondary head nodes?
Correct
Correct: A The BCM shared data store and synchronization service
The Technical Reason: NVIDIA Base Command Manager (BCM) (formerly Bright Cluster Manager) achieves HA by using a secondary head node that acts as a “hot standby.“ The critical synchronization of the cluster‘s “brain“ is handled as follows:
Shared Data Store: BCM relies on a synchronized SQL database (MariaDB/MySQL) that stores all cluster metadata, node configurations, monitoring data, and category settings.
Synchronization Service (cmdaemon): The BCM management daemon (cmdaemon) on the primary head node continuously replicates its state and database to the secondary head node. If the primary fails, the secondary has an identical copy of the cluster state and can take over management immediately.
Data Directories: Key directories like /cm/shared (containing software images and scripts) and /home are typically hosted on shared NFS storage or synchronized via BCM‘s internal tools to ensure both head nodes see the same filesystem.
The NCP-AII Context: The exam validates your ability to run the cmha-setup utility. This wizard automates the setup of the heartbeat network, the cloning of the head node image, and the initial database replication.
Incorrect Options: B. The Slurm workload manager daemon While Slurm is the standard scheduler in an NVIDIA AI infrastructure, its primary role is managing job queues and resource allocation across compute nodes. While Slurm itself can be configured for HA (using a SlurmctldHost and BackupController), it is a consumer of the infrastructure provided by BCM. It does not manage the underlying BCM cluster configuration or head node synchronization.
C. The NVIDIA Container Toolkit The NVIDIA Container Toolkit allows users to build and run GPU-accelerated containers. It is installed on the compute nodes (and head nodes) to enable Docker, Enroot, or Podman to access GPU hardware. It has no role in managing the High Availability state or metadata of the BCM management control plane.
D. The DOCA Telemetry Service NVIDIA DOCA is the software framework for BlueField DPUs. The Telemetry Service is used to collect and stream performance data (like network traffic or power usage) from the DPU to monitoring tools. It is an observability component, not a configuration management or HA synchronization service for the head nodes.
Question 58 of 60
58. Question
An engineer is using the High-Performance Linpack (HPL) benchmark to validate a single node‘s compute performance. If the HPL results are inconsistent across multiple runs, what is the first hardware-related parameter that should be monitored via the NVIDIA SMI tool during the test?
Correct
Correct: B GPU temperature and power draw to check for thermal throttling or power limit capping that could be causing performance fluctuations.
The Technical Reason: HPL is an extremely compute-intensive workload that maximizes the utilization of Tensor Cores and FP64 units.
Thermal Throttling: If the server‘s cooling system (fans or airflow) is inadequate, the GPU will hit its Thermal Slowdown threshold (typically around 85°C to 90°C for data center GPUs like the H100). The firmware will then lower the clock speeds to reduce heat, leading to inconsistent HPL scores.
Power Capping: Similarly, if the GPU hits its Power Limit (e.g., 700W for an H100 SXM5), the “Power Brake“ or “Power Capping“ mechanism will engage, fluctuating the core clocks to stay within the power envelope.
NVIDIA-SMI Monitoring: You can observe these states in real time using nvidia-smi -q -d PERFORMANCE, which lists the “Clocks Throttle Reasons“ such as Thermal, Power Brake, or SW Power Cap.
The NCP-AII Context: The exam expects you to use NVIDIA-SMI (System Management Interface) as the first line of defense for performance validation. Identifying “Clocks Throttle Reasons“ is a core competency for an infrastructure professional.
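The power-capping check can be sketched as simple logic over sampled readings: if power draw repeatedly sits at the limit during the run, capping rather than thermal throttling is the likely cause. The readings below are illustrative, not measured.

```python
# Flag whether sampled GPU power draw is pressing against the power limit.
# Real readings could come from:
#   nvidia-smi --query-gpu=power.draw --format=csv,noheader

def hitting_power_cap(samples_w, limit_w, tolerance_w=5.0):
    """True if any sampled power draw is within tolerance of the limit."""
    return any(limit_w - s <= tolerance_w for s in samples_w)

# Illustrative samples during an HPL run on a 700W-limited GPU
readings = [688.0, 697.5, 699.8, 652.3]
print(hitting_power_cap(readings, limit_w=700.0))  # True
```

If this check is negative but clocks still fluctuate, the next suspect is the thermal slowdown threshold, which the same nvidia-smi query exposes.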
Incorrect: A. Firmware of the local SATA boot drive While drive firmware is part of general system maintenance, the SATA boot drive has no impact on the execution speed of HPL once the benchmark is loaded into GPU memory. There is no mechanism by which a boot drive‘s firmware version would cause GFLOPS fluctuations in a compute-bound benchmark.
C. Light levels of OOB management transceivers The Out-of-Band (OOB) management network is used for remote access (BMC/IPMI) and health monitoring. While a loss of OOB connectivity is a management issue, it does not physically interfere with the GPU‘s internal compute performance or the execution of a local HPL benchmark.
D. SSH sessions to the BlueField-3 DPU The BlueField-3 DPU (Data Processing Unit) manages network and storage offloads. While excessive traffic to the DPU‘s ARM cores could affect network latency, the question specifies a single-node compute performance test (HPL). HPL performance is bound by GPU-to-Memory bandwidth and core clocks, not by the management traffic on the DPU‘s ARM subsystem.
Question 59 of 60
59. Question
When configuring MIG profiles on an NVIDIA A100 GPU to support a high-concurrency inference application, a developer notices that some profiles allow for more Compute Instances (CI) than others for the same amount of memory. To maximize the number of independent clients served while ensuring each has at least 10GB of memory, which MIG configuration strategy is most appropriate?
Correct
Correct: A Use the '1g.10gb' profile to create up to seven instances, as this provides the smallest possible compute slice while meeting the minimum memory requirement per client.
The Technical Reason: The naming convention for MIG profiles follows the pattern <slices>g.<memory>gb, where the first number is the count of GPU compute slices (groups of SMs) and the second is the dedicated memory capacity.
Profile Breakdown: On an NVIDIA A100 (80GB model), the 1g.10gb profile allocates 1/7th of the GPU's compute power and 10GB of its memory.
Maximizing Concurrency: Since the A100 is physically limited to a maximum of 7 MIG instances, using the smallest possible compute slice (1g) that still meets the user's specific memory threshold (10GB) is the mathematically optimal "bin-packing" strategy.
Isolation: Each of these seven clients receives a hardware-isolated path for compute, cache, and memory, ensuring that one client's heavy inference request cannot cause latency spikes for the other six (Quality of Service).
The NCP-AII Context: The exam tests your ability to choose the correct profile for specific SLAs. You are expected to know that while the A100 exposes 8 memory slices, it provides only 7 compute slices, which caps MIG at 7 user instances.
Incorrect: B. Use CUDA streams within a single instance CUDA Streams are a software-level concurrency mechanism. While they allow for overlapping execution, they do not provide hardware isolation. In a high-concurrency production environment, a single "rogue" stream could saturate the memory bandwidth or cause a kernel error that crashes the entire application for all clients. MIG is preferred here because it provides fault isolation at the silicon level.
C. Disable MIG and use Multi-Process Service (MPS) MPS allows multiple processes to share a single GPU, but it does not provide the same level of memory and fault isolation as MIG. Furthermore, the question specifically asks for a "MIG configuration strategy." Disabling MIG to use MPS ignores the hardware-partitioning benefits required for guaranteed per-client resources in a multi-tenant environment.
D. Select the '2g.10gb' profile The 2g.10gb profile (available on some A100 configurations) uses two compute slices but still only 10GB of memory. Because an A100 only has 7 total compute slices available for MIG, choosing a "2g" profile would limit you to a maximum of 3 instances (2+2+2 = 6, with 1 slice remaining unused). This fails the requirement to "maximize the number of independent clients."
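The seven-instance strategy above maps to a short nvidia-smi sequence. A minimal sketch, assuming an MIG-capable A100-80GB at index 0 (enabling MIG mode may require draining workloads and resetting the GPU first; exact profile names and IDs can be confirmed on your system with the list command shown):

```shell
# List the GPU instance profiles this GPU supports (names/IDs vary by model and memory size)
sudo nvidia-smi mig -lgip

# Enable MIG mode on GPU 0 (workloads must be stopped; a GPU reset may be required)
sudo nvidia-smi -i 0 -mig 1

# Create seven 1g.10gb GPU instances, each with its default compute instance (-C)
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C

# Verify the resulting GPU instances
sudo nvidia-smi mig -lgi
```

Each instance then appears as a separate MIG device (e.g., in `nvidia-smi -L`) that can be handed to an independent inference client.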
Question 60 of 60
60. Question
A network engineer is configuring a BlueField-3 Data Processing Unit (DPU) to act as a secure offload engine for a multi-tenant AI cluster. The requirement is to isolate the management traffic from the data traffic while ensuring the DPU can perform hardware-accelerated encryption. Which action is necessary to correctly manage the DPU‘s physical and logical interfaces for this deployment?
Correct
Correct: C Configure the DPU in 'Separated' mode where the ARM cores manage the OOB interface and the network ports are assigned to the host as virtual functions.
The Technical Reason:
Separated Mode (or "Separated Host Mode"): In this configuration, the DPU's internal ARM subsystem is logically isolated from the host's data path. The ARM cores run their own OS (typically Ubuntu) and manage the Out-of-Band (OOB) 1GbE management port.
Virtual Functions (SR-IOV): The high-speed network ports (400Gbps) are presented to the host OS as Virtual Functions (VFs) or Scalable Functions (SFs). This allows the host to send and receive data at line rate while the DPU's ARM cores handle management, telemetry, and control-plane tasks independently.
Hardware-Accelerated Encryption: Even in Separated mode, the DPU's dedicated hardware engines (for IPsec, TLS, or MACsec) can be utilized via the NVIDIA DOCA framework to encrypt data in motion without taxing the host's CPUs.
The NCP-AII Context: The exam validates your ability to "Configure and manage a BlueField network platform." Understanding how to toggle between DPU Mode (where the DPU owns the embedded switch) and Separated/NIC Mode (where it acts as an accelerator for the host) is a critical objective.
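The mode toggle is applied through firmware configuration with mlxconfig. A minimal sketch, assuming the MST device path shown (an example; list yours with `sudo mst status`) and the documented INTERNAL_CPU_MODEL values, where SEPARATED_HOST is 0 and EMBEDDED_CPU (DPU mode) is 1; supported modes vary by BlueField generation and firmware release:

```shell
# Start Mellanox Software Tools to expose the firmware configuration device
sudo mst start

# Query the current operation mode of the BlueField-3
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q INTERNAL_CPU_MODEL

# Select Separated Host mode: ARM cores keep the OOB interface,
# the high-speed ports are owned by the host (SEPARATED_HOST = 0)
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 s INTERNAL_CPU_MODEL=0

# Enable SR-IOV so the ports can be presented to the host as Virtual Functions
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 s SRIOV_EN=1 NUM_OF_VFS=16
```

A full power cycle is required before the mode change takes effect; afterwards, VFs are instantiated from the host side via the PF's sriov_numvfs sysfs entry.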
Incorrect Options: A. Manually bridge management with InfiniBand Bridging the management interface (low-speed, insecure) with the high-speed data fabric (InfiniBand/400GbE) is a major security violation. One of the primary purposes of a DPU is to maintain a "Physical Air Gap" or logical isolation between the management network and the production data network to prevent lateral movement by attackers.
B. Enable 'MIG' on the BlueField-3 DPU Multi-Instance GPU (MIG) is a technology exclusive to NVIDIA GPUs (like the H100 or A100) for partitioning SMs and memory. It does not apply to DPUs. To partition network bandwidth on a DPU, you would use Quality of Service (QoS), Rate Limiting, or SR-IOV/Scalable Functions, but not MIG.
D. Flash with ConnectX-7 firmware to disable ARM cores While a BlueField-3 DPU physically contains ConnectX-7 silicon, "downgrading" it to a standard NIC by disabling the ARM cores defeats the purpose of the DPU deployment. Disabling the ARM cores (effectively entering NIC Mode) removes the ability to run DOCA-based security agents, firewalls, or isolated management, which were the specific requirements of the engineer in the prompt.