NVIDIA NCP-AII Practice Test 5
Question 1 of 60
A cluster administrator is performing a single-node stress test as part of the verification process. The goal is to ensure the node can handle peak loads without thermal throttling or power failure. Which tool is commonly used to execute a High-Performance Linpack (HPL) test, and what does a successful HPL result indicate about the node health?
A. The ping utility is used to send large packets…
Incorrect: ping is a basic network diagnostic tool used to test reachability and latency via ICMP. It cannot exercise the GPU or CPU compute cores and has no relevance to the mathematical intensity or thermal stress required by an HPL test.
B. The standard Linux stress command is used…
Incorrect: The stress (or stress-ng) utility is designed to put a load on the CPU, memory, and I/O of a Linux system. However, it is not optimized for NVIDIA GPUs and does not perform the dense linear algebra operations characteristic of HPL. Furthermore, filesystem integrity is verified with tools like fsck, not compute stress tests.
C. The NVIDIA Container Toolkit is used… to verify the web server.
Incorrect: While the NVIDIA Container Toolkit (specifically the container runtime) is indeed used to run the HPL container, the purpose of HPL is not to verify web server configurations (like NGINX or Apache). Web server health is irrelevant to the hardware-level floating-point performance that HPL measures.
D. An HPL-optimized container is used to stress the GPUs and CPUs…
Correct: According to the NCP-AII curriculum, the standard method to run HPL is via an NGC-optimized container (like the hpl or hpc-benchmarks containers from the NVIDIA NGC catalog).
The Test: It solves a massive system of linear equations, forcing the GPUs (and CPUs) to operate at their maximum theoretical floating-point capability.
The Result: A stable result (measured in TFLOPS) without crashes or thermal throttling confirms that the system power delivery (PSUs) and cooling infrastructure (fans and airflow) are robust enough to handle the extreme heat and power draw of a real-world AI training job.
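As a sketch of how such a run might be launched (the container tag, entry script, and input-file path below are illustrative; check the NGC hpc-benchmarks catalog entry for the current image and entry point):

```shell
# Pull and run the NGC HPC-Benchmarks container on all local GPUs.
# --shm-size avoids MPI/NCCL shared-memory limits inside the container.
docker run --rm --gpus all --shm-size=1g \
  nvcr.io/nvidia/hpc-benchmarks:24.03 \
  mpirun -np 8 ./hpl.sh --dat ./hpl-linpack/sample-dat/HPL-dgx-1N.dat

# In a second terminal, watch for throttling and power draw during the run:
# nvidia-smi --query-gpu=clocks_throttle_reasons.active,temperature.gpu,power.draw \
#   --format=csv -l 5
```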
Question 2 of 60
To facilitate seamless workload orchestration in an AI factory, an administrator is configuring a Slurm cluster with Enroot and Pyxis. What is the specific purpose of the Pyxis plugin in this NVIDIA-based AI infrastructure software stack?
Option A: Incorrect. Pyxis is not a storage driver or a data transfer tool. While the NGC (NVIDIA GPU Cloud) registry is a common source for AI containers, Pyxis does not handle the "automatic downloading" of models. Its role is strictly related to how Slurm interfaces with the container runtime on the compute nodes.
Option B: Correct. This is the standard definition within the NVIDIA AI Enterprise and Base Command stacks. Pyxis is a Slurm SPANK (Slurm Plug-in Architecture for Node and Job Control) plugin. Its specific job is to allow Slurm to natively "speak" to Enroot. By using the --container-image flag in a Slurm script, Pyxis automates the process of pulling, creating, and starting an unprivileged Enroot container for the user, removing the need for complex manual container commands.
Option C: Incorrect. Power management for NVIDIA GPUs is handled by the NVIDIA Management Library (NVML) and the NVIDIA GPU Driver, often monitored by tools like DCGM (Data Center GPU Manager). Pyxis operates at the job-scheduling and containerization layer, not the electrical or power-sequencing layer of the hardware.
Option D: Incorrect. Hardware-level encryption for InfiniBand (often referred to as "Shield" or secure fabric) is a feature of the NVIDIA Quantum-2 switches and ConnectX network adapters. Pyxis is a software plugin for the Slurm scheduler and does not interact with the InfiniBand kernel modules or fabric encryption keys.
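A minimal Slurm batch script using the Pyxis flags might look like this (the image tag, mount paths, and script name are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=pyxis-demo
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8

# Pyxis adds --container-image / --container-mounts to srun; Enroot then
# pulls and unpacks the image unprivileged on each allocated node.
srun --container-image=nvcr.io/nvidia/pytorch:24.05-py3 \
     --container-mounts=/data:/data \
     python /data/train.py
```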
Question 3 of 60
During the configuration of a High Availability control plane in Base Command Manager, the administrator must verify that the failover mechanism is working correctly. Which component is responsible for maintaining the cluster database and ensuring that the standby head node can take over if the primary head node fails?
A. The LDAP service used for user authentication and authorization
Incorrect: While LDAP (or Active Directory) is necessary to ensure that users have the same credentials across the cluster, it is a service that consumes or runs on the infrastructure. It does not possess the logic to monitor head node health or coordinate a failover transition.
B. The NVIDIA Container Toolkit running on the compute nodes
Incorrect: This toolkit is responsible for mapping GPU devices into containers (like Docker or Enroot) on individual compute nodes. It operates at the "worker" level and has no visibility or control over the "manager" (head node) status or the cluster-wide database.
C. The Slurm database daemon slurmdbd running on the login node
Incorrect: Slurmdbd is used for job accounting and logging workload history. While it is a critical part of a functional AI cluster, it is a workload management component. It does not manage the underlying BCM infrastructure failover or the synchronization of the head node configuration files.
D. A shared heartbeat mechanism and synchronized database across head nodes
Correct: In a Base Command Manager HA configuration, the primary and standby head nodes communicate via a heartbeat. This heartbeat allows the standby node to detect if the primary has failed. Simultaneously, the cluster database (which contains the configuration for all nodes and images) is kept in a synchronized state (often using technologies like MariaDB Galera or DRBD). This ensures that if the standby takes over, it has the exact same data as the primary.
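The HA state can typically be inspected from a head node with the BCM HA utility (output format varies by BCM version; treat this as a sketch):

```shell
# Show HA status of both head nodes: heartbeat health, database
# synchronization state, and which node currently holds the shared IP.
cmha status

# To exercise the failover path deliberately, a manual takeover can be
# initiated from the passive head node:
# cmha makeactive
```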
Question 4 of 60
When managing a BlueField-3 network platform, an engineer needs to update the DOCA software framework and the DPU firmware. Which tool is the most appropriate for performing this update directly from the host server, and what type of file is typically used for a full DPU image update?
A. The DPU updates itself automatically by connecting to the NVIDIA NGC cloud service
Incorrect: While NVIDIA NGC is a repository for containers and models, BlueField DPUs do not perform “silent“ or automatic OS/firmware upgrades upon reboot. Updates must be orchestrated by an administrator to ensure compatibility with the host drivers and the network fabric.
B. Download a standard Ubuntu .iso file and use a virtual CD-ROM drive in the BMC
Incorrect: While the DPU runs a Linux-based OS (often Ubuntu-based), a standard generic ISO does not contain the necessary drivers, firmware, and DOCA (Data Center Infrastructure-on-a-Chip Architecture) components required for BlueField hardware. Furthermore, the installation process typically leverages the host-to-DPU internal path rather than a traditional BMC virtual media mount meant for the server's main CPU.
C. Use the BFB (BlueField Binary) image file and the bfb-install utility
Correct: The BFB (BlueField Binary) is the standard bundled image provided by NVIDIA. It contains the DPU's operating system (the "runtime" OS), the firmware, and the DOCA drivers. The bfb-install utility (or cat commands to the RShim device) is the standard method used from the host server to push this binary image to the DPU's ARM cores for a fresh installation or full upgrade.
D. Use the nvidia-smi tool with the --update-firmware flag
Incorrect: nvidia-smi is the primary tool for managing GPUs. While it can provide some telemetry for DPUs in certain converged configurations, it is not the tool used for flashing DPU operating systems or DOCA frameworks. DPU firmware is typically managed via mstflint (from the MFT suite) or bundled within the BFB image.
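On the host, a full DPU image update might look like the following (the BFB filename and RShim device index are illustrative; match them to your downloaded bundle and `ls /dev/rshim*` output):

```shell
# Confirm the RShim driver exposes the DPU to the host.
ls /dev/rshim0/

# Push the bundled OS + firmware + DOCA image to the DPU's Arm cores.
bfb-install --bfb bf-bundle-2.7.0_24.04_ubuntu-22.04_prod.bfb --rshim rshim0

# Equivalent low-level method, writing the image straight to the boot device:
# cat bf-bundle-2.7.0_24.04_ubuntu-22.04_prod.bfb > /dev/rshim0/boot
```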
Question 5 of 60
An engineer is running ClusterKit to perform a multifaceted node assessment on a newly deployed cluster. One of the tests involves verifying the East-West (E/W) fabric bandwidth using NVIDIA Collective Communications Library (NCCL) tests. What is the primary purpose of this specific NCCL test in the context of cluster verification?
A. To test the internet gateway's ability to download large datasets
Incorrect: NCCL tests focus on internal cluster traffic, often referred to as East-West (E/W) traffic. Downloading from the NGC catalog is North-South traffic. While internet connectivity is important for setup, NCCL is not used to measure gateway throughput or external internet speeds.
B. To confirm that the fans in the network switches are spinning at the correct RPM
Incorrect: This is a hardware environmental check. While high-speed data transfers cause heat that requires fans to spin faster, NCCL is a software library that measures data throughput and latency. Monitoring fan RPM would be handled by the switch's OS (e.g., NVIDIA Cumulus Linux) or a BMC/SNMP monitoring tool, not a collective communications test.
C. To verify that the local hard drives on each node can reach their maximum speeds
Incorrect: Testing local drive speed is an I/O benchmark (typically using tools like fio). NCCL bypasses local disk storage to move data directly between GPU memories across the network fabric. It is a test of the network and memory bus, not the storage subsystem.
D. To ensure that the InfiniBand or Ethernet fabric can support the high-speed data transfers between GPUs on different nodes
Correct: Distributed AI training (like training Large Language Models) requires GPUs on different nodes to constantly synchronize gradients. This requires massive bandwidth and low latency. The NCCL test (such as nccl-tests or all_reduce_perf) verifies that the InfiniBand (IB) or RoCE (RDMA over Converged Ethernet) fabric is correctly configured to handle these collective operations at near-peak theoretical speeds.
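A typical fabric check with nccl-tests launched over MPI might look like this (hostnames, process counts, and the size sweep are illustrative):

```shell
# Run an all-reduce sweep across 2 nodes x 8 GPUs (16 ranks total),
# from 8 bytes up to 4 GB, doubling each step, one GPU per rank.
mpirun -np 16 -H node01:8,node02:8 \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1

# The reported bus bandwidth ("busbw") at large message sizes should sit
# near the fabric's line rate; a large shortfall points to a misconfigured
# or degraded link (wrong rail, missing RDMA, flapping cable, etc.).
```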
Question 6 of 60
A 128-node AI cluster is experiencing intermittent performance degradation during large-scale training runs. The administrator notices that some nodes show XID Error 61 in the dmesg logs. What does this specific error typically indicate in an NVIDIA GPU environment, and what is the appropriate troubleshooting step?
A. XID 61 indicates a GPU memory clock toggle failure; check PSUs or replace the GPU.
Correct: According to the official NVIDIA XID catalog, XID 61 is an internal microcontroller error (often related to a breakpoint or memory clock toggle failure). In an enterprise AI environment, this is typically a hardware-level event. It can be triggered by insufficient power (voltage sags from a failing PSU) or physical instability of the GPU itself. The recommended troubleshooting involves checking the physical power delivery and, if the error persists, RMA-ing the faulty hardware.
B. XID 61 is a warning that the GPU is too cold.
Incorrect: NVIDIA GPUs do not have an XID error to indicate that they are "too cold." Modern GPUs operate efficiently at lower temperatures. While thermal issues exist, they almost always involve overheating (which leads to thermal throttling or XID 43/45), not under-heating.
C. XID 61 indicates that the NVIDIA driver is unlicensed.
Incorrect: While NVIDIA vGPU software requires licensing (managed through the NVIDIA License System), a license failure does not produce an XID 61. Licensing issues typically result in restricted performance (clocks capped at a lower speed) or specific error messages in the license client logs, but they do not manifest as microcontroller hardware breakpoints.
D. XID 61 indicates that the GPU has run out of memory (OOM).
Incorrect: Out of Memory errors are typically handled at the application level (e.g., a “CUDA out of memory“ error in PyTorch). While some memory-related faults can trigger XIDs, such as XID 13 (Graphics Engine Exception/illegal address) or XID 31 (MMU fault), XID 61 is specifically reserved for internal microcontroller and clocking failures.
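XID events are easy to surface from the kernel log. The sample log line below is fabricated for illustration, but the `NVRM: Xid` prefix matches the format the driver emits:

```shell
# On a live node you would scan the real kernel log:
#   dmesg -T | grep -i 'NVRM: Xid'
# Here, a fabricated sample line demonstrates the extraction.
cat <<'EOF' > /tmp/dmesg_sample.log
[Tue Jan  7 10:02:11 2025] NVRM: Xid (PCI:0000:4b:00): 61, pid=1234, name=python
EOF

# Pull out the device and XID code for triage.
grep -oE 'Xid \(PCI:[^)]*\): [0-9]+' /tmp/dmesg_sample.log
# Prints: Xid (PCI:0000:4b:00): 61
```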
Question 7 of 60
7. Question
In a physical layer management scenario, an engineer is troubleshooting a BlueField-3 DPU that is not achieving the expected line rate. They suspect a configuration issue in the physical layer settings. Which command-line utility should be used to verify the link speed, lane width, and error rates specifically for the NVIDIA network interface on the DPU?
A. The ‘fdisk’ utility
Incorrect: fdisk is a partition table manipulator for Linux. It is used to manage disk drives and storage partitions. It has no capability to interface with network transceivers, monitor signal quality, or verify the lane width of a high-speed InfiniBand or Ethernet link.
B. The ‘ibstat’ or ‘ibv_devinfo’ tools for InfiniBand, or ‘ethtool’ for Ethernet
Correct: These are the standard utilities taught in the NCP-AII curriculum for physical layer verification.
ibstat / ibv_devinfo: Provide critical InfiniBand telemetry, including the link state (Active/Down), physical state (LinkUp), and the negotiated speed/width (e.g., 4x NDR).
ethtool: The go-to tool for Ethernet mode, allowing engineers to see if a port has auto-negotiated to the correct speed and to check for physical layer errors (CRC errors, frame slips) that indicate a faulty cable or transceiver.
C. The ‘ping’ command
Incorrect: While ping verifies basic Layer 3 connectivity, it is a poor tool for troubleshooting “line rate” issues. High latency in a ping could be caused by software interrupts or CPU load; it does not provide telemetry on lane width (e.g., if a 4-lane cable is only operating on 2 lanes) or signal-to-noise ratios.
D. The ‘ifconfig’ command
Incorrect: ifconfig (and its modern replacement ip addr) is used for configuring IP addresses and viewing basic packet counts. It lacks the deep “under-the-hood” physical layer telemetry needed to diagnose lane-specific hardware issues or InfiniBand-specific state transitions.
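As a sketch of the kind of check these tools enable, the snippet below parses ibv_devinfo-style port fields to flag a degraded lane width. The here-doc is a fabricated sample; on the DPU you would pipe the live output of, e.g., ‘ibv_devinfo -v -d mlx5_0’, where the device name is an assumption.

```shell
# Fabricated sample of the port fields reported by ibv_devinfo; replace
# the here-doc with live output, e.g.: ibv_devinfo -v -d mlx5_0
port_info=$(cat <<'EOF'
        state:          PORT_ACTIVE (4)
        active_width:   4X (2)
        active_speed:   100.0 Gbps (128)
EOF
)
# Count lines reporting the expected 4X width; 0 means the link trained
# below its full lane count (a classic cause of missing line rate)
width_ok=$(echo "$port_info" | grep -c 'active_width:[[:space:]]*4X')
if [ "$width_ok" -ge 1 ]; then
    echo "link width OK (4X)"
else
    echo "DEGRADED: link trained below 4X"
fi
```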
Question 8 of 60
8. Question
A deployment team needs to validate the storage subsystem performance for a large NeMo-based Large Language Model (LLM) training job. Which test should be prioritized to ensure that the storage can handle the massive checkpointing requirements of the training process according to NVIDIA standards?
A. A sequential write throughput test using Large Block sizes
Correct: Checkpointing is the process of periodically saving the entire state of a model (weights, optimizer states, and gradients) to durable storage to allow for recovery after a failure. For LLMs, these checkpoints can reach terabytes in size. Because this data is written in large, continuous streams across the cluster (often using parallel file systems like Lustre, Weka, or PixStor), the primary performance bottleneck is sequential write throughput. NVIDIA standard validation (such as using the fio tool) focuses on large block sizes (e.g., 1MB or larger) to ensure the storage can ingest these massive files without stalling the GPUs.
B. A random read IOPS test using small 4KB blocks
Incorrect: Small, random 4KB I/O patterns are typical of Inference workloads or traditional database operations, where thousands of tiny, non-contiguous requests are made. While data ingestion (reading the training dataset) involves reads, checkpointing—the specific focus of this question—is almost entirely a write-heavy operation. Validating with 4KB random reads would not accurately simulate the pressure an LLM checkpoint puts on the storage fabric.
C. A network latency test using traceroute
Incorrect: traceroute is a basic networking utility used to identify the path and hop count to a destination. While low latency is important, the number of “hops” is less critical than the available bandwidth and RDMA (Remote Direct Memory Access) capabilities of the storage network. A traceroute does not measure throughput or the storage subsystem’s ability to handle concurrent writes.
D. A GPU-to-GPU P2P test
Incorrect: GPU-to-GPU Peer-to-Peer (P2P) tests (like those in NCCL) verify the speed of data transfers between GPUs over NVLink or InfiniBand. While GPUDirect Storage (GDS) is a related technology that allows storage to bypass the CPU and go straight to GPU memory, a P2P test is a network/interconnect benchmark, not a storage subsystem performance validation.
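A checkpoint-style validation run can be sketched as an fio job file. Everything below (mount point, sizes, depths, job counts) is a placeholder to tune for the target filesystem, not a prescribed configuration:

```ini
# ckpt-write.fio — hypothetical fio job approximating LLM checkpoint
# traffic; directory, size, and depths are placeholders to tune.
[global]
# Parallel-filesystem mount point (placeholder)
directory=/mnt/pfs/fio
ioengine=libaio
direct=1
# Sequential writes with large blocks, matching checkpoint I/O
rw=write
bs=1M
iodepth=16
size=100G

[ckpt-stream]
# Concurrent writer streams to emulate multiple ranks checkpointing
numjobs=8
group_reporting
```

Run with ‘fio ckpt-write.fio’ and read the aggregate WRITE bandwidth line; sustained throughput far below the storage fabric’s rated rate indicates checkpoints will stall the GPUs.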
Question 9 of 60
9. Question
An administrator is preparing to install a large-scale AI cluster and must establish a hardware Root of Trust before deploying any software. The process involves configuring the Trusted Platform Module (TPM) and validating firmware integrity. Which steps correctly describe the procedure for initializing the TPM and ensuring secure bring-up of the NVIDIA HGX baseboard and system BIOS?
A. Flash verified firmware, enable TPM in BIOS, and clear ownership via OOB.
Correct: This follows the standard NVIDIA-certified procedure for a secure bring-up.
Verified Firmware: Using the official NVIDIA Flash Tool ensures that the HGX baseboard and HMC (HGX Management Controller) are running digitally signed, untampered code.
TPM Enablement: Initializing the TPM 2.0 module in the BIOS allows the system to begin “measured boot” processes.
OOB Management: Clearing TPM ownership via the Out-of-Band (OOB) interface (like the BMC) is the standard enterprise method to ensure the administrator has exclusive control over the security keys before the OS deployment begins.
B. Enable Secure Boot only after the operating system installation.
Incorrect: Secure Boot should be enabled before OS installation to ensure that only a signed bootloader can initiate the install. Enabling it after installation can lead to “signed-only” enforcement failures if the OS was installed with unsigned components, defeating the purpose of a hardware root of trust.
C. Physically remove the GPU baseboard to access a manual reset jumper.
Incorrect: Modern NVIDIA HGX systems are designed for data center serviceability. Critical security resets are handled through secure software protocols (Redfish/IPMI) or authenticated BIOS sessions. Physically dismantling an HGX baseboard for a TPM reset is not a standard operational procedure and poses a high risk of hardware damage.
D. Use Base Command Manager to bypass all TPM checks.
Incorrect: While Base Command Manager (BCM) streamlines deployment, it is designed to enforce security standards, not bypass them. Bypassing TPM checks would leave the cluster vulnerable to firmware-level attacks and would violate the “Hardware Root of Trust” requirement specified in the scenario.
Question 10 of 60
10. Question
After the physical installation and software configuration, an engineer runs the High-Performance Linpack (HPL) benchmark on a single node. What is the primary objective of running HPL during the Cluster Test and Verification phase for an NVIDIA AI infrastructure?
A. To verify the installation of the NGC CLI
Incorrect: While the ngc command-line interface is used to pull the benchmark containers from the NGC catalog, running the actual HPL benchmark is far too complex and resource-intensive just to verify a CLI installation. Simple commands like ngc --version are sufficient for that purpose.
B. To stress the GPUs and CPU to verify thermal stability and peak floating-point performance
Correct: High-Performance Linpack (HPL) is the “gold standard” stress test used in the NCP-AII curriculum. It solves a dense system of linear equations, which pushes the CPU and GPU floating-point units to their maximum theoretical limits. This generates peak heat, making it the primary tool for validating that the server’s thermal solutions (fans, heatsinks, and data center cooling) can maintain stability under sustained 100% load without thermal throttling or hardware failure.
C. To test the latency of the OOB management network
Incorrect: Out-of-Band (OOB) management (BMC/IPMI) is used for server health monitoring and remote power control. It operates on a separate, low-speed network. HPL is a compute-intensive benchmark that focuses on the internal processor and memory performance, not the external management network latency.
D. To measure the maximum theoretical bandwidth of the storage array
Incorrect: HPL is an in-memory compute benchmark. Once the initial data is loaded into the GPU/CPU memory, there is very little storage I/O. For storage bandwidth validation, the NCP-AII curriculum points to tools like fio or specialized storage benchmarks, rather than HPL.
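One way to read an HPL result as a health signal is efficiency: achieved GFLOPS divided by theoretical peak. A toy calculation, where both numbers are placeholders (achieved comes from the HPL output table, peak from the accelerator datasheet):

```shell
# Placeholder figures: achieved GFLOPS from the HPL result line, peak
# GFLOPS from the node's theoretical maximum (hypothetical values)
achieved_gflops=55000
peak_gflops=60000
# Efficiency = achieved / peak; a healthy node sustains a high, stable
# fraction of peak, while throttling shows up as a shortfall
eff=$(awk -v a="$achieved_gflops" -v p="$peak_gflops" \
    'BEGIN { printf "%.1f", 100 * a / p }')
echo "HPL efficiency: ${eff}%"
```

Efficiency well below the expected range for the platform usually points to thermal throttling or power capping during the run rather than a software issue.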
Question 11 of 60
11. Question
An IT architect is deploying NVIDIA Base Command Manager (BCM) to manage a new AI cluster. To ensure high availability (HA) of the management plane, which configuration must be implemented according to the BCM best practices during the installation phase?
A. Configure the BlueField DPU on each node to act as its own independent head node
Incorrect: While BlueField DPUs are powerful “computers-on-a-card” that offload networking and security tasks, they are not intended to replace the centralized cluster management server in BCM. The BCM head node maintains a complex database and orchestration engine that requires the resources of a dedicated x86 server.
B. Install BCM on a single head node and rely on daily tape backups
Incorrect: This configuration represents a “Single Point of Failure” (SPOF). While backups are necessary for disaster recovery, they do not provide High Availability. If a single head node fails, the entire cluster management interface and provisioning services go offline until a manual restoration is completed, leading to significant downtime.
C. Set up a Primary and a Secondary head node with shared storage or data synchronization, and configure a virtual IP (VIP)
Correct: This is the NVIDIA-recommended architecture for BCM HA.
Redundancy: Two head nodes (Primary and Secondary) ensure that if one fails, the other is ready to take over.
Data Consistency: BCM uses synchronization (such as MariaDB Galera for the database and rsync or shared storage for the software images) to ensure both nodes are identical.
Failover Logic: A Virtual IP (VIP) is used so that compute nodes and administrators always connect to a single IP address. If the Primary fails, the Secondary “claims” the VIP, making the failover transparent to the rest of the cluster.
D. Deploy BCM as a containerized application across all compute nodes
Incorrect: BCM is designed to manage the compute nodes, not run on top of them as a distributed application. A decentralized control plane would create immense complexity in maintaining the “source of truth“ for the cluster configuration and is not the supported deployment model for BCM.
Question 12 of 60
12. Question
A cluster administrator is setting up the job scheduling environment on a new NVIDIA AI factory. They need to install Slurm along with Enroot and Pyxis. What is the primary reason for integrating Enroot and Pyxis with the Slurm workload manager?
A. To translate Slurm commands into Kubernetes YAML files
Incorrect: Slurm and Kubernetes are distinct workload managers. While some tools exist to bridge them, Enroot and Pyxis are specifically designed to provide containerization for HPC (High-Performance Computing) environments without involving the Kubernetes orchestration layer. They allow Slurm to remain the primary scheduler while gaining container-like flexibility.
B. To provide a graphical user interface (GUI)
Incorrect: Slurm, Enroot, and Pyxis are command-line driven tools. While NVIDIA Base Command Manager (BCM) provides a web-based “Base View” for monitoring, the integration of Enroot and Pyxis is a backend system-level configuration that handles how jobs are executed at the OS level.
C. To automatically overclock the GPUs
Incorrect: GPU clock speeds are managed by the NVIDIA driver and BIOS/firmware power profiles. While an administrator can use Slurm prolog scripts to set specific GPU frequencies (using nvidia-smi), Enroot and Pyxis are focused on the software environment isolation (containers), not hardware performance tuning or overclocking.
D. To allow Slurm to natively launch and manage unprivileged containerized workloads
Correct: This is the primary purpose defined in the NCP-AII curriculum.
Enroot: An NVIDIA-developed container runtime that turns Docker/OCI images into simple, unpacked filesystems (SquashFS). It is “unprivileged,” meaning it doesn’t require a root daemon (unlike standard Docker), which is critical for security in multi-tenant AI clusters.
Pyxis: A Slurm SPANK plugin that allows users to use the --container-image flag directly in srun or sbatch commands. It automates the pulling and mounting of the container, allowing researchers to run complex AI stacks (like PyTorch or TensorFlow from NGC) as if they were native applications.
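With Pyxis in place, the integration is visible directly at job submission. A sketch of such a launch, which assumes a Slurm cluster with the Pyxis plugin installed (the image tag and resource flags are illustrative, not prescriptive):

```shell
# Hypothetical Pyxis-enabled launch: Enroot pulls and unpacks the NGC
# image, and the job runs inside it without a root daemon. Requires a
# Slurm cluster with the Pyxis SPANK plugin installed.
srun --nodes=1 --gpus-per-node=8 \
     --container-image=nvcr.io/nvidia/pytorch:24.05-py3 \
     python train.py
```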
Question 13 of 60
13. Question
A cluster node is reporting a 'GPU Fallen Off Bus' error in the system logs. After verifying the physical seating and power connections, what is the next logical step an administrator should take to troubleshoot this hardware fault on an NVIDIA HGX system?
A. Re-install the NGC CLI and use 'ngc fix-gpu'
Incorrect: The NGC CLI is a tool for managing cloud assets, downloading containers, and accessing models; it cannot interact with hardware-level PCIe registers or "fix" physical bus connectivity. There is no ngc fix-gpu command in the NVIDIA utility suite.
B. Check dmesg for PCIe AER messages and use the BMC to check hardware events
Correct: This is the standard troubleshooting procedure taught in the NCP-AII curriculum.
PCIe AER (Advanced Error Reporting): When a GPU "falls off the bus," the Linux kernel often logs AER messages in dmesg that provide specific error codes (e.g., Uncorrectable Error, Completion Timeout). This helps determine if the issue is a signaling problem.
BMC/IPMI: Since the HGX baseboard is managed out-of-band, the Baseboard Management Controller (BMC) logs contain critical telemetry regarding voltage regulators, thermal trips, or physical hardware failures that might have caused the GPU to shut down or disconnect to protect the system.
C. Increase fan speed and disable MIG configuration
Incorrect: While heat can cause hardware instability, increasing fan speed after a GPU has already fallen off the bus will not re-establish the PCIe link. Furthermore, MIG (Multi-Instance GPU) configuration is a software-level partitioning of the GPU memory and compute; it has no role in "re-syncing" a GPU with a BlueField-3 DPU, as the DPU and GPU are separate functional units on the fabric.
D. Swap the InfiniBand transceivers
Incorrect: InfiniBand transceivers manage network connectivity between nodes. A "GPU Fallen Off Bus" error is an internal node issue involving the PCIe/NVLink connection between the CPU and GPU. Swapping network cables or transceivers will not resolve or diagnose a local PCIe bus failure.
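The two checks from option B can be scripted as below. The real dmesg and ipmitool invocations are shown as comments because they need the affected node and its BMC; the AER line is a fabricated sample of what a hit can look like, and the BMC host and credentials are placeholders:

```shell
# On the affected node, scan the kernel ring buffer for PCIe AER reports
# and NVIDIA Xid events (may require root):
#   dmesg | grep -iE 'AER|Xid'
# A hit resembles this sample line (bus addresses fabricated for illustration):
sample="pcieport 0000:20:01.0: AER: Uncorrected (Fatal) error received: 0000:21:00.0"
echo "$sample" | grep -oE 'AER|Fatal' | sort -u
# Out-of-band, dump the BMC's hardware event log (host and credentials
# are placeholders):
#   ipmitool -I lanplus -H <bmc-ip> -U admin -P <password> sel elist
```

The grep/sort pipeline prints the matched keywords (AER, Fatal), confirming the sample would be flagged by the filter above.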
Question 14 of 60
14. Question
A network engineer is configuring an NVIDIA BlueField-3 Data Processing Unit (DPU) to offload infrastructure tasks from the host CPU. The goal is to manage the physical layer and network services directly on the DPU. Which step is essential for ensuring the BlueField platform is correctly integrated into the AI factory's management plane while maintaining secure out-of-band access?
A. Install the NVIDIA GPU driver on the DPU's ARM cores
Incorrect: The DPU is a data processing unit, not a GPU. While it manages the network path to GPUs, it does not run GPU drivers on its internal ARM cores to monitor the HGX baseboard. Monitoring of the HGX baseboard is handled by the HMC (HGX Management Controller) and the host's system management software (like NVSM), not by installing GPU drivers on the DPU itself.
B. Establish a connection to the DPU's console and configure OOB network parameters
Correct: According to the NCP-AII curriculum and BlueField management standards, the DPU has its own Out-of-Band (OOB) management port (typically a 1GbE RJ45 interface). Configuring this interface allows the DPU to have an independent IP address on the management network. This is essential for lifecycle management (updates, monitoring, and recovery) that remains operational even if the host server's OS is down or compromised. This ensures the "infrastructure-on-a-chip" remains a trusted, isolated entity.
C. Configure the iDRAC to bridge the DPU's management port
Incorrect: While the host BMC (like iDRAC, iLO, or NVIDIA's own BMC) can monitor the DPU via the SMBus (using protocols like PLDM/MCTP), bridging the management ports is not a standard practice for AI factories. Bridging would merge the security domains of the host and the DPU, violating the principle of isolation required for high-performance AI infrastructure.
D. Disable the DPU's internal hardware offload engines
Incorrect: The entire purpose of a DPU in an AI factory is to offload tasks (like RDMA, encryption, and storage emulation) from the host CPU. Disabling these engines would force the host CPU to process heavy network traffic, significantly degrading the performance of AI training and inference workloads.
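A minimal sketch of option B's workflow, assuming the rshim driver is loaded on the host and the DPU runs its default Ubuntu image. These commands require the actual hardware; the device node and IP addresses are illustrative:

```shell
# From the host, attach to the DPU's virtual console exposed by the
# rshim driver (device node varies by platform):
minicom -D /dev/rshim0/console

# Inside the DPU's Arm Linux, the dedicated 1GbE OOB management port
# appears as oob_net0. Assign it an address on the management network
# (addresses illustrative; persist via netplan for production use):
ip addr add 10.0.10.21/24 dev oob_net0
ip link set oob_net0 up
```

Once oob_net0 is reachable, firmware updates, monitoring, and recovery of the DPU work independently of the host OS state.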
Question 15 of 60
15. Question
A data center team is deploying a new AI pod and needs to configure Multi-Instance GPU (MIG) for a set of NVIDIA A100 GPUs. The goal is to provide isolation for seven different users. Which command sequence correctly enables MIG and verifies the creation of the smallest possible instances on the first GPU in the system?
A. nvidia-smi -i 0 -mig 1; nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C; nvidia-smi -L
Correct: This follows the precise command sequence required by the NVIDIA driver.
nvidia-smi -i 0 -mig 1: Enables MIG mode on the first GPU (index 0). The change takes effect only after a GPU reset (or reboot).
nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C: Uses GPU Instance Profile ID 19, which corresponds to the 1g.5gb profile (the smallest slice on an A100 40GB, allowing for up to 7 instances). The -cgi flag creates the GPU instances, and the -C flag simultaneously creates the corresponding Compute Instances.
nvidia-smi -L: Lists the GPUs and the newly created UUIDs for each MIG instance to verify success.
B. systemctl enable mig-mode; nvidia-smi -create-instances -size small; show-mig-status
Incorrect: MIG mode is controlled via the NVIDIA driver through nvidia-smi, not through a Linux systemctl service. Furthermore, flags like -create-instances and -size small are not valid nvidia-smi syntax.
C. apt-get install nvidia-mig-manager; mig-part apply profile-7x1g.5gb; nvidia-smi
Incorrect: While the NVIDIA MIG Manager is a real tool used in Kubernetes environments to automate partitioning, the NCP-AII certification focuses on the foundational nvidia-smi commands used for manual configuration. mig-part is not the standard CLI utility for hardware-level instance creation in the base driver.
D. ip link set gpu0 up; docker run --runtime=nvidia --mig-mode=on alpine; nvidia-smi
Incorrect: ip link is used for network interfaces, not GPU devices. Additionally, MIG must be enabled at the hardware/driver level before a container is run. You cannot enable MIG mode via a docker run flag.
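Sketched end-to-end, the correct sequence looks like this. It must run as root on a node with an A100 and no clients holding the GPU, so it is shown as a fragment rather than a runnable script:

```shell
# 1. Enable MIG mode on GPU index 0; the change takes effect after a reset:
nvidia-smi -i 0 -mig 1
nvidia-smi --gpu-reset -i 0   # skip if a full reboot is performed instead

# 2. Create seven 1g.5gb GPU instances (profile ID 19) and, via -C,
#    the matching Compute Instances in a single step:
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# 3. Verify: each MIG device should be listed with its own UUID:
nvidia-smi -L
```

Each of the seven users can then be handed one MIG device UUID, giving them an isolated slice of compute and memory.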
Question 16 of 60
16. Question
A system administrator identifies that a GPU in an NVIDIA HGX H100 system is reporting persistent XID errors and frequent 'Falling off the Bus' events. After attempting a driver reload and a warm reboot without success, what is the next logical step in the troubleshooting and optimization process for this hardware fault?
Option A: Incorrect While a cold reboot (removing power completely) is a valid next step to reset the hardware's volatile state, the second half of this option is the dealbreaker. Under NCP-AII standards, permanently lowering clock speeds is considered a "band-aid" fix that compromises the performance of the AI cluster. Professional infrastructure management requires fixing the root cause, not throttling expensive hardware to mask a defect.
Option B: Incorrect This option confuses the data plane with the management/compute plane. While InfiniBand errors can cause application crashes in a cluster, they do not cause a GPU to report "Falling off the Bus" or persistent XID errors to the local OS. GPU bus errors are specific to the PCIe or SXM interface between the GPU and the CPU/Baseboard.
Option C: Correct This aligns with NVIDIA's hardware field-service protocols. In an HGX system, the GPUs are mounted via high-density SXM connectors. If software-level resets fail, the fault is likely physical: either a mounting/torque issue or a component failure. Physical inspection followed by an RMA (Return Merchandise Authorization) is the logical path to restore the system to its validated, high-performance state.
Option D: Incorrect Reinstalling the entire Linux operating system is a "scorched earth" approach that rarely solves a hardware-level communication failure. In high-density AI environments, downtime is costly; troubleshooting should be surgical. If the GPU is failing at the bus level, a fresh OS will simply encounter the same hardware interrupt errors.
Question 17 of 60
17. Question
A ClusterKit assessment is performed on a newly deployed AI cluster. The report indicates a failure in the node-to-node communication check. Which of the following is the most logical next step to narrow down the cause of this failure in a multi-node AI factory environment?
A. Replacing all the Category 6 cables used for the management network
Incorrect: Category 6 (Cat6) cables are used for the 1GbE or 10GbE management network (Out-of-Band). While a management network failure would prevent you from controlling a node, "node-to-node communication" in the context of ClusterKit assessments typically refers to the High-Speed Fabric (InfiniBand or RoCE) used for data transfer. Replacing management cables will not fix throughput or connectivity issues on the data plane.
B. Verifying the signal quality and firmware versions of the transceivers
Correct: According to NCP-AII troubleshooting standards, the physical layer of the high-speed interconnect is the most common point of failure.
Signal Quality: Using tools like mlxlink allows an engineer to check the Bit Error Rate (BER) and eye diagrams. Poor signal quality often points to a "dirty" fiber or a failing transceiver.
Firmware: Transceivers and cables (especially Active Optical Cables or AOCs) have firmware. Incompatibilities between the cable firmware and the Switch/HCA (Host Channel Adapter) firmware can cause link instability or complete communication failure.
C. Reinstalling the operating system on the primary head node
Incorrect: The head node manages the cluster, but node-to-node communication happens directly between the compute nodes over the fabric. Reinstalling the OS on the head node is a drastic and irrelevant step that does not address the networking hardware or configuration on the compute nodes themselves.
D. Reducing the GPU clock speed to decrease power consumption
Incorrect: While reducing clock speeds (throttling) might be used to troubleshoot power or thermal issues, it has no direct relationship with the network's ability to establish a link or pass data. A communication failure is a networking issue, not a GPU compute or power-draw issue.
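The physical-layer checks described in option B map to NVIDIA MFT commands such as the following. They require a node with a ConnectX/BlueField adapter; the device name mlx5_0 is illustrative:

```shell
# Link state, speed, and module (transceiver) details for the first HCA port:
mlxlink -d mlx5_0 --show_module

# Physical-layer counters, including raw bit-error rates:
mlxlink -d mlx5_0 --show_counters

# Enumerate attached cables/transceivers and their firmware versions:
mlxcables
```

A rising raw BER or a transceiver firmware mismatch reported here is the typical root cause behind a ClusterKit node-to-node failure.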
Question 18 of 60
18. Question
A cluster experiences intermittent network performance drops during large-scale NeMo burn-in tests. Troubleshooting reveals that several fan modules in the leaf switches have failed. How does fan failure in a network switch impact the performance of the AI cluster compute fabric?
Correct: D. Thermal throttling of the switch ASIC can lead to dropped packets and increased latency, which severely degrades the performance of NCCL collective operations.
This option is correct because it accurately describes the causal chain of events following a fan failure in an AI compute fabric. High-performance switches generate significant heat, and fans are critical for maintaining operational temperatures. When fans fail, the switch ASIC (Application-Specific Integrated Circuit) overheats and triggers thermal throttling to protect itself. This throttling reduces the processing capacity of the switch, leading to two critical issues for AI workloads:
Dropped Packets and Increased Latency: As the ASIC struggles to process traffic, the switch becomes unable to keep up with the data rate, resulting in packet loss and higher latency.
Severe Degradation of NCCL: The NVIDIA Collective Communications Library (NCCL) is extremely sensitive to network imperfections. It relies on tightly coupled, low-latency communication and assumes a near-lossless network for optimal performance. Packet loss and increased latency directly disrupt the synchronization between GPUs, force retransmissions, and stall communication pipelines, leading to a dramatic drop in the performance of collective operations like all-reduce.
Incorrect: A. The switch will automatically increase the packet size to compensate for the lack of cooling, causing fragmentation on the network.
This is incorrect. There is no mechanism or logic in network switches that links cooling status to packet size. Packet size is determined by the MTU (Maximum Transmission Unit) settings and the applications generating the traffic, not by thermal conditions. A switch does not have the intelligence or capability to manipulate packet size as a compensatory measure for hardware failures.
B. The fans are only for noise reduction and have no impact on the electronic performance or throughput of the switch hardware.
This is incorrect. Fans in data center switches, especially those in AI clusters, are mission-critical components for thermal management. They are designed to provide forced-air cooling to dissipate the intense heat generated by high-power ASICs and optics. Without proper cooling, components overheat, leading to performance degradation (throttling), instability, and permanent hardware damage. They are not for noise reduction.
C. The failure of a fan causes the switch to switch from InfiniBand mode to Ethernet mode to reduce the electrical load on the power supplies.
This is incorrect. The operational mode of a network switch (e.g., InfiniBand vs. Ethernet) is a fundamental firmware or software configuration. It is not a dynamic feature that can be toggled automatically by the switch in response to environmental factors like a fan failure. There is no such mechanism to switch protocols to reduce power load; a fan failure would lead to thermal events, not a protocol conversion.
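From the host side, the packet loss and latency described above can be confirmed by scanning the fabric's error counters rather than waiting for the training job to stall. A minimal sketch, assuming the standard infiniband-diags tools are installed; the threshold helper at the end is purely illustrative:

```shell
# Scan every fabric port for non-zero error counters; a switch port behind
# a thermally throttled ASIC typically shows rising discard/receive-error
# counts. (ibqueryerrors ships with the infiniband-diags package.)
if command -v ibqueryerrors >/dev/null 2>&1; then
  ibqueryerrors --details
fi

# Illustrative helper: flag a counter once it crosses a threshold.
is_suspect() {
  # usage: is_suspect <counter-value> <threshold>
  [ "$1" -gt "$2" ]
}

is_suspect 120 100 && echo "port discarding packets"
```

Correlating which ports show errors with the physical location of the failed fan modules ties the symptom back to the thermal root cause.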
Question 19 of 60
19. Question
A technician identifies a faulty GPU in an HGX H100 system that is causing frequent training job failures. After confirming the fault via the BMC logs and nvidia-smi, what is the correct professional procedure for replacing this Field Replaceable Unit (FRU)?
Correct
A. Use a soldering iron to replace individual memory chips
Incorrect: Modern GPU systems like the H100 use HBM (High Bandwidth Memory) that is physically integrated into the GPU package using advanced 2.5D/3D packaging. These components are not serviceable at the board level in a data center environment. Attempting such a repair would void the warranty and likely result in permanent destruction of the hardware.
B. Power down the server, follow anti-static procedures, replace the unit, and re-validate with ClusterKit
Correct: This follows the NVIDIA Professional Services and NCP-AII standard operating procedure:
Safety: Powering down is mandatory for HGX tray service.
ESD Protection: Using anti-static (ESD) wrist straps and mats is critical when handling sensitive AI hardware to prevent static discharge damage.
FRU Logic: In HGX systems, depending on the specific vendor design, the entire GPU tray or baseboard is often treated as a single Field Replaceable Unit (FRU) to ensure thermal and electrical consistency.
Verification: After a replacement, running ClusterKit (specifically the HPL and NCCL tests) is the required step to ensure the new hardware is correctly recognized and performing within specification.
C. Hot-swap the GPU while the server is running
Incorrect: NVIDIA HGX systems are not hot-swappable. The GPUs are connected via high-speed NVLink and PCIe interfaces that do not support live removal. Attempting to pull a GPU or tray while the system is powered on would cause a catastrophic electrical event, likely damaging the rest of the cluster and the system motherboard.
D. Spray with compressed air and re-seat InfiniBand transceivers
Incorrect: While dust can cause thermal issues, a "faulty GPU" confirmed via BMC logs and nvidia-smi (such as an XID 61 or uncorrectable ECC error) indicates a hardware component failure. Furthermore, InfiniBand transceivers are part of the network fabric; re-seating them would not resolve an internal GPU hardware fault.
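Before handing the node back after the swap, a quick host-side sanity pass usually precedes the full ClusterKit run. A hedged sketch using nvidia-smi and DCGM (the expected GPU count of 8 is an assumption for a typical HGX H100 tray; exact validation steps vary by vendor):

```shell
# Post-replacement sanity checks (flags per nvidia-smi/DCGM documentation;
# EXPECTED_GPUS=8 is an assumption for an 8-GPU HGX baseboard).
EXPECTED_GPUS=8

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -L            # do all GPUs enumerate after the swap?
  nvidia-smi topo -m       # is the NVLink topology intact?
  found=$(nvidia-smi -L | grep -c '^GPU')
  [ "$found" -eq "$EXPECTED_GPUS" ] || echo "GPU count mismatch: $found"
fi

# Longer burn-in: DCGM level-3 diagnostic, followed by the ClusterKit
# HPL/NCCL runs referenced above.
if command -v dcgmi >/dev/null 2>&1; then
  dcgmi diag -r 3
fi
```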
Question 20 of 60
20. Question
An administrator is managing a cluster where some nodes use BlueField DPUs and others use standard ConnectX HCAs. During a network audit, it is discovered that the BlueField DPUs are not correctly negotiating the link speed on a new 400G spine switch. Which physical layer management step should be prioritized to resolve this communication issue between the DPU and the switch fabric?
Correct
A. Delete the 'ovs-vswitchd' configuration on the DPU ARM cores
Incorrect: ovs-vswitchd is a software-defined networking component (Open vSwitch) that manages virtual switching after the physical link is established. It does not control the low-level hardware link negotiation between the DPU and the spine switch. Deleting its configuration would disrupt virtual networking but would not fix a 400G physical layer negotiation failure.
B. Apply a liquid cooling solution to the transceivers
Incorrect: While high-speed transceivers generate significant heat, they are designed to operate within specific thermal envelopes using the server's standard airflow. If a transceiver overheats, it may shut down or error out, but it will not "throttle to 10G" as a standard safety feature. Furthermore, "applying liquid cooling" directly to transceivers is not a standard data center practice or a troubleshooting step for link negotiation.
C. Physically swap the DPU into a PCIe Gen 3 slot
Incorrect: 400G throughput requires PCIe Gen 5 bandwidth to prevent bottlenecks. Moving a BlueField-3 DPU to a PCIe Gen 3 slot would severely limit performance and increase signal latency. Additionally, PCIe signal noise is handled by the motherboard and DPU shielding; downgrading the slot generation is not a valid fix for 400G fabric-side negotiation issues.
D. Check the cable 'EEPROM' information and firmware, and use 'mlxconfig'
Correct: This aligns with the NCP-AII Physical Layer Management domain.
EEPROM/Firmware: 400G (NDR) links are highly sensitive to cable quality and transceiver firmware compatibility. The administrator must verify that the cable is recognized and that the transceiver firmware is up to date using tools like mlxcables.
mlxconfig: This tool is used to modify non-volatile hardware configurations. A common cause of failed negotiation on NVIDIA VPI (Virtual Protocol Interconnect) devices is the LINK_TYPE being incorrectly set (e.g., forced to InfiniBand when the switch is Ethernet, or vice versa). Setting the correct LINK_TYPE_P1 and LINK_TYPE_P2 ensures the DPU uses the correct protocol for the fabric.
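The checks above can be sketched as a short session; the MST device path below is a placeholder (discover the real one with `mst start && mst status`), and the set command is left commented because it rewrites non-volatile configuration:

```shell
# Physical-layer checks for a BlueField/ConnectX port (device path is a
# placeholder; find yours with `mst start && mst status`).
DEV=${DEV:-/dev/mst/mt41692_pciconf0}

# 1. Read cable/transceiver EEPROMs: vendor, part number, firmware rev.
if command -v mlxcables >/dev/null 2>&1; then
  mlxcables
fi

# Decode helper for the numeric LINK_TYPE values mlxconfig prints.
link_type_name() {
  case "$1" in
    1) echo "InfiniBand" ;;
    2) echo "Ethernet" ;;
    *) echo "unknown" ;;
  esac
}

if command -v mlxconfig >/dev/null 2>&1; then
  # 2. Query the current port protocol configuration.
  mlxconfig -d "$DEV" query LINK_TYPE_P1 LINK_TYPE_P2
  # 3. Example: force both ports to Ethernet for a 400GbE Spectrum fabric.
  #    A reboot (or mlxfwreset) is required before the change takes effect.
  # mlxconfig -d "$DEV" set LINK_TYPE_P1=2 LINK_TYPE_P2=2
fi
```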
Question 21 of 60
21. Question
To verify the health of the inter-node East-West fabric, the administrator runs an NCCL all_reduce test across 16 nodes. The results show a significant bandwidth bottleneck. Which specific check should the administrator perform on the InfiniBand switches and BlueField-3 DPUs to troubleshoot this network performance issue?
Correct
Correct: C. Confirm that the Adaptive Routing and Congestion Control settings are correctly configured and consistent across the fabric switches and DPUs.
This option is correct because Adaptive Routing and Congestion Control are critical features in high-performance AI fabrics (both InfiniBand and NVIDIA Spectrum-X Ethernet) that directly impact NCCL all_reduce performance. When these settings are misconfigured or inconsistent across the fabric, they create the exact symptoms described—significant bandwidth bottlenecks in collective operations.
Adaptive Routing solves a fundamental problem in AI clusters: traditional ECMP (Equal Cost Multi-Path) hash-based routing leads to inefficient bandwidth utilization because large "elephant flows" from NCCL operations can cause specific paths to become overloaded while others remain underutilized. Test data shows that under standard ECMP, bandwidth utilization reaches only 84.5%, meaning expensive GPU computing power cannot be fully leveraged. When Adaptive Routing is correctly enabled, bandwidth utilization improves to approximately 97%.
In NVIDIA's Spectrum-X Ethernet architecture specifically, the integration between BlueField-3 DPUs and Spectrum-4 switches provides advanced congestion control capabilities. The DPU and switch work together to monitor congestion in real-time, detect hotspots, and dynamically distribute traffic to avoid packet loss and out-of-order delivery. This is essential for NCCL performance because NCCL collective operations like all_reduce are extremely sensitive to network imperfections—packet loss or increased latency directly disrupt GPU synchronization and force retransmissions.
The consistency requirement is critical because Adaptive Routing and Congestion Control must be enabled and configured identically across all fabric elements (switches and DPUs) to function properly. Mismatched settings can lead to partial functionality or instability, manifesting as the bandwidth bottleneck observed in the NCCL test.
Incorrect: A. Ensure that the network cables are painted with a non-conductive coating to prevent static electricity from slowing down the photons.
This is completely incorrect and reflects a fundamental misunderstanding of networking physics. Photons (light particles) are not slowed by static electricity, and network cables do not require non-conductive paint for performance reasons. Data center cables (optical or copper) are designed with proper shielding and materials to meet their specifications. This option has no basis in any NVIDIA certification material or real-world networking practice.
B. Check if the gcc compiler is installed on the switch as the switch needs to recompile the NCCL kernels for every new job.
This is incorrect for multiple reasons. First, NCCL kernels are compiled on the host systems (GPU servers), not on network switches. Network switches run specialized firmware/operating systems and do not compile application code. Second, NCCL kernels do not need recompilation per job—they are compiled once or loaded as binaries. The gcc compiler is a host-side development tool and has no role in switch operation or NCCL runtime execution.
D. Verify that the GPUs are in MIG mode as NCCL requires MIG to be enabled to use the network fabric effectively.
This is incorrect. MIG (Multi-Instance GPU) mode is a feature for partitioning a single GPU into multiple isolated instances, primarily for workload consolidation and multi-tenant scenarios. NCCL does not require MIG mode to use the network fabric effectively. In fact, MIG mode is typically disabled for large-scale training jobs where maximum per-GPU performance is needed. NCCL operates normally regardless of MIG mode settings, and enabling/disabling MIG has no direct relationship to network fabric bandwidth utilization.
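For context on how such a 16-node check is typically launched and read: a hedged sketch, assuming nccl-tests is built under `$NCCL_TESTS` and an MPI hostfile exists (the launch line is commented because it needs live hardware). The bus-bandwidth formula is the one nccl-tests itself documents for all_reduce:

```shell
# Launch sketch (assumptions: nccl-tests built at $NCCL_TESTS, a hostfile
# listing the 16 nodes, one MPI rank per node driving 8 GPUs each):
# mpirun -np 16 --hostfile hosts --map-by ppr:1:node \
#   -x NCCL_DEBUG=INFO \
#   "$NCCL_TESTS/build/all_reduce_perf" -b 8 -e 8G -f 2 -g 8

# Interpreting the output: nccl-tests reports algorithm bandwidth
# (algbw = bytes / time) and bus bandwidth. For all_reduce,
#   busbw = algbw * 2*(n-1)/n
# which approximates per-link traffic; on a healthy fabric with Adaptive
# Routing working, busbw should approach line rate.
n=16
algbw=80   # GB/s, example measurement
busbw=$(awk -v a="$algbw" -v n="$n" 'BEGIN { printf "%.0f", a * 2 * (n - 1) / n }')
echo "busbw = ${busbw} GB/s"   # 80 * 30/16 = 150
```

A busbw far below the port speed, especially one that worsens as node count grows, is the signature of path congestion rather than a single bad link.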
Question 22 of 60
22. Question
A technical engineer is tasked with integrating a third-party high-performance storage array into a newly deployed NVIDIA AI cluster. What is the most critical initial parameter to configure on the storage side to ensure the GPUs can achieve maximum data ingestion rates during training?
Correct
A. Configure Jumbo Frames (MTU 9000) and ensure GDS compatibility
Correct: This follows NVIDIA's best practices for high-performance AI storage.
Jumbo Frames (MTU 9000): Increasing the Maximum Transmission Unit (MTU) reduces the number of packets required to transfer large files, lowering interrupt overhead and increasing effective bandwidth across the storage fabric.
GPUDirect Storage (GDS): GDS is a critical technology that creates a direct DMA (Direct Memory Access) path between storage and GPU memory. By bypassing the host CPU's bounce buffer and system memory, it drastically reduces latency and increases the ingestion rate required for LLM training.
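To put a number on the jumbo-frame benefit, here is a back-of-envelope packet count for a 1 GiB transfer; the 40-byte IP+TCP header figure is a simplification (real overhead varies with encapsulation):

```python
import math

def packets_needed(file_bytes: int, mtu: int, header_overhead: int = 40) -> int:
    """Packets to move a file, assuming ~40 B of IP+TCP headers per packet
    (a simplification; real overhead depends on the encapsulation used)."""
    payload = mtu - header_overhead
    return math.ceil(file_bytes / payload)

gib = 1 << 30
std = packets_needed(gib, 1500)    # standard Ethernet MTU
jumbo = packets_needed(gib, 9000)  # jumbo frames
print(std, jumbo, round(std / jumbo, 1))  # prints 735440 119838 6.1
```

Roughly 6x fewer packets means roughly 6x fewer per-packet interrupts and protocol-processing events on the storage path, which is where the effective-bandwidth gain comes from.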
B. Set storage cooling fans to maximum RPM
Incorrect: While thermal management is vital for the data center, the internal fan speed of a storage array is typically managed automatically by the storage controller's firmware based on internal sensors. Manually setting fans to max RPM is an environmental adjustment that does not impact the logical data ingestion rate or GPU throughput.
C. Enable deduplication and compression
Incorrect: While these features save space, they are computationally expensive and often introduce significant latency and CPU overhead on the storage controller. For high-performance AI training, where data must be fed to GPUs at wire speed, these features are usually disabled on the "hot" training tier to prevent them from becoming an I/O bottleneck.
D. Install the NGC CLI on the storage controller
Incorrect: The NGC CLI is a tool for downloading AI models and containers to compute nodes; it is not designed to run on a storage controller's proprietary OS. Furthermore, storage health monitoring is performed by the array's own management software or via SNMP/Redfish, not through a GPU-centric utility like the NGC CLI.
Question 23 of 60
23. Question
A technician needs to update the NVIDIA DOCA drivers on several BlueField-3 DPUs within a cluster to support new network acceleration features. Which approach ensures the update is performed correctly and that the DPUs are ready to be used by the control plane software like Base Command Manager?
Correct
Correct: B. Use the DOCA Host side drivers and the 'bfupd' utility to update the DPU firmware and software, followed by a cold reboot of the host system to initialize the new drivers.
This option is correct because it accurately describes the standard, supported procedure for updating NVIDIA BlueField-3 DPUs. The update process is fundamentally host-driven and requires specific tools and a reboot to complete successfully.
The update utilizes host-side drivers and the RShim interface, which provides a communication channel between the host server and the DPU's Arm subsystem. The primary tool for this operation is bfb-install (referred to here as 'bfupd'), which is included in the RShim package. This utility pushes the BFB (BlueField Boot) bundle image, containing all necessary DOCA packages, firmware, and the Arm OS, to the DPU over the RShim interface.
Before initiating the update on the DPU, it is mandatory to first update the DOCA components on the host side. The process involves installing or updating the doca-runtime package on the host, which ensures the RShim service is running correctly. The bfb-install command is then executed from the host to flash the new image onto the DPU.
A cold reboot or power cycle of the host system is a critical final step. This is necessary to initialize the new drivers and firmware, and to ensure the DPU is properly recognized and ready for use by the host's operating system and control plane software like Base Command Manager. The documentation specifically notes that after a successful BFB installation, a power cycle on the host must be performed to apply the changes.
Incorrect: A. Download the DOCA SDK on a Windows workstation and use a USB-to-Serial cable to flash the new driver onto each DPU's internal flash memory.
This is incorrect. The DOCA SDK and driver updates for BlueField DPUs are performed from the host Linux server, not from a separate Windows workstation. The primary interface for management and flashing is the RShim interface over PCIe, not a USB-to-Serial cable. The update process uses specific tools like bfb-install to deploy a complete BFB image, not just a "driver" flashed to memory via a serial connection.
C. Uninstall the previous DOCA drivers and then use the NVIDIA Container Toolkit to deploy a 'driver container' that will manage the DPU hardware for the host.
This is incorrect. While the NVIDIA Container Toolkit is used for GPU acceleration in containers, it is not the tool for managing or updating BlueField DPU base firmware and drivers. The DOCA software stack, including the essential RShim driver and bfb-install utility, must be installed on the host operating system via packages (like .deb or .rpm) to perform the initial BFB installation and low-level DPU management. A 'driver container' concept does not replace this host-level installation required to flash the DPU's boot image.
D. The DOCA drivers are part of the standard Linux kernel and will be updated automatically whenever the administrator runs a system-wide 'apt upgrade' on the host.
This is incorrect. Although the host components for managing a BlueField DPU are installed via standard Linux package managers like apt, they are not part of the standard upstream Linux kernel. They are proprietary NVIDIA drivers and libraries provided in a dedicated NVIDIA repository. While an administrator might use apt upgrade after adding this repository, the update is not "automatic" in the sense of being part of a standard system upgrade. Crucially, updating the host packages alone does not update the firmware and software running on the DPU itself. A separate, explicit step using bfb-install is required to flash the new BFB image to the DPU's Arm core.
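The host-driven flow described above can be sketched as a short command sequence. This is a sketch, not a definitive procedure: the BFB filename and the rshim device index are hypothetical, and package names vary by distribution.

```shell
# Sketch of the host-driven BlueField-3 update flow.
# Assumptions: Ubuntu/Debian host, rshim device rshim0, bundle name bf-bundle.bfb.

# 1) Update the host-side DOCA components first.
sudo apt-get update && sudo apt-get install -y doca-runtime

# 2) Confirm the RShim service is running (it provides the host<->Arm channel).
sudo systemctl status rshim

# 3) Push the BFB bundle (DOCA packages, firmware, Arm OS) over RShim.
sudo bfb-install --bfb bf-bundle.bfb --rshim rshim0

# 4) Cold reboot / power cycle the host to initialize the new firmware and drivers.
```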
Question 24 of 60
24. Question
An IT professional is installing the NVIDIA Base Command Manager (BCM) to orchestrate a new AI cluster. After the initial installation, the administrator needs to configure High Availability (HA) for the head node to ensure the cluster remains operational if the primary head node fails. Which component of the BCM architecture must be synchronized between the primary and secondary head nodes to achieve a seamless failover?
Correct
A. The NVIDIA SMI monitoring logs from every compute node must be streamed simultaneously.
Incorrect: While monitoring data is important for long-term telemetry, nvidia-smi logs are transient and do not constitute the "state" of the cluster management system. Streaming these to both nodes is not a prerequisite for a "seamless failover" of the control plane itself.
B. The local /tmp directories must be synchronized in real-time using rsync.
Incorrect: The /tmp directory, by definition, contains temporary files that are not critical to the persistent state of the cluster. Synchronizing these via a cron job is inefficient and does not address the core databases or provisioning images required to manage the AI infrastructure.
C. Only the Slurm configuration file needs to be shared via a common NFS mount.
Incorrect: This is a common misconception. While Slurm is the workload manager, BCM manages the entire infrastructure, including software images, network configurations, and node provisioning states. Sharing only a Slurm .conf file would leave the secondary node without the necessary database records to manage node deployments or hardware health.
D. The cluster database, configuration files, and LDAP/Active Directory settings must be kept in sync.
Correct: According to the NCP-AII latest version and BCM administration guides, a seamless failover requires the Cluster Database (which stores the status and configuration of every node) and the Configuration Files (software images, network settings, and exclusion lists) to be identical on both head nodes. Synchronizing LDAP/Active Directory settings ensures that user authentication remains consistent regardless of which head node is active. BCM typically achieves this through internal replication mechanisms (like MariaDB Galera) and file synchronization (like DRBD or specialized rsync-based services).
Question 25 of 60
25. Question
The final stage of cluster verification involves running a NeMo burn-in test. This test is designed to stress the GPUs, the inter-node fabric, and the storage system simultaneously. If the test fails with a 'Connection Timed Out' error specifically during the checkpoint saving phase, which component should be the primary focus for troubleshooting?
Correct
Correct: B. The storage fabric and the parallel file system configuration, as checkpointing is an I/O intensive operation that tests the storage throughput and latency.
This option is correct because checkpoint saving in NeMo is a highly I/O-intensive operation that directly stresses the storage infrastructure. When a NeMo burn-in test fails specifically during the checkpoint saving phase with a "Connection Timed Out" error, it points to issues with the storage fabric or parallel file system configuration rather than compute or network fabric problems.
Checkpointing in large-scale AI training involves saving complete model snapshots, including model weights, optimizer states, and metadata, across multiple GPUs and nodes. In distributed training, each GPU rank may independently write its checkpoint shard to shared storage, creating massive parallel I/O operations. NeMo implements Fully Parallel Saving (FPS), where each data-parallel rank holds a shard of the optimizer state and independently writes to storage, generating significant throughput demands on the storage system.
A "Connection Timed Out" error during checkpoint saving specifically indicates that the storage system is unable to keep pace with the I/O requests or that there is a connectivity issue between the compute nodes and storage fabric. The storage fabric encompasses the entire data path from GPUs to storage targets, including network switches, storage controllers, and the parallel file system software. When this fabric is misconfigured or underperforming, checkpoint operations that require rapid, large-scale data writes will fail with timeout errors as the training processes wait indefinitely for I/O completion.
NVIDIA Enterprise Reference Architectures emphasize that storage is a critical component validated alongside compute and networking for AI factories. The storage system must be properly integrated with technologies like GPUDirect Storage and NVMe-over-Fabrics to provide the necessary throughput for checkpoint operations.
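To see why this phase hammers storage, a back-of-envelope estimate of checkpoint size and flush time helps. The per-parameter byte counts below are common rules of thumb (bf16 weights plus Adam fp32 master copy and moments), not NeMo-specific measurements:

```python
def checkpoint_footprint_gb(params_billion: float, weight_bytes: int = 2,
                            optim_bytes: int = 12) -> float:
    """Rough checkpoint size in decimal GB: bf16 weights (~2 B/param) plus
    Adam optimizer state (fp32 master copy + two fp32 moments ~= 12 B/param).
    A simplification; real NeMo checkpoints also carry metadata."""
    return params_billion * 1e9 * (weight_bytes + optim_bytes) / 1e9

def write_seconds(size_gb: float, aggregate_bw_gbs: float) -> float:
    """Time to flush a checkpoint at a given aggregate storage bandwidth."""
    return size_gb / aggregate_bw_gbs

size = checkpoint_footprint_gb(70)  # ~980 GB for a 70B-parameter model
print(round(size), round(write_seconds(size, 50.0)))  # prints 980 20
```

Nearly a terabyte flushed in a burst explains the failure mode: if the parallel file system cannot absorb that aggregate write rate, ranks block on I/O until network-level timeouts fire, surfacing as the 'Connection Timed Out' error.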
Incorrect:
A. The VBIOS of the GPUs, as the VBIOS is responsible for the network handshake during the synchronization of large data files.
This is incorrect. GPU VBIOS (Video BIOS) is firmware that initializes the GPU hardware and manages basic GPU functions, but it has no role in network handshakes or data file synchronization. Network communication for checkpoint operations is handled by the host networking stack, NIC firmware (such as BlueField DPUs), and communication libraries like NCCL, not by GPU VBIOS. This option demonstrates a fundamental misunderstanding of GPU architecture and networking responsibilities.
C. The IPMI configuration on the BMC, as the BMC must authorize every data packet that is written to the central storage array.
This is incorrect. The Baseboard Management Controller (BMC) with IPMI (Intelligent Platform Management Interface) is an out-of-band management system used for remote monitoring and control of server hardware (power cycling, temperature monitoring, console access). It operates independently of data plane traffic and does not sit in the path of storage I/O operations. BMCs do not authorize or process data packets for storage writes; storage traffic flows directly through the host's network interfaces and storage fabric without BMC involvement.
D. The cooling fans in the server rack, as the sound of the fans can cause vibrations that interfere with the write head of the NVMe drives.
This is incorrect and based on outdated technology assumptions. NVMe (Non-Volatile Memory Express) drives are solid-state devices with no moving parts or "write heads" like traditional HDDs. They use flash memory and are not susceptible to vibration interference from fan noise. While cooling fans are essential for thermal management in servers (as addressed in previous questions), they have no physical mechanism to impact NVMe drive write operations through sound vibrations. This option incorrectly applies concepts from mechanical hard drive technology to modern solid-state storage.
Incorrect
Correct: B. The storage fabric and the parallel file system configuration, as checkpointing is an I/O intensive operation that tests the storage throughput and latency.
This option is correct because checkpoint saving in NeMo is a highly I/O-intensive operation that directly stresses the storage infrastructure. When a NeMo burn-in test fails specifically during the checkpoint saving phase with a “Connection Timed Out“ error, it points to issues with the storage fabric or parallel file system configuration rather than compute or network fabric problems.
Checkpointing in large-scale AI training involves saving complete model snapshots, including model weights, optimizer states, and metadata, across multiple GPUs and nodes . In distributed training, each GPU rank may independently write its checkpoint shard to shared storage, creating massive parallel I/O operations . NeMo implements Fully Parallel Saving (FPS), where each data-parallel rank holds a shard of the optimizer state and independently writes to storage, generating significant throughput demands on the storage system .
A “Connection Timed Out“ error during checkpoint saving specifically indicates that the storage system is unable to keep pace with the I/O requests or that there is a connectivity issue between the compute nodes and storage fabric. The storage fabric encompasses the entire data path from GPUs to storage targets, including network switches, storage controllers, and the parallel file system software. When this fabric is misconfigured or underperforming, checkpoint operations that require rapid, large-scale data writes will fail with timeout errors as the training processes wait indefinitely for I/O completion.
NVIDIA Enterprise Reference Architectures emphasize that storage is a critical component validated alongside compute and networking for AI factories . The storage system must be properly integrated with technologies like GPUDirect Storage and NVMe-over-Fabrics to provide the necessary throughput for checkpoint operations .
Incorrect:
Correct: B. The storage fabric and the parallel file system configuration, as checkpointing is an I/O intensive operation that tests the storage throughput and latency.
This option is correct because checkpoint saving in NeMo is a highly I/O-intensive operation that directly stresses the storage infrastructure. When a NeMo burn-in test fails specifically during the checkpoint saving phase with a "Connection Timed Out" error, it points to issues with the storage fabric or parallel file system configuration rather than compute or network fabric problems.
Checkpointing in large-scale AI training involves saving complete model snapshots, including model weights, optimizer states, and metadata, across multiple GPUs and nodes. In distributed training, each GPU rank may independently write its checkpoint shard to shared storage, creating massive parallel I/O operations. NeMo implements Fully Parallel Saving (FPS), where each data-parallel rank holds a shard of the optimizer state and independently writes to storage, generating significant throughput demands on the storage system.
A "Connection Timed Out" error during checkpoint saving specifically indicates that the storage system is unable to keep pace with the I/O requests or that there is a connectivity issue between the compute nodes and the storage fabric. The storage fabric encompasses the entire data path from GPUs to storage targets, including network switches, storage controllers, and the parallel file system software. When this fabric is misconfigured or underperforming, checkpoint operations that require rapid, large-scale data writes will fail with timeout errors as the training processes wait indefinitely for I/O completion.
NVIDIA Enterprise Reference Architectures emphasize that storage is a critical component validated alongside compute and networking for AI factories. The storage system must be properly integrated with technologies like GPUDirect Storage and NVMe-over-Fabrics to provide the necessary throughput for checkpoint operations.
Incorrect:
A. The VBIOS of the GPUs, as the VBIOS is responsible for the network handshake during the synchronization of large data files.
This is incorrect. GPU VBIOS (Video BIOS) is firmware that initializes the GPU hardware and manages basic GPU functions, but it has no role in network handshakes or data file synchronization. Network communication for checkpoint operations is handled by the host networking stack, NIC firmware (such as BlueField DPUs), and communication libraries like NCCL, not by GPU VBIOS. This option demonstrates a fundamental misunderstanding of GPU architecture and networking responsibilities.
C. The IPMI configuration on the BMC, as the BMC must authorize every data packet that is written to the central storage array.
This is incorrect. The Baseboard Management Controller (BMC) with IPMI (Intelligent Platform Management Interface) is an out-of-band management system used for remote monitoring and control of server hardware (power cycling, temperature monitoring, console access). It operates independently of data plane traffic and does not sit in the path of storage I/O operations. BMCs do not authorize or process data packets for storage writes; storage traffic flows directly through the host's network interfaces and storage fabric without BMC involvement.
D. The cooling fans in the server rack, as the sound of the fans can cause vibrations that interfere with the write head of the NVMe drives.
This is incorrect and based on outdated technology assumptions. NVMe (Non-Volatile Memory Express) drives are solid-state devices with no moving parts or "write heads" like traditional HDDs. They use flash memory and are not susceptible to vibration interference from fan noise. While cooling fans are essential for thermal management in servers (as addressed in previous questions), they have no physical mechanism to impact NVMe drive write operations through sound vibrations. This option incorrectly applies concepts from mechanical hard drive technology to modern solid-state storage.
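The fully parallel saving pattern described above can be sketched in a few lines: every rank writes its own shard concurrently, which is precisely the I/O burst that lands on the storage fabric during checkpointing. This is an illustrative simulation with hypothetical helper names (save_shard, save_checkpoint) writing JSON shards to a local temp directory, not NeMo's actual API; on a real cluster these writes would target the shared parallel file system.

```python
import concurrent.futures
import json
import os
import tempfile

def save_shard(ckpt_dir, rank, shard):
    """Write one rank's checkpoint shard to the checkpoint directory."""
    path = os.path.join(ckpt_dir, f"shard_rank{rank}.json")
    with open(path, "w") as f:
        json.dump(shard, f)
    return path

def save_checkpoint(ckpt_dir, world_size):
    """All ranks write their shards concurrently, concentrating the
    I/O demand on the storage system at checkpoint time."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=world_size) as ex:
        futures = [
            ex.submit(save_shard, ckpt_dir, rank,
                      {"rank": rank, "optimizer_state": [0.0] * 4})
            for rank in range(world_size)
        ]
        return sorted(f.result() for f in futures)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(len(save_checkpoint(d, world_size=8)))  # 8
```

If the shared file system cannot absorb this burst, each rank's write blocks, and the training framework eventually surfaces a timeout such as the "Connection Timed Out" error in the scenario.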
Question 26 of 60
26. Question
An AI training job is failing with "GPU has fallen off the bus" errors. After checking the logs, the administrator sees numerous PCIe correctable errors before the failure. What is the most appropriate troubleshooting step for this hardware fault according to NVIDIA best practices?
A. Update the NGC CLI to the latest version
Incorrect: The NGC CLI is a software tool used to pull containers, models, and datasets from the NVIDIA NGC catalog. It has no direct interaction with the physical PCIe bus or the electrical signaling between the GPU and the motherboard. Updating it will not resolve hardware signal integrity issues.
B. Inspect the physical GPU seating and clean the PCIe gold fingers
Correct: According to NVIDIA best practices and the NCP-AII curriculum, a stream of PCIe correctable errors is a primary indicator of degraded signal integrity. These errors occur when the hardware can successfully retry a failed packet transmission, but if the frequency is high, the link eventually collapses, leading to an XID 79 (GPU fallen off the bus) error.
Action: Physically reseating the GPU ensures it is properly aligned in the slot.
Cleaning: Cleaning the "gold fingers" (contacts) with isopropyl alcohol removes oxidation or debris that may be causing electrical resistance or intermittent connectivity.
C. Increase the Slurm job timeout value
Incorrect: Slurm timeouts manage how long a job is allowed to run before being killed by the scheduler. While increasing a timeout might keep a "stalled" job in the queue longer, it does not fix the underlying hardware failure that causes the GPU to stop communicating with the OS.
D. Reinstall the Pyxis plugin for Slurm
Incorrect: Pyxis is a Slurm plugin used to facilitate unprivileged container execution via Enroot. While it is essential for job orchestration, it operates at the software middleware layer. It cannot fix a physical bus disconnection or electrical fault reported by the NVIDIA driver.
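A quick way to triage this failure pattern is to scan the kernel log for NVIDIA Xid events that accompany the PCIe correctable errors. The sketch below parses dmesg-style lines for Xid codes; the helper names and the sample line are illustrative (modeled on the driver's documented Xid log format, with a hypothetical bus address), and the code-to-meaning table covers only a few entries from NVIDIA's Xid reference.

```python
import re

# A few known Xid codes; NVIDIA's Xid errors reference documents the full list.
XID_MEANINGS = {
    62: "Internal micro-controller halt",
    74: "NVLink error",
    79: "GPU has fallen off the bus",
}

# Matches lines of the form: NVRM: Xid (PCI:0000:41:00): 79, ...
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bdf>[0-9a-fA-F:.]+)\): (?P<code>\d+)")

def scan_dmesg(lines):
    """Return (bus_address, xid_code, meaning) for each Xid line found."""
    hits = []
    for line in lines:
        m = XID_RE.search(line)
        if m:
            code = int(m.group("code"))
            hits.append((m.group("bdf"), code,
                         XID_MEANINGS.get(code, "see Xid reference")))
    return hits

# Hypothetical sample line shaped like the driver's Xid output.
sample = [
    "[1234.5] NVRM: Xid (PCI:0000:41:00): 79, pid=2211, GPU has fallen off the bus.",
]
print(scan_dmesg(sample))
```

Seeing Xid 79 preceded by a stream of AER correctable errors on the same bus address is the signature that points to the physical reseating and contact-cleaning step described above.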
Question 27 of 60
27. Question
A data scientist wants to maximize the utilization of an NVIDIA H100 GPU by running multiple small inference workloads simultaneously. To ensure that each workload has dedicated and isolated hardware resources, the administrator decides to configure Multi-Instance GPU (MIG). What is a fundamental requirement and a key characteristic of a MIG-enabled GPU in this scenario?
Correct: B. Each MIG instance provides a full set of hardware resources, including isolated compute (SMs) and memory, ensuring that one workload cannot impact the performance of another.
This option is correct because it accurately describes the fundamental architecture and key benefit of Multi-Instance GPU (MIG) technology. MIG was introduced with NVIDIA's Ampere architecture and is supported on Hopper and Blackwell GPUs, including the H100. The technology enables partitioning a single physical GPU into multiple isolated instances, each with dedicated hardware resources.
The hardware-level isolation in MIG provides several critical characteristics:
Dedicated Compute Resources: Each MIG instance receives dedicated Streaming Multiprocessors (SMs). On an H100 GPU, the available SMs are divided evenly into 7 compute slices. When an instance is created, it is allocated specific SM slices that are exclusively reserved for that instance.
Isolated Memory: Each MIG instance has dedicated high-bandwidth memory slices with guaranteed bandwidth. The H100 has 8 memory slices, each with 10GB of VRAM and an eighth of the total memory bandwidth. A MIG instance receives specific memory slices that are not shared with other instances.
Complete Hardware Path Isolation: MIG provides isolation beyond just compute cores and memory. Each instance has dedicated L2 cache banks, memory controllers, DRAM address buses, and on-chip crossbar ports. This ensures that even when other instances are heavily utilizing their resources, one instance's performance remains unaffected.
Parallel Execution with Guaranteed QoS: Unlike time-slicing, where workloads compete for the same resources and execute serially, MIG enables true parallel execution with guaranteed quality of service (QoS). Workloads on different MIG instances run simultaneously on the same physical GPU without competing for shared resources.
Fault Isolation: Hardware-level isolation extends to fault containment. A failure or crash in one MIG instance does not affect applications running in other instances on the same physical GPU.
This architecture makes MIG ideal for the scenario described: running multiple small inference workloads simultaneously on an H100 GPU. Each workload receives dedicated, isolated hardware resources with predictable performance, maximizing utilization while maintaining performance isolation between tenants.
Incorrect:
A. MIG mode can only be enabled if the system is using third-party storage that supports the GPUDirect Storage protocol for all instances.
This is incorrect. MIG is a GPU partitioning technology that operates independently of storage configurations. GPUDirect Storage (GDS) is a separate technology that enables direct data paths between storage and GPU memory, bypassing the CPU. While GDS can be used with MIG-backed vGPUs on supported configurations, it is not a requirement for enabling MIG mode. MIG can be enabled and used with any standard storage solution that meets the application's I/O requirements.
C. Enabling MIG allows the GPU to share its L2 cache across all instances to increase the hit rate for large-scale training jobs.
This is incorrect and contradicts the fundamental design of MIG. Rather than sharing L2 cache across instances, MIG partitions the L2 cache and assigns dedicated cache banks to each instance. This isolation ensures that one workload's cache usage cannot evict or interfere with another workload's cached data. The L2 cache is explicitly NOT shared; it is divided among instances to provide performance isolation and predictable QoS. Large-scale training jobs would typically use full GPUs or large MIG instances, not rely on cross-instance cache sharing.
D. MIG requires the GPU to be in 'Prohibited Mode' so that the Linux kernel can manage the memory allocation for each instance.
This is incorrect. There is no "Prohibited Mode" in MIG terminology or requirements. MIG is configured and managed through standard NVIDIA tools, primarily the nvidia-smi command-line utility. Administrators enable MIG mode on a per-GPU basis using commands like nvidia-smi -i 0 -mig 1 followed by a GPU reset. Memory allocation for MIG instances is handled by the NVIDIA driver and hardware, not by the Linux kernel managing allocations through a special mode. The GPU's memory controller hardware enforces the isolation boundaries between instances.
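The slice accounting behind MIG partitioning can be illustrated with a small helper. The profile names below are the standard H100 80GB MIG profiles, but the flat 7-slice budget is a deliberate simplification: the real placement rules in NVIDIA's MIG user guide also constrain where each profile may be placed, so treat fits_on_gpu as a hypothetical sketch rather than a validator.

```python
# Compute-slice cost of common H100 80GB MIG profiles (profile names are
# real; the simple additive budget below is a simplification).
PROFILE_SLICES = {
    "1g.10gb": 1,
    "2g.20gb": 2,
    "3g.40gb": 3,
    "4g.40gb": 4,
    "7g.80gb": 7,
}

def fits_on_gpu(requested, budget=7):
    """Check whether a set of MIG profiles fits the 7-compute-slice budget."""
    needed = sum(PROFILE_SLICES[p] for p in requested)
    return needed <= budget

print(fits_on_gpu(["3g.40gb", "3g.40gb"]))             # True: 6 of 7 slices
print(fits_on_gpu(["4g.40gb", "3g.40gb", "1g.10gb"]))  # False: 8 > 7
```

In practice the administrator would enable MIG with nvidia-smi -i 0 -mig 1 and then create the chosen instances with nvidia-smi mig; the sketch only shows why some profile combinations cannot coexist on one GPU.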
Question 28 of 60
28. Question
As part of the cluster verification process, an engineer is running the NVIDIA Collective Communications Library (NCCL) tests. What is the primary purpose of the NCCL 'all-reduce' benchmark when validating a newly deployed AI cluster with multiple HGX H100 nodes?
A. To verify that the Slurm scheduler can count SATA boot drives
Incorrect: Slurm manages job allocation and resource counting (CPUs, GPUs, Memory). However, the NCCL all-reduce benchmark is a communication primitive used during training; it has no functional relationship with local SATA boot drives or storage hardware discovery.
B. To test the power supply redundancy by synchronizing power draw
Incorrect: While NCCL tests do increase power draw because they engage the GPUs, they are not designed as power supply unit (PSU) tests. Forcing synchronization to a "microsecond" during a Power-On Self-Test (POST) is not a feature of NCCL, nor is it how hardware redundancy is validated in an AI factory.
C. To measure peak bandwidth and latency of GPU-to-GPU communication
Correct: This is the primary purpose of NCCL tests in the NCP-AII curriculum.
All-Reduce: This is the most common communication pattern in Distributed Data Parallel (DDP) training. Every GPU shares its gradients with every other GPU.
Validation: The benchmark measures the "Bus Bandwidth" and "Algorithm Bandwidth." If the results are significantly lower than the hardware capability (e.g., failing to reach near-line-rate on 400Gb/s InfiniBand), it indicates issues with NVLink, InfiniBand/RoCE configuration, or topology detection.
D. To calibrate the signal quality of the BMC's management network
Incorrect: The BMC (Baseboard Management Controller) operates on a low-speed Out-of-Band (OOB) network (typically 1GbE). NCCL runs on the High-Speed Fabric (InfiniBand or 400GbE) and the internal NVLink mesh. NCCL never utilizes the BMC management network for data transfer.
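The relationship between the two bandwidth figures reported by nccl-tests can be made concrete. For all-reduce, the nccl-tests performance notes define bus bandwidth as algorithm bandwidth scaled by 2*(n-1)/n, which normalizes the result so it can be compared directly against the link's line rate:

```python
def allreduce_busbw(algbw_gbps, n_ranks):
    """Convert all-reduce algorithm bandwidth (GB/s) to bus bandwidth
    using the 2*(n-1)/n factor from the nccl-tests performance notes."""
    return algbw_gbps * 2 * (n_ranks - 1) / n_ranks

# With 8 GPUs the correction factor is 2*7/8 = 1.75.
print(allreduce_busbw(100.0, 8))  # 175.0
```

During verification, the engineer compares this bus bandwidth against the fabric's expected line rate; a large shortfall points at NVLink, InfiniBand/RoCE configuration, or topology detection, as described above.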
Question 29 of 60
29. Question
When designing the network topology for a large-scale AI factory utilizing NVIDIA Quantum-2 InfiniBand switches, an architect must decide on the appropriate cabling and transceiver types to minimize signal degradation and latency. If the distance between the leaf switches and the spine switches is approximately one hundred and fifty meters, which cabling solution should be implemented to ensure the highest signal quality and reliability for E/W fabric traffic?
A. Direct Attach Copper (DAC) cables
Incorrect: DAC cables are the preferred solution for intra-rack connections (server to leaf switch) because they offer the lowest latency and power consumption. However, for NDR (400G), passive DACs are limited to a maximum reach of approximately 1.5 to 2.5 meters. They cannot physically support a 150-meter run between leaf and spine switches.
B. Multi-mode Fiber (MMF) with optical transceivers
Correct: In an AI factory topology, connections exceeding 5–10 meters require optical solutions.
Multi-mode Fiber (MMF): Specifically using OM4 or OM5 fiber with SR4/SR8 transceivers (OSFP or QSFP-DD) can support distances up to 50–100 meters in standard NDR configurations.
Note on the "150m" Scenario: While standard multi-mode NDR transceivers typically reach 50m, the NCP-AII certification emphasizes that for distances beyond copper limits, fiber optics (multi-mode or single-mode) is the only viable architecture. In practice, a strict 150m requirement at 400G often technically requires Single-mode Fiber (SMF), but among these specific choices, the transition to fiber with optical transceivers (Option B) is the only professionally correct architectural move compared to copper or Ethernet.
C. Category 6A Ethernet cabling
Incorrect: Category 6A (Cat6A) is designed for 10GBASE-T Ethernet. It cannot support the 400Gb/s PAM4 signaling used in InfiniBand NDR. AI fabrics use Twinaxial copper or Fiber optics exclusively for data-plane traffic; twisted-pair copper is relegated only to the 1GbE/10GbE management network.
D. Active Copper Cables (ACC)
Incorrect: ACCs include linear redrivers to extend the reach of copper beyond passive DACs. However, even with signal conditioning, NDR 400G ACCs are limited to approximately 3 to 5 meters. They are used for "top-of-rack" or "adjacent-rack" cabling, not for the 150-meter runs required in large-scale spine-leaf architectures.
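The selection logic across these options reduces to a reach table. The sketch below encodes the approximate reach figures quoted in this explanation (ballpark values for NDR 400G media, not spec limits; the 2000m SMF figure is a stand-in for "long reach") and picks the shortest-reach medium that still covers a given run:

```python
# Rough reach limits in meters for NDR 400G media, ordered shortest first.
# Values are approximations taken from the discussion above.
MEDIA_REACH = [
    ("Passive DAC", 2.5),
    ("Active Copper (ACC)", 5.0),
    ("Multi-mode fiber (MMF, OM4/OM5)", 100.0),
    ("Single-mode fiber (SMF)", 2000.0),
]

def pick_media(distance_m):
    """Return the shortest-reach medium that still covers the run."""
    for name, reach in MEDIA_REACH:
        if distance_m <= reach:
            return name
    raise ValueError("distance exceeds all listed media")

print(pick_media(1.0))    # Passive DAC (intra-rack)
print(pick_media(150.0))  # Single-mode fiber (SMF)
```

For the 150-meter leaf-to-spine run in the question, the table lands on fiber, consistent with the note that a strict 150m requirement at 400G typically pushes past multi-mode reach toward SMF.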
Question 30 of 60
30. Question
An administrator needs to configure Multi-Instance GPU (MIG) on an NVIDIA H100 GPU to support multiple small AI inference workloads. The goal is to provide hardware-level isolation for both compute and memory resources. Which command sequence and validation method are correct for partitioning a single GPU into several instances and verifying the resource allocation?
Correct
Correct: B. Enable MIG mode using ‘nvidia-smi -i 0 -mig 1’, then create GPU instances and compute instances, and verify using ‘nvidia-smi mig -lgip’.
This option is correct because it accurately describes the standard, supported command sequence for enabling and configuring Multi-Instance GPU (MIG) on NVIDIA data center GPUs such as the H100, followed by the proper validation method.
The process begins with enabling MIG mode on the target GPU using the command nvidia-smi -i 0 -mig 1. This command sets the GPU into a state where partitioning is possible. Depending on the GPU model and hypervisor, a GPU reset or system reboot may be required after this step to apply the change.
Once MIG mode is enabled, the administrator must create the actual partitions. This is done by creating GPU Instances (GIs) and Compute Instances (CIs). The nvidia-smi mig -cgi command, often used with the -C flag to create a default compute instance, performs this allocation based on predefined profiles. This step physically slices the GPU's resources, including Streaming Multiprocessors (SMs), memory, and L2 cache, into isolated instances.
To verify the available partitioning options and confirm the successful creation of instances, the administrator uses nvidia-smi mig -lgip. This command lists the GPU instance profiles, showing the possible configurations (e.g., “1g.10gb”, “2g.20gb”) and how many instances of each profile can be created on the GPU. After creation, other commands like nvidia-smi -L can be used to list the new MIG devices and their UUIDs for workload assignment.
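The steps above can be sketched as a short shell session. This is a hedged illustration, not an exam answer: the GPU instance profile ID (19) is an assumption for this sketch and must be taken from the -lgip output on your own GPU.

```shell
# Sketch of the MIG workflow described above. The profile ID (19) is
# illustrative only; read the valid IDs for your GPU in step 2.

# 1. Enable MIG mode on GPU 0 (a GPU reset or reboot may be required)
sudo nvidia-smi -i 0 -mig 1

# 2. List the available GPU instance profiles (e.g. 1g.10gb, 2g.20gb)
sudo nvidia-smi mig -lgip

# 3. Create a GPU instance from a chosen profile, plus a default
#    compute instance inside it (-C)
sudo nvidia-smi mig -i 0 -cgi 19 -C

# 4. Verify: list the resulting MIG devices and their UUIDs
nvidia-smi -L
```

The created partitions can additionally be inspected with nvidia-smi mig -lgi (GPU instances) and nvidia-smi mig -lci (compute instances).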
Incorrect:
A. Enable MIG in the system BIOS, then use the Slurm ‘scontrol’ command to physically slice the GPU silicon into separate voltage domains.
This is incorrect. While enabling certain BIOS features like SR-IOV and IOMMU may be prerequisites for virtualization scenarios involving MIG-backed vGPUs, MIG itself is enabled and configured via NVIDIA software tools, not through the system BIOS. Furthermore, Slurm is a workload manager for job scheduling, and its scontrol command is used for managing Slurm entities (jobs, nodes, partitions), not for “physically slicing” GPU hardware. The physical partitioning of the GPU's silicon resources is performed by the NVIDIA driver and hardware firmware when the nvidia-smi mig commands are executed.
C. Install the DOCA drivers on the GPU and use the ‘doca-mig’ utility to allocate virtual memory LUNs for each individual CUDA core group.
This is incorrect. DOCA (Data Center-on-a-Chip Architecture) is a framework for programming NVIDIA BlueField DPUs (Data Processing Units), not for managing GPU features like MIG. There is no ‘doca-mig’ utility for GPU partitioning. MIG configuration is performed exclusively through the nvidia-smi tool, which is part of the standard NVIDIA GPU driver package.
D. Use the ‘ngc config’ command to select a MIG profile from the cloud, then use the ‘ibstatus’ command to verify the internal NVLink partitioning.
This is incorrect. The ngc config command is used to configure the NVIDIA NGC CLI tool for accessing NVIDIA's software catalog (containers, models, etc.), not for configuring local GPU hardware features. The ibstatus command is a utility for checking the status of InfiniBand network interfaces, not for verifying GPU partitioning. MIG instances are verified using nvidia-smi commands. Additionally, NVLink is a high-speed interconnect for GPU-to-GPU communication; its partitioning is a separate concept from MIG.
Question 31 of 60
31. Question
When designing the network topology for a large-scale AI factory, an architect must decide between various cable types and transceivers to support high-bandwidth East-West traffic. The requirement is to support 400Gb/s speeds over a distance of 50 meters within the same row of racks. Which combination of cabling and transceiver technology provides the most cost-effective and reliable solution for this specific distance and bandwidth requirement?
Correct
Option A: AOCs (Active Optical Cables): These consist of a multimode fiber cable terminated with optical transceivers at both ends as a single, factory-sealed unit.
The 50m Requirement: In the NVIDIA networking matrix, DAC cables are generally limited to very short distances (typically up to 2-3 meters for 400G). For distances between 5 meters and 100 meters, AOCs are the “sweet spot.”
Reliability & Cost: Because the transceivers are permanently attached, there are no optical connectors to clean or misalign, which reduces the Bit Error Rate (BER). They are significantly more cost-effective than buying two discrete transceivers and a separate fiber spool for “middle-range” distances like 50 meters.
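As a practical follow-up to the AOC discussion, once the cables are installed the negotiated link rate can be read from the host to confirm each HCA actually trained at NDR speed. A minimal sketch, assuming an adapter named mlx5_0 (enumerate your own devices first):

```shell
# Verify that the InfiniBand link came up at the expected NDR rate
# after cabling. The HCA name (mlx5_0) is an assumption for this
# sketch; list the devices present on the host with 'ibstat -l'.
ibstat mlx5_0 | grep -E 'State|Rate'
# A healthy 400G NDR link should report "State: Active" and "Rate: 400".
```

A port stuck at a lower rate (e.g. 200 or 100) or in the Polling state is a common symptom of a damaged or mismatched cable.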
Analysis of Incorrect Options
Option B: (Incorrect) The Error: Single-mode fiber (SMF) and LR4 (Long Reach) transceivers are designed for distances up to 10 kilometers. While they would technically work over 50 meters, they are prohibitively expensive for “intra-row” or “same-row” connections. Using them for such a short distance is considered an over-engineered and financially inefficient design for an AI cluster.
Option C: (Incorrect) The Error: Passive DAC (Direct Attach Copper) cables have a very limited range at high speeds. At 400Gb/s (NDR), signal attenuation in copper is so high that DACs are typically capped at 2 to 3 meters. They cannot physically support a 50-meter span. For distances exceeding 3-5 meters at these speeds, optical technology (AOC or Fiber) is mandatory.
Option D: (Incorrect) The Error: Cat 6A and RJ45 connectors are standard for 10GbE enterprise office networking, but they are entirely incompatible with high-performance AI fabrics like InfiniBand or 400GbE. AI clusters require QSFP-DD or OSFP form factors to achieve 400Gb/s; RJ45 cannot support the frequency or bandwidth required for modern AI East-West traffic.
Question 32 of 60
32. Question
When configuring an AI factory‘s physical layer, an administrator must ensure that the BlueField DPU is correctly integrated into the InfiniBand fabric. What is a critical step in the physical layer management of the DPU to enable high-speed RDMA communication between the GPU and the network without traversing the host CPU‘s memory?
Correct
Correct: D. Enabling GPUDirect RDMA in the DPU and Host settings.
This is correct because GPUDirect RDMA is specifically designed to enable direct data exchange between GPUs and third-party peer devices such as NVIDIA BlueField DPUs using PCI Express.
The technology allows efficient, zero-copy data transfers between GPUs using the hardware engines in the BlueField and ConnectX ASICs, completely bypassing the host CPU and system memory.
By enabling GPUDirect RDMA, data can move directly between GPU memory and the network fabric without traversing the host CPU's memory, which is essential for achieving the low-latency, high-bandwidth communication required for AI workloads.
The NCP-AII certification blueprint includes “NVIDIA DOCA driver installation and updates” and working with GPUDirect technologies as core tasks within the Physical Layer Management domain.
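On the host side, enabling this path typically comes down to loading the peer-memory kernel module that ships with recent NVIDIA drivers. A hedged sketch (the GPU index and HCA name in the optional test are assumptions):

```shell
# GPUDirect RDMA relies on the nvidia-peermem module, which lets the
# ConnectX/BlueField HCA DMA directly into GPU memory.
sudo modprobe nvidia-peermem
lsmod | grep nvidia_peermem   # module should appear if loading worked

# Optional end-to-end check using perftest built with CUDA support
# (device name mlx5_0 and GPU index 0 are assumptions):
#   server: ib_write_bw -d mlx5_0 --use_cuda=0
#   client: ib_write_bw -d mlx5_0 --use_cuda=0 <server_ip>
```

If the module is missing, the HCA silently falls back to staging transfers through host memory, which shows up as reduced bandwidth rather than an outright error.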
Incorrect: A. Configuring the DPU to act as a primary NTP server.
This is incorrect because NTP (Network Time Protocol) servers are used for time synchronization across network devices and have no role in enabling RDMA communication or GPU-to-network direct data paths. Time synchronization is a separate management function unrelated to data plane acceleration.
B. Setting the GPU to use the Legacy VGA BIOS mode.
This is incorrect because Legacy VGA BIOS mode is related to display output and GPU initialization for graphics purposes, not for high-speed RDMA communication. Modern data center GPUs like H100 are designed for compute workloads, not video output, and this setting has no impact on RDMA functionality.
C. Disabling the PCIe Gen5 lanes in the server BIOS.
This is incorrect because disabling PCIe Gen5 lanes would reduce the available bandwidth between the GPU, DPU, and other components, making it harder—not easier—to achieve high-speed RDMA communication. GPUDirect RDMA relies on high-bandwidth PCIe connectivity to enable direct data transfers, so reducing PCIe capability would be counterproductive.
Question 33 of 60
33. Question
In an AI infrastructure utilizing the BlueField-3 DPU, an administrator needs to offload network services to the DPU to free up host CPU cycles. Which action is required to properly configure the BlueField platform for high-performance networking tasks in an AI fabric?
Correct
Correct: C. Set the BlueField-3 to DPU mode (versus NIC mode) and configure the DOCA drivers to enable hardware-accelerated OVS or RDMA offloads.
This is correct because DPU Mode (also known as embedded CPU function ownership or ECPF mode) is the default mode for BlueField DPU SKUs where the embedded Arm system controls the NIC resources and data path independently of the host x86 CPU.
In DPU Mode, the NIC resources and functionality are owned and controlled by the embedded Arm subsystem, with the Arm cores running services that manage the NIC resources and data path, including the hardware eSwitch.
The DOCA (Data Center-on-a-Chip Architecture) software stack provides the necessary drivers and libraries to program the DPU's hardware accelerators for networking, storage, and security offloading.
The NCP-AII certification blueprint explicitly includes “Configure and manage a BlueField® network platform” as a core task within the Physical Layer Management domain and “NVIDIA DOCA driver installation and updates” within the Control Plane Installation and Configuration domain.
Once in DPU Mode with DOCA drivers properly configured, hardware-accelerated OVS (Open Virtual Switch) and RDMA offloads can be enabled, allowing the DPU to handle network services without consuming host CPU cycles.
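A hedged sketch of how this looks in practice. The MST device path below is an assumption matching a typical BlueField-3 and must be read from mst status on your own system; the OVS service name may also differ by distribution.

```shell
# 1. Confirm the card is operating in DPU (embedded CPU) mode rather
#    than NIC mode. The device path is an assumption for this sketch;
#    discover yours with 'mst status' after starting the MST service.
sudo mst start
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 query | grep INTERNAL_CPU_MODEL
#    INTERNAL_CPU_MODEL = EMBEDDED_CPU(1) indicates DPU mode.

# 2. On the DPU's Arm side, enable hardware-offloaded OVS so flows are
#    processed by the eSwitch instead of consuming CPU cycles.
sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
sudo systemctl restart openvswitch-switch
```

With hw-offload active, established flows are programmed into the eSwitch and only first packets of new flows traverse the Arm cores.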
Incorrect: A. Disable the PCIe connection to the host to ensure the DPU has exclusive access to the high-speed network fabric.
This is incorrect because disabling the PCIe connection would sever the communication path between the DPU and the host, preventing the DPU from providing any networking services to the host. The DPU must maintain PCIe connectivity to the host to function as a network interface and offload engine while processing traffic independently.
B. Install the NVIDIA Container Toolkit on the DPU and run all AI training jobs directly within the DPU ARM cores.
This is incorrect because the NVIDIA Container Toolkit is designed for enabling GPU access within containers on the host, not for DPU ARM core access. The ARM cores in BlueField DPUs are designed for infrastructure services (networking, storage, security), not for running AI training workloads, which require GPU compute capabilities.
D. Manually flash the DPU firmware using a USB drive connected to the front panel of the server during the OS installation.
This is incorrect because DPU firmware and software (BFB bundle) are typically provisioned through the DPU's management interfaces (BMC, OOB network) or automated deployment tools, not via USB drives connected to server front panels during OS installation. The certification emphasizes proper firmware management through standard tools and procedures, not manual USB flashing.
Incorrect
Correct: C. Set the BlueField-3 to DPU mode (versus NIC mode) and configure the DOCA drivers to enable hardware-accelerated OVS or RDMA offloads.
This is correct because DPU Mode (also known as embedded CPU function ownership or ECPF mode) is the default mode for BlueField DPU SKUs where the embedded Arm system controls the NIC resources and data path independently of the host x86 CPU .
In DPU Mode, the NIC resources and functionality are owned and controlled by the embedded Arm subsystem, with the Arm cores running services that manage the NIC resources and data path, including the hardware eSwitch .
The DOCA (Data Center-on-a-Chip Architecture) software stack provides the necessary drivers and libraries to program the DPU‘s hardware accelerators for networking, storage, and security offloading .
The NCP-AII certification blueprint explicitly includes “Configure and manage a BlueField® network platform“ as a core task within the Physical Layer Management domain and “NVIDIA DOCA driver installation and updates“ within the Control Plane Installation and Configuration domain .
Once in DPU Mode with DOCA drivers properly configured, hardware-accelerated OVS (Open Virtual Switch) and RDMA offloads can be enabled, allowing the DPU to handle network services without consuming host CPU cycles .
Incorrect: A. Disable the PCIe connection to the host to ensure the DPU has exclusive access to the high-speed network fabric.
This is incorrect because disabling the PCIe connection would sever the communication path between the DPU and the host, preventing the DPU from providing any networking services to the host. The DPU must maintain PCIe connectivity to the host to function as a network interface and offload engine while processing traffic independently .
B. Install the NVIDIA Container Toolkit on the DPU and run all AI training jobs directly within the DPU ARM cores.
This is incorrect because the NVIDIA Container Toolkit is designed for enabling GPU access within containers on the host, not for DPU ARM core access . The ARM cores in BlueField DPUs are designed for infrastructure services (networking, storage, security), not for running AI training workloads, which require GPU compute capabilities.
D. Manually flash the DPU firmware using a USB drive connected to the front panel of the server during the OS installation.
This is incorrect because DPU firmware and software (BFB bundle) are typically provisioned through the DPU‘s management interfaces (BMC, OOB network) or automated deployment tools, not via USB drives connected to server front panels during OS installation. The certification emphasizes proper firmware management through standard tools and procedures, not manual USB flashing .
Question 34 of 60
34. Question
An architect is designing a multi-tenant AI environment where resources must be strictly isolated between different research teams. They decide to implement Multi-Instance GPU (MIG) on NVIDIA H100 GPUs. Which of the following statements correctly describes the configuration of MIG and the role of the BlueField network platform in this scenario?
Correct
Option B: MIG (Multi-Instance GPU): On the H100 (Hopper architecture), MIG allows a single physical GPU to be carved into up to seven independent GPU instances. Each instance has its own dedicated high-bandwidth memory, cache, and compute cores, providing true hardware-level isolation (Quality of Service) between tenants.
BlueField DPU: In a multi-tenant environment, the BlueField DPU acts as the "coprocessor" for the infrastructure. It offloads networking (OVS), security (firewalls/encryption), and storage tasks from the host CPU. This ensures that while the GPU is partitioned by MIG, the data traffic going to those partitions is managed and secured by the DPU.
Option A: (Incorrect) The Error: MIG is not a thermal management tool; it is a resource partitioning technology. While reducing clock speeds (throttling) can manage heat, that is the role of the GPU's internal power management systems, not MIG. Furthermore, HPL (High-Performance Linpack) is a benchmarking suite, not a core function of the BlueField platform.
Option C: (Incorrect) The Error: This option describes the exact opposite of MIG. Combining multiple physical GPUs into one virtual instance is a function of technologies like NVLink and NVSwitch (or software-defined approaches like vGPU aggregation), whereas MIG subdivides a single GPU. Additionally, BlueField does not handle the physical NVLink switching; that is the role of the NVSwitch hardware.
Option D: (Incorrect) The Error: This option contains "alphabet soup" that misrepresents the technology stack.
DOCA is the SDK for programming the BlueField DPU, but it is not installed on the BMC (Baseboard Management Controller) to partition GPU memory.
MIG does not create LUNs (Logical Unit Numbers); LUNs are a storage networking concept (SAN), while MIG creates GPU compute/memory instances.
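The capacity arithmetic behind MIG can be sketched in a few lines. The profile names and sizes below follow NVIDIA's published MIG geometry for an 80 GB H100; note that real MIG placement rules are stricter than this simple slice-and-memory sum (instances must land on valid placements), so treat this as illustrative bookkeeping only.

```python
# Illustrative MIG bookkeeping for one H100 80GB: 7 compute slices, 80 GB HBM.
# Profile table is an assumption drawn from NVIDIA's published MIG profiles.
profiles = {"1g.10gb": (1, 10), "2g.20gb": (2, 20), "3g.40gb": (3, 40),
            "4g.40gb": (4, 40), "7g.80gb": (7, 80)}

def fits(requested):
    """Rough check that a list of profile names fits a single GPU."""
    slices = sum(profiles[p][0] for p in requested)
    mem = sum(profiles[p][1] for p in requested)
    return slices <= 7 and mem <= 80

print(fits(["1g.10gb"] * 7))          # True: the maximum of seven instances
print(fits(["4g.40gb", "3g.40gb"]))   # True: a common 4g + 3g split
print(fits(["7g.80gb", "1g.10gb"]))   # False: exceeds the 7-slice budget
```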
Question 35 of 60
35. Question
A technician is installing a new NVIDIA HGX system and needs to validate that the physical installation of the GPUs and the associated thermal solutions are correct. After powering on the system, they observe that the BMC logs show a Critical Over Temperature warning for the GPU baseboard, even though no workloads are running. What is the most likely cause that should be investigated first?
Correct
Option D: The Logic: An HGX baseboard (containing 4 or 8 H100/A100 GPUs) generates significant heat even at idle. These systems rely on high-velocity, directed airflow provided by the chassis fans.
Air Shrouds/Baffles: These components are critical for "air steering." Without them, the cooling air follows the path of least resistance (usually around the sides of the GPU complex) rather than being forced through the dense fins of the GPU heat sinks.
The Symptom: If shrouds are missing, the GPUs will experience "thermal soak" almost immediately upon power-up, leading to a Critical Over Temperature warning in the BMC (Baseboard Management Controller) even without a computational load.
Analysis of Incorrect Options
Option A: (Incorrect) The Error: While it is a common myth that GPUs "run wild" without drivers, modern NVIDIA data center GPUs are designed to enter a low-power idle state by default at the hardware/firmware level. Even if they stayed in a high-power state, a properly cooled HGX system would not reach "Critical" temperatures at idle just because drivers were missing. Drivers are required for workload management, not for basic thermal safety.
Option B: (Incorrect) The Error: While Electromagnetic Interference (EMI) is a real physical phenomenon, InfiniBand transceivers are shielded and operate within strict industrial standards. They do not typically corrupt the I2C or IPMI buses used for thermal sensing. If the BMC reports a "Critical" error, it is almost certainly reacting to a real thermal threshold breach rather than sensor "noise" from a transceiver.
Option C: (Incorrect) The Error: The TPM (Trusted Platform Module) is a security chip used for cryptographic keys, measured boot, and system integrity. It has no functional role in the communication path between the BMC and the fan control board or the Power Supply Units (PSUs). Fan PWM (Pulse Width Modulation) signals are managed by the BMC based on thermal profiles, completely independent of the TPM configuration.
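In practice the technician would confirm this by reading the BMC's temperature sensors (for example with `ipmitool sdr type temperature`). The sketch below parses SDR-style text and flags any sensor at or above a critical threshold; the sample lines and the pipe-separated field layout are assumptions for illustration, since SDR formatting varies by BMC vendor.

```python
# Sketch: flag over-temperature sensors from `ipmitool sdr`-style output.
# SAMPLE is fabricated example text; real SDR field layout varies by BMC.
SAMPLE = """\
GPU1 Temp        | 30h | ok  | 7.1 | 94 degrees C
GPU2 Temp        | 31h | ok  | 7.1 | 41 degrees C
Inlet Temp       | 32h | ok  | 7.1 | 24 degrees C
"""

def overheating(sdr_text, critical_c=90):
    """Return (sensor, temp) pairs at or above the critical threshold."""
    hot = []
    for line in sdr_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 5 and fields[4].endswith("degrees C"):
            temp = int(fields[4].split()[0])
            if temp >= critical_c:
                hot.append((fields[0], temp))
    return hot

print(overheating(SAMPLE))  # -> [('GPU1 Temp', 94)]
```

A GPU reading in the 90s at idle, as in the fabricated first line, is exactly the missing-shroud signature described above.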
Question 36 of 60
36. Question
A technician is validating a new cluster and needs to verify the signal quality of the transceivers and the firmware version on the BlueField-3 DPUs. Which tool or method provides the most detailed information regarding the optical power levels and internal switch status for these components?
Correct
Option C: mlxfwmanager: This is the standard tool within the NVIDIA MFT (Mellanox Firmware Tools) package used to query the current firmware version, PSID, and device ID of BlueField DPUs and ConnectX adapters. It allows the technician to compare the burned version against the latest available for the device.
mlxreg and mlxdump: These companion MFT tools allow deep-level queries of the device's internal registers.
DDM (Digital Diagnostic Monitoring): This is the industry-standard method for real-time monitoring of optical transceiver parameters. Tools like mlxlink (part of the MFT suite) use DDM data to report critical metrics:
TX/RX Optical Power Levels (measured in dBm or mW).
Internal Temperature of the transceiver.
Voltage and Bias Current.
Analysis of Incorrect Options
Option A: (Incorrect) The Error: While LEDs (Link/Activity) provide a quick visual status of physical connectivity, they are not a versioning or diagnostic tool. A green LED simply indicates a successful link layer establishment; it cannot confirm if the firmware is the latest version or if the optical signal is marginal (near failure) versus healthy.
Option B: (Incorrect) The Error: Navigating to a manufacturer's website is a manual research task, not a technical validation method. It does not provide real-time data from the actual hardware installed in the server. Furthermore, many NVIDIA-branded transceivers are managed through NVIDIA's own support ecosystem, not third-party retail sites.
Option D: (Incorrect) The Error: Checking /var/log/messages or dmesg can show if a link went down or if there was a driver crash, but it is reactive, not proactive. Standard system logs rarely contain granular DDM data (like specific dBm power levels) unless a specific logging daemon is configured to poll that data. grep is a text-search utility, not a hardware diagnostic tool.
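Since DDM tools report optical power in either mW or dBm, the conversion dBm = 10·log10(P/1 mW) is worth internalizing. The sketch below converts and applies an RX warning floor; the −10 dBm threshold is an illustrative assumption (real modules expose their own alarm/warning thresholds via DDM).

```python
import math

def mw_to_dbm(p_mw):
    """Convert optical power from milliwatts to dBm: 10 * log10(P / 1 mW)."""
    return 10 * math.log10(p_mw)

def rx_ok(p_mw, low_warn_dbm=-10.0):
    # -10 dBm is an illustrative floor only; read the module's own DDM
    # warning thresholds (via mlxlink or similar) in a real check.
    return mw_to_dbm(p_mw) >= low_warn_dbm

print(round(mw_to_dbm(1.0), 1))  # -> 0.0 (1 mW is the 0 dBm reference)
print(rx_ok(0.5))                # -> True  (-3.0 dBm, healthy)
print(rx_ok(0.05))               # -> False (-13.0 dBm, below the floor)
```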
Question 37 of 60
37. Question
During the validation of signal quality on InfiniBand cables, an administrator uses the 'ibdiagnet' tool. They notice a high count of 'SymbolErrors' and 'LinkErrorRecovery' events on several ports. What is the most likely cause of these errors, and what is the recommended corrective action according to the Cluster Test and Verification domain?
Correct
Correct: D. The errors indicate physical layer issues, such as dirty fiber connectors or poorly seated transceivers; the cables should be cleaned and reseated.
This is correct because the NCP-AII certification blueprint explicitly includes "Validate cables by verifying signal quality," "Confirm cabling is correct," and "Confirm FW on transceivers" as core tasks within the Cluster Test and Verification domain, which comprises 33% of the examination.
High counts of 'SymbolErrors' and 'LinkErrorRecovery' events detected by ibdiagnet are classic indicators of physical layer problems in InfiniBand fabrics.
According to InfiniBand diagnostic documentation, "recovery errors are major errors" and when detected, "the respective links must be investigated for the cause of the rapid symbol error propagation".
These errors are typically caused by:
Dirty or contaminated fiber connectors
Poorly seated transceivers or cables
Damaged cables
Signal integrity issues at the physical connection
The recommended corrective action aligns with standard troubleshooting methodology: clean the connectors and reseat the cables to restore proper signal quality.
Incorrect: A. These errors are normal for high-speed fabrics and can be ignored as long as the NCCL tests eventually complete.
This is incorrect because high counts of symbol errors and link recovery events are not normal and indicate physical layer degradation. The InfiniBand specification allows a maximum symbol error rate of only 120 errors per hour, corresponding to a bit error rate (BER) of 10^-12. Ignoring these errors would allow underlying physical problems to persist, potentially leading to performance degradation or complete link failure.
B. The errors are caused by a software bug in the Slurm scheduler; the administrator should restart the slurmctld service.
This is incorrect because ibdiagnet is an InfiniBand fabric diagnostic tool that reports physical layer errors at the hardware level. Slurm is a workload manager for job scheduling and has no relationship to InfiniBand physical layer error counters. Restarting Slurm services would not affect symbol errors or link recovery events reported by ibdiagnet.
C. The errors mean the GPU temperature is too high; the administrator should decrease the GPU power limit using nvidia-smi.
This is incorrect because 'SymbolErrors' and 'LinkErrorRecovery' are InfiniBand fabric statistics related to physical layer signal integrity, not GPU temperature metrics. GPU thermal issues would be visible through nvidia-smi as thermal throttle reasons, not through InfiniBand diagnostic tools. Adjusting GPU power limits would not resolve physical layer cabling issues.
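The 120-errors-per-hour allowance translates directly into a counter-delta check: sample the port's SymbolErrorCounter twice and compare the growth rate against the limit. A minimal sketch (counter values below are made-up examples):

```python
def symbol_error_rate_ok(errors_start, errors_end, hours, max_per_hour=120):
    """Flag a link whose symbol-error counter grows faster than the allowance.

    max_per_hour=120 reflects the spec allowance cited above; the counter
    samples would come from two ibdiagnet/perfquery runs on the same port.
    """
    if hours <= 0:
        raise ValueError("sampling window must be positive")
    rate = (errors_end - errors_start) / hours
    return rate <= max_per_hour

print(symbol_error_rate_ok(0, 60, 1.0))     # -> True  (60/hour, within limit)
print(symbol_error_rate_ok(100, 700, 2.0))  # -> False (300/hour, investigate)
```

A port failing this check is a candidate for the clean-and-reseat procedure described above.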
Question 38 of 60
38. Question
An AI cluster is experiencing inconsistent performance on several nodes. Upon investigation, the administrator finds that these nodes are equipped with AMD CPUs and NVIDIA GPUs. Which optimization step should be performed to ensure the best performance for GPU-heavy AI workloads on these specific servers?
Correct
Option B: NPS (Nodes Per Socket): AMD EPYC processors use a multi-die architecture. The NPS setting in the BIOS determines how the processor partitions its memory and PCIe controllers into NUMA (Non-Uniform Memory Access) domains.
NPS1 treats the entire socket as one NUMA node.
NPS4 partitions the socket into four quadrants.
Choosing the correct NPS setting ensures that the GPU is "electrically close" to the CPU cores and memory it is communicating with, minimizing latency.
IOMMU (Input-Output Memory Management Unit): This must be correctly configured (and often set to "Pass-through" or "Off" depending on the specific distribution and use case) to enable GPUDirect RDMA. This allows a network card (NIC) to read/write GPU memory directly without CPU intervention, which is essential for scaling AI workloads across multiple nodes.
Option A: (Incorrect) The Error: InfiniBand is the backbone of high-performance AI clusters. Disabling it in favor of 1GbE would create a massive networking bottleneck, rendering the GPUs nearly useless for distributed training. Standard 1GbE cannot handle the bandwidth required for modern AI collective operations like AllReduce.
Option C: (Incorrect) The Error: The Nouveau drivers are open-source reverse-engineered drivers. They do not support the proprietary NVIDIA stack features required for AI, such as CUDA, NVLink, NCCL, or GPUDirect RDMA. For any NCP-AII certified infrastructure, the official NVIDIA Data Center drivers are mandatory.
Option D: (Incorrect) The Error: Setting the OS power governor to "Powersave" is the opposite of what is required for AI workloads. AI infrastructure should almost always use the "Performance" governor to ensure the CPU remains at high clock speeds to feed the GPUs. Additionally, "Automatic Clock Boost" alone does not address the underlying NUMA/topology issues that cause "inconsistent" performance in multi-node clusters.
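The NUMA-locality idea can be sketched as simple bookkeeping: map each GPU to its NUMA node, then pin the process driving that GPU to the node's local CPU list. All numbers below are made-up illustrations; on a real node the mappings come from `nvidia-smi topo -m` or from sysfs (the `numa_node` attribute of each GPU's PCI device).

```python
# Illustrative NUMA-affinity bookkeeping. The mappings are fabricated examples
# for an NPS2-style split; real values come from `nvidia-smi topo -m` or
# /sys/bus/pci/devices/<bdf>/numa_node.
gpu_numa = {0: 0, 1: 0, 2: 1, 3: 1}                      # GPU index -> NUMA node
numa_cpus = {0: list(range(0, 32)), 1: list(range(32, 64))}  # node -> CPU list

def local_cpus(gpu):
    """Return the CPU list a process driving this GPU should be pinned to."""
    return numa_cpus[gpu_numa[gpu]]

print(local_cpus(2)[:4])  # -> [32, 33, 34, 35]
```

Pinning with the wrong list (e.g., driving GPU 2 from node 0 cores) forces every host-to-device transfer across the inter-die fabric, which is one source of the "inconsistent" performance described in the question.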
Question 39 of 60
39. Question
A cluster engineer is validating the cabling of a 400Gb/s InfiniBand fabric. The engineer uses a tool to check the signal quality of the transceivers and notices a high Bit Error Rate (BER) on several links. What is the most appropriate action to resolve this issue and confirm correct cabling?
Correct: A. Clean the fiber optic connectors using specialized tools, ensure the transceivers are fully seated, and verify that the firmware is consistent across all switches and DPUs.
This is correct because the NCP-AII certification blueprint explicitly includes “Validate cables by verifying signal quality,” “Confirm cabling is correct,” and “Confirm FW on transceivers” as core tasks within the Cluster Test and Verification domain, which comprises 33% of the examination.
A high Bit Error Rate (BER) on InfiniBand links is a classic indicator of physical layer signal integrity issues.
The most common causes of high BER in fiber optic connections are:
Dirty or contaminated fiber connectors that degrade the optical signal
Poorly seated transceivers that prevent proper physical contact
Firmware mismatches across switches and DPUs that can cause compatibility issues and signal processing errors
The recommended corrective action follows the systematic troubleshooting methodology emphasized in the certification: first address physical layer problems (cleaning and reseating), then verify firmware consistency across all fabric components.
The exam blueprint’s specific inclusion of “Confirm FW on transceivers” and “Validate cables by verifying signal quality” directly supports this three-part corrective approach.
Incorrect: B. Delete the network configuration files on the host and reboot the system three times to allow the hardware to self-heal the physical fiber connections.
This is incorrect because deleting network configuration files and rebooting multiple times will not resolve physical layer signal integrity issues. High BER is caused by physical problems like dirty connectors or faulty transceivers, not software configuration. Hardware cannot “self-heal” physical connection issues through reboots.
C. Lower the speed of the entire network to 10Gb/s to eliminate errors, as AI training does not benefit from higher bandwidth anyway.
This is incorrect because reducing network speed is a workaround that accepts degraded performance rather than fixing the root cause. AI training workloads, especially large-scale distributed training, critically depend on high-bandwidth, low-latency communication (400Gb/s InfiniBand) for efficient collective operations. Deliberately reducing speed would severely impact cluster performance.
D. Wrap the fiber optic cables in aluminum foil to protect them from electromagnetic interference from the server power supplies.
This is incorrect because fiber optic cables transmit data using light pulses through glass fibers, which are immune to electromagnetic interference (EMI). Wrapping them in foil is unnecessary and ineffective. EMI affects copper cables, not fiber optics. This action demonstrates a fundamental misunderstanding of fiber optic technology.
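The pass/fail judgment on a measured BER reduces to a simple ratio check. A minimal sketch; the 1e-12 default threshold below is illustrative only, since real acceptance limits depend on the link speed and on whether the counter is read before or after forward error correction (FEC):

```python
def ber(bit_errors: int, bits_transferred: int) -> float:
    """Bit error rate: errors observed divided by bits carried."""
    if bits_transferred <= 0:
        raise ValueError("run traffic across the link before measuring BER")
    return bit_errors / bits_transferred

def link_passes(bit_errors: int, bits_transferred: int,
                max_ber: float = 1e-12) -> bool:
    """True if the measured BER is within the acceptance threshold.

    max_ber is illustrative; consult the fabric vendor's acceptance
    criteria for the real pre-FEC / post-FEC limits.
    """
    return ber(bit_errors, bits_transferred) <= max_ber
```

After cleaning connectors and reseating transceivers, re-running the counters through a check like this confirms whether the corrective action actually brought the link within tolerance.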
Question 40 of 60
40. Question
During the deployment of an AI factory, the security policy requires the initialization of the Trusted Platform Module (TPM) and the configuration of the Baseboard Management Controller (BMC) to ensure a secure boot process and remote management. If the administrator needs to perform these tasks across 100 nodes simultaneously before the operating system is installed, which methodology is considered best practice for scale and efficiency?
Option D: Redfish API: Redfish is the industry-standard, RESTful API for managing modern server hardware. It is the successor to IPMI and is designed specifically for scale-out environments.
Scalability: Because Redfish uses standard HTTP/JSON, an administrator can write a single script (using Python or curl) to push identical BIOS, TPM, and BMC configurations to 100+ nodes simultaneously.
Agentless: This can be done via the OOB (Out-of-Band) network port before an Operating System is even installed, making it the only viable “Day 0” deployment strategy for a cluster of this size.
Analysis of Incorrect Options
Option A: (Incorrect) The Error: Hardware-level settings like TPM initialization and BMC network configurations are stored in the server’s firmware (SPI flash) or dedicated security chips, not on the SSD. Cloning an OS drive will not configure the underlying hardware security policies or the remote management controller. Furthermore, physically swapping SSDs in 100 nodes is labor-intensive and inefficient.
Option B: (Incorrect) The Error: “Sneakernet” (physically walking to every server with a USB drive) is the antithesis of AI Factory best practices. It is prone to human error, slow, and does not scale. Modern NVIDIA-certified systems (like DGX or HGX) are designed to be managed remotely from the moment they are racked and cabled.
Option C: (Incorrect) The Error: While logging into a BMC web GUI is a valid way to configure one server, doing so for 100 nodes is a massive waste of administrative time. It is a “manual” process that cannot be easily audited or replicated perfectly across a large-scale cluster without the risk of configuration drift.
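The "single script for 100+ nodes" pattern can be sketched as follows. The BIOS attribute names (`TpmSecurity`, `SecureBootMode`) and the system id `1` are hypothetical: real attribute names are vendor-specific and should be read from each node's Bios resource first. Only the request construction is shown; sending would use any HTTP client over the OOB network with BMC credentials:

```python
import json

# Hypothetical attribute names -- real Redfish BIOS attributes are
# vendor-specific and must be discovered from the node's Bios resource.
BIOS_SETTINGS = {"TpmSecurity": "On", "SecureBootMode": "Deployed"}

def redfish_bios_requests(bmc_addresses):
    """Build one Redfish PATCH (url, json_body) pair per node.

    Targets the standard /redfish/v1/Systems/<id>/Bios/Settings
    pending-settings resource; the system id '1' is an assumption.
    The same loop drives 1 node or 100 -- scale comes from iteration,
    not extra tooling.
    """
    body = json.dumps({"Attributes": BIOS_SETTINGS})
    return [(f"https://{addr}/redfish/v1/Systems/1/Bios/Settings", body)
            for addr in bmc_addresses]
```

Because every node receives a byte-identical payload, the configuration is auditable and free of the drift risk inherent in per-node GUI work.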
Question 41 of 60
41. Question
After installing a new set of InfiniBand cables in a liquid-cooled AI cluster, the administrator runs an NCCL burn-in test. They notice that while most nodes pass, one specific link consistently fails after 15 minutes of heavy load. What is the most likely cause that should be investigated during the Cluster Test and Verification phase?
Correct: B. A thermal issue where the transceiver is overheating under load due to poor airflow or a faulty cooling loop in that rack section.
The NCP-AII certification blueprint explicitly includes “Confirm FW on transceivers,” “Validate cables by verifying signal quality,” and “Identify faulty cards, GPUs, and power supplies” as core tasks within the Cluster Test and Verification and Troubleshoot and Optimize domains.
A failure that occurs consistently after 15 minutes of heavy load on a single link is a classic symptom of a thermal issue. Under sustained load, optical transceivers generate heat, and if cooling is inadequate, they can overheat and fail.
The InfiniBand switch health monitoring documentation specifically includes “Transceivers” as a monitored component and tracks “Temperature Sensors” to prevent overheating. A faulty cooling loop or poor airflow in a specific rack section would cause localized overheating only for that link.
The ConnectX-5 adapter documentation confirms that transceivers have thermal thresholds, and crossing those thresholds impacts system operation. It notes that users “can read these thermal sensors and adapt the system airflow following the readouts and the needs.”
In a liquid-cooled environment, a faulty cooling loop in one rack section would explain why only one specific link fails while others remain operational, and why the failure only manifests under sustained load when heat builds up.
Incorrect: A. The NGC CLI is using an outdated API key which causes the NCCL container to time out after a fixed duration of use.
This is incorrect because NGC CLI is used for downloading containers and managing NGC resources, not for runtime container execution. An outdated API key would prevent container downloads entirely, not cause a link to fail after 15 minutes of heavy load during an NCCL test. The NCCL container, once downloaded, does not require ongoing NGC CLI authentication to run.
C. The Slurm scheduler is assigning too many MPI ranks to the node, causing a memory overflow in the BCM database.
This is incorrect because Slurm is a workload manager for job scheduling, not a runtime component that would cause a specific network link to fail after 15 minutes. BCM (Base Command Manager) is a separate management tool, and a memory overflow in its database would not manifest as a physical link failure. This option incorrectly conflates job scheduling with physical layer issues.
D. The TPM is locking the PCIe bus because it detects an unauthorized network packet during the burn-in test execution.
This is incorrect because the Trusted Platform Module (TPM) is a security chip for cryptographic operations and platform integrity (secure boot, encryption), not for PCIe bus locking or network packet inspection. TPM has no role in monitoring or blocking network traffic during runtime. This description has no basis in actual TPM functionality.
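Once module temperature samples have been collected during the burn-in (e.g. from the switch's transceiver sensors), isolating the overheating link is a simple comparison. A sketch under stated assumptions: the link names are hypothetical and the 70 °C limit is illustrative, since real thresholds come from the transceiver datasheet:

```python
def overheating_links(temps_by_link, limit_c=70.0):
    """Return link names whose peak module temperature exceeded limit_c.

    temps_by_link maps a link name to the temperature samples (deg C)
    collected while the burn-in ran; limit_c is illustrative -- real
    thresholds come from the transceiver datasheet.
    """
    return sorted(link for link, samples in temps_by_link.items()
                  if samples and max(samples) > limit_c)
```

A link that only appears in this list late in the run, as heat accumulates, matches the "fails after 15 minutes under load" symptom described above.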
Question 42 of 60
42. Question
To enable containerized AI workloads to access the GPU hardware, an administrator must install the NVIDIA Container Toolkit. Which component of the toolkit is responsible for interfacing with the container runtime (such as Docker or containerd) to mount the NVIDIA libraries and device nodes into the container?
Option C: The Mechanism: The nvidia-container-runtime-hook (often used via the nvidia-container-toolkit) is a binary that implements the OCI (Open Container Initiative) pre-start hook specification.
The Function: When a container is created but before it actually starts running the user application, the container runtime (Docker, Containerd, or CRI-O) calls this hook.
What it does: It queries the GPU capability requirements, locates the NVIDIA libraries (like libcuda.so) and device nodes (like /dev/nvidia0) on the host, and “bind-mounts” them into the container’s file system. This allows the container to remain portable while still accessing the underlying hardware.
Analysis of Incorrect Options
Option A: (Incorrect) The Error: DOCA (Data Center Infrastructure-on-a-Chip Architecture) is the SDK and driver stack for BlueField DPUs, not for GPU-to-container orchestration. While DOCA handles networking and storage offloads, it does not manage the mounting of GPU device nodes into a Docker or Containerd environment.
Option B: (Incorrect) The Error: The VBIOS (Video BIOS) is a low-level firmware flashed onto the GPU hardware itself. It initializes the hardware during the boot process. It is not a software component that interfaces with high-level container runtimes, nor does it provide a “virtualized interface” for containers; the NVIDIA driver and the container toolkit handle that abstraction.
Option D: (Incorrect) The Error: While Slurm and Enroot are widely used in NVIDIA Base Command and HPC clusters, their primary purpose is not encryption for authorization. Enroot is a tool that turns container images into unprivileged sandboxes, and while it works with the NVIDIA Container Toolkit, it is not the component of the toolkit responsible for the runtime hook logic described in the question.
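The hook's device-selection step can be modeled in a few lines. This is a deliberately simplified sketch of how `NVIDIA_VISIBLE_DEVICES` is interpreted, not the toolkit's actual implementation: the real hook also injects driver libraries, consults the host's ldcache, and enumerates GPUs rather than taking a count as a parameter:

```python
def devices_to_mount(visible_devices: str, gpus_on_host: int = 8):
    """Simplified model of a prestart hook's device-node selection.

    Interprets NVIDIA_VISIBLE_DEVICES ('all', 'none', or a comma list
    of indices) and returns the /dev nodes to bind-mount. Control
    nodes are needed alongside the per-GPU nodes; the real hook also
    injects the driver libraries (libcuda.so etc.), which this sketch
    omits.
    """
    if visible_devices in ("", "none", "void"):
        return []
    control = ["/dev/nvidiactl", "/dev/nvidia-uvm"]
    if visible_devices == "all":
        indices = range(gpus_on_host)  # real hook enumerates the host's GPUs
    else:
        indices = [int(i) for i in visible_devices.split(",")]
    return control + [f"/dev/nvidia{i}" for i in indices]
```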
Question 43 of 60
43. Question
In a scenario where an AI cluster is experiencing high latency during collective communications, the administrator suspects that the MIG configuration on the GPUs is improperly aligned with the physical network topology. What is the most effective way to verify the current MIG status and its impact on the hardware resources?
Correct: B. Run 'nvidia-smi mig -lgip' to list the GPU instance profiles and cross-reference them with the physical PCIe placement of the BlueField-3 DPU.
This is correct because the NCP-AII certification blueprint explicitly includes "Configure MIG (AI and HPC)" as a core topic within the Physical Layer Management domain.
The command nvidia-smi mig -lgip is the standard NVIDIA tool to list the GPU instance profiles and view the available MIG configurations. This shows the administrator how the GPU is partitioned.
High latency during collective communications can occur if MIG instances are not properly aligned with the physical network topology. Cross-referencing the MIG instance placement with the PCIe topology ensures that each GPU slice communicates with the correct network interface (BlueField-3 DPU) over the optimal PCIe path, avoiding unnecessary data hops through the CPU or other NUMA nodes.
This verification step directly addresses the need to validate hardware resource alignment when troubleshooting performance issues, as outlined in the Troubleshoot and Optimize domain.
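The verification steps above can be sketched with standard nvidia-smi queries; device names and profile IDs will vary by system:

```shell
# List the GPU instance profiles supported by each GPU (MIG must be enabled)
nvidia-smi mig -lgip

# List the GPU instances that have actually been created
nvidia-smi mig -lgi

# Show the PCIe/NUMA affinity matrix between GPUs and NICs, used to check
# that each MIG-backed workload reaches the DPU over the shortest path
nvidia-smi topo -m
```

The `topo -m` output labels each connection PIX/PXB/PHB/NODE/SYS, where PIX (same PCIe switch) is the shortest path and SYS (crossing NUMA nodes) is the longest.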
Incorrect: A. Execute an HPL test on each MIG partition and compare the thermal output to the manufacturer's specification for the HGX chassis fans.
This is incorrect because HPL (High-Performance Linpack) is a compute performance and thermal stability benchmark, not a tool for verifying MIG topology alignment with the network fabric. While thermal output is important for system health, it does not diagnose communication latency issues related to MIG-network misalignment.
C. Use the 'ibstatus' command to check if the MIG partitions are emitting InfiniBand beacons and adjust the BIOS settings to enable SR-IOV for the NVLink fabric.
This is incorrect for multiple reasons. First, ibstatus is an InfiniBand device status tool, not a command for querying MIG configuration. Second, MIG partitions do not "emit InfiniBand beacons"; this is a fabricated concept. Third, SR-IOV is a virtualization feature for network adapters and is unrelated to NVLink fabric configuration. NVLink is a GPU interconnect technology that does not use SR-IOV.
D. Download the NGC CLI and use the 'ngc config' command to remotely reset the MIG hardware registers via the cloud-based management portal.
This is incorrect because the NGC CLI is a tool for downloading containers and managing NGC resources, not for configuring or resetting MIG hardware registers. MIG configuration is performed locally using nvidia-smi commands on the host system. There is no cloud-based portal for resetting MIG hardware registers.
Question 44 of 60
44. Question
A technician is troubleshooting a suspected faulty fan in an NVIDIA-certified server. The BMC reports that ‘Fan 4‘ is spinning at 0 RPM, but the server is still running. What is the most appropriate action to take to ensure the continued health of the system while minimizing downtime for the AI workloads?
Correct: Option A. Redundancy & QoS: NVIDIA-certified systems (like DGX or HGX) are designed with N+1 fan redundancy. A single fan failure (0 RPM) won't cause an immediate system crash, but it eliminates the safety margin.
Maintenance Planning: In a professional AI environment, the goal is to avoid "emergency" shutdowns. Scheduling a window allows you to migrate workloads (e.g., via a Slurm or Kubernetes drain) and replace the module, which is often hot-swappable on certified chassis, restoring full redundancy without risking unplanned thermal throttling or hardware damage during a peak training run.
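A maintenance window of this kind is typically prepared by draining the node in the scheduler first; the commands below are a sketch, with a hypothetical node name:

```shell
# Slurm: stop new jobs from landing while running jobs finish
scontrol update NodeName=gpu-node-042 State=DRAIN Reason="Fan 4 replacement"
sinfo -N -n gpu-node-042 -o "%N %T %E"   # confirm state and drain reason

# Kubernetes: evict pods ahead of the hardware swap
kubectl drain gpu-node-042 --ignore-daemonsets --delete-emptydir-data
```

After the fan module is replaced and the BMC reports healthy RPM again, the node is returned to service with `scontrol update NodeName=gpu-node-042 State=RESUME` or `kubectl uncordon gpu-node-042`.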
Analysis of Incorrect Options
Option B: (Incorrect) The Error: This is an overreaction. While thermal damage is serious, a single fan failure in a redundant system does not justify a cluster-wide emergency outage. NVIDIA-certified systems have built-in thermal safeguards; if a node becomes dangerously hot, it will throttle or shut itself down independently. Shutting down the entire cluster would cause massive, unnecessary downtime for all research teams.
Option C: (Incorrect) The Error: While the BMC might automatically increase the speed of the remaining fans to compensate (a "fail-to-high" policy), leaving a faulty fan in place indefinitely violates data center best practices. The increased vibration and wear on the remaining fans can lead to cascading failures. Furthermore, AI workloads are extremely power-intensive; operating without full redundancy is a high-risk gamble.
Option D: (Incorrect) The Error: nvidia-smi only reports the temperature of the GPU cores. It does not report the temperature of other critical components such as the VRMs (Voltage Regulator Modules), NVSwitches, or memory modules that might sit in the "dead zone" of the failed fan. Ignoring a 0 RPM report simply because the GPU looks "okay" can lead to silent component degradation or localized hot spots.
Question 45 of 60
45. Question
An infrastructure engineer is validating the cabling for a large-scale AI cluster using InfiniBand NDR transceivers and Twinax copper cables. During the signal quality verification, several links show high Bit Error Rates (BER). Which action is the most appropriate according to NVIDIA validation standards for ensuring physical layer stability before proceeding to software installation?
Correct: A. Replace the copper cables with Active Optical Cables (AOC) if the distance exceeds 3 meters, or check for exceeded bend radius on existing cables.
This is correct because the NCP-AII certification blueprint explicitly includes "Describe and validate cable types and transceivers," "Validate cables by verifying signal quality," and "Confirm cabling is correct" as core tasks within the System and Server Bring-up and Cluster Test and Verification domains.
For high-speed InfiniBand NDR (400G) connections using Twinax copper cables, there are critical physical limitations:
Passive Twinax copper cables are designed for short-distance connections, typically up to 3 meters.
Beyond this distance, signal degradation occurs, resulting in high Bit Error Rates (BER).
Active Optical Cables (AOC) are the appropriate solution for longer distances, providing better signal integrity over extended runs.
Another common cause of high BER with copper cables is exceeding the minimum bend radius (34.5 mm for some cables), which can damage internal conductors and degrade signal quality.
The NVIDIA documentation for troubleshooting bad links specifies a systematic sequence: pull out, clean, and reinsert connections; but for copper cables at longer distances, replacement with appropriate media (AOC) is necessary.
This approach directly aligns with the certification's emphasis on physical layer validation before proceeding to software installation.
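In practice, per-link signal quality is read out with the NVIDIA/Mellanox firmware tools before deciding to swap media; exact flags vary by MFT version, so treat the following as a sketch:

```shell
# Physical-layer state, module/cable info, and error counters for one port
mlxlink -d mlx5_0 -m -c

# Fabric-wide sweep that flags links with elevated symbol/bit error rates
ibdiagnet --get_phy_info
```

Links whose raw BER stays above the vendor threshold after cleaning and reseating are candidates for media replacement, per the reasoning above.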
Incorrect: B. Manually force the port speed to a lower generation in the BIOS to compensate for the signal degradation.
This is incorrect because reducing port speed is a workaround that accepts degraded performance rather than fixing the root cause. AI workloads require the full 400 Gbps bandwidth for efficient collective operations. The certification emphasizes validating signal quality and correcting physical layer issues, not compromising on performance specifications.
C. Ignore the BER if the link state is Up because the Subnet Manager will automatically correct all packet errors at the transport layer.
This is incorrect because high Bit Error Rates indicate physical layer signal integrity problems that cannot be fully corrected by higher-layer protocols. The NVIDIA InfiniBand troubleshooting guide explicitly lists "High BER reported" as an alert requiring action, including cleaning connectors, reseating cables, or replacing transceivers. Ignoring these errors risks performance degradation and eventual link failure.
D. Use the NGC CLI to reset the GPU firmware which often recalibrates the integrated network controllers.
This is incorrect because the NGC CLI is a tool for downloading containers and managing NGC resources, not for resetting GPU firmware or recalibrating network controllers. GPU firmware updates are performed with dedicated flashing utilities such as nvflash, not with the NGC CLI. In any case, Bit Error Rate issues are rooted in physical layer cabling, not GPU firmware.
Question 46 of 60
46. Question
When configuring a BlueField-3 DPU for an AI factory, an administrator needs to ensure that the network traffic can be accelerated using DOCA. Which software component must be installed on the DPU‘s internal operating system to provide the necessary drivers and libraries for offloading transport-layer functions?
Correct: D. The DOCA SDK and Runtime must be installed on the DPU to enable the development and execution of accelerated networking applications.
The NCP-AII certification blueprint explicitly includes "Installing GPU and DOCA drivers" and "NVIDIA DOCA driver installation and updates" as core tasks within the Control Plane Installation and Configuration domain, which comprises 19% of the examination.
DOCA (Data Center Infrastructure-on-a-Chip Architecture) is the software infrastructure for programming NVIDIA BlueField DPUs, containing "a runtime and development environment, including libraries and drivers for device management and programmability, for the host and as part of a BlueField Platform Software."
The DOCA SDK and Runtime installed on the DPU's internal operating system provide the necessary libraries for offloading transport-layer functions and accelerating networking applications.
The DOCA Developer Guide confirms that the BlueField DPU runs its own operating system (supplied via a BFB image) and that DOCA's development container can be deployed on top of BlueField for developing accelerated applications.
For transport-layer acceleration specifically, DOCA includes specialized libraries like DOCA Flow for networking offloads, as well as features such as kTLS (kernel Transport Layer Security) offload, which lets the NIC/DPU accelerate the encryption, decryption, and authentication of network traffic.
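A quick way to confirm the DOCA runtime is present on the DPU's Arm OS is to log in over the RShim/tmfifo interface and inspect the installed packages; the address, username, and package names below reflect common BlueField defaults and may differ on your deployment:

```shell
# From the host: the tmfifo_net0 interface commonly exposes the DPU at this address
ssh ubuntu@192.168.100.2

# On the DPU: check the BlueField image release and the installed DOCA packages
cat /etc/mlnx-release
dpkg -l | grep -i doca
```

If the DOCA packages are missing, the BFB image (or the DOCA runtime meta-package) needs to be installed or reinstalled before transport-layer offloads can be used.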
Incorrect: A. The Slurm Workload Manager must be installed on the DPU to schedule the network packets as if they were compute jobs.
This is incorrect because Slurm is a workload manager for job scheduling on compute nodes, not for network packet processing on DPUs. The NCP-AII blueprint lists Slurm under Control Plane Installation and Configuration for cluster orchestration, not for DPU networking functions.
B. The Base Command Manager (BCM) must be installed inside the DPU to manage the power and cooling of the DPU's internal heatsink.
This is incorrect because Base Command Manager is a cluster management tool for node provisioning and lifecycle management, not for DPU thermal management. Power and cooling of the DPU hardware are handled by the BMC and system thermal firmware, not by BCM.
C. The NVIDIA Container Toolkit must be installed inside the DPU to allow Docker containers to access the Arm cores.
This is incorrect because the NVIDIA Container Toolkit is designed to enable GPU access within containers on the host, not DPU Arm core access. While containers can run on the DPU, the Container Toolkit's purpose is GPU acceleration for containers, not DPU programming or network acceleration.
Question 47 of 60
47. Question
A network engineer is configuring a BlueField-3 Data Processing Unit (DPU) to act as a secure offload engine for the AI cluster's management plane. To ensure the DPU is correctly integrated into the fabric, which action must be taken to manage the DPU independently of the host CPU while providing networking services to the host?
Correct
Correct: C. Configure the DPU in Separated Mode where the DPU OS runs independently and manages its own network interfaces and security policies.
This is correct because Separated Mode, also known as the symmetric model, is explicitly designed to allow both the host and the DPU Arm cores to operate network functions independently.
In this mode, "a network function is assigned to both the Arm cores and the host cores," with the ports and functions being symmetric and "no dependency between the two functions."
Each function has its own MAC address, enabling them to operate "simultaneously or separately," with the ability to send and receive Ethernet and RDMA traffic.
The NCP-AII certification blueprint explicitly includes "Configure and manage a BlueField® network platform" as a core task within the Physical Layer Management domain.
This configuration enables the DPU to act as a secure offload engine by running its own operating system and management stack independently, providing networking services to the host while maintaining separate control over network interfaces and security policies.
Incorrect: A. Disable the internal Arm cores of the BlueField DPU to allow the host operating system to take full control of the network hardware resources.
This is incorrect because disabling the Arm cores puts the DPU into NIC Mode, where "the Arm cores of BlueField are inactive, and the device functions as an NVIDIA® ConnectX® network adapter." This would prevent the DPU from acting as a secure offload engine and defeat the purpose of independent management.
B. Install a standard Ethernet driver on the host and ignore the BlueField-specific management tools as they are only used for basic troubleshooting.
This is incorrect because BlueField-specific management tools and the DOCA software framework are essential for unlocking the DPU's full potential as a programmable infrastructure processor. Ignoring these tools would prevent proper configuration and management of the DPU as an independent offload engine.
D. Set the DPU to Bridge Mode so that all traffic passes through the host CPU for inspection before being processed by the BlueField hardware acceleration.
This is incorrect because "Bridge Mode" is not a recognized operational mode in NVIDIA BlueField documentation. Additionally, forcing traffic through the host CPU contradicts the goal of offloading networking and security tasks to the DPU. The documented modes are NIC Mode, DPU Mode (ECPF), Restricted Mode (Zero Trust), and Separated Host Mode.
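As a sketch of how this mode selection is typically made in practice: on BlueField, the ownership model of the embedded Arm cores is exposed through the mlxconfig utility from the NVIDIA MFT suite. The MST device path below is illustrative and will differ per system, and the exact parameter values should be verified against the BlueField modes-of-operation documentation for your firmware release.

```shell
# Illustrative only -- requires the MFT suite and a BlueField DPU;
# the MST device path is an example and varies per system.
mst start
# Query the current ownership model of the embedded Arm cores:
mlxconfig -d /dev/mst/mt41692_pciconf0 q INTERNAL_CPU_MODEL
# INTERNAL_CPU_MODEL=1 (EMBEDDED_CPU) is DPU mode; =0 (SEPARATED_HOST)
# lets host and Arm network functions run independently (Separated Mode).
mlxconfig -d /dev/mst/mt41692_pciconf0 s INTERNAL_CPU_MODEL=0
# A firmware reset or power cycle is required for the change to take effect:
mlxfwreset -d /dev/mst/mt41692_pciconf0 reset
```

After the reset, each side should expose its own symmetric network function with its own MAC address, matching the behavior described above.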
Question 48 of 60
48. Question
During the server bring-up phase, an administrator is configuring the Out-of-Band (OOB) management and the Trusted Platform Module (TPM). The goal is to ensure secure remote management and hardware-rooted identity for the AI nodes. What is the primary purpose of initializing the TPM 2.0 module in the BIOS/UEFI settings before deploying the operating system and NVIDIA drivers?
Correct
Option D: Platform Integrity: The TPM 2.0 acts as secure storage for "measurements" (hashes) of the system's firmware, BIOS, and bootloader. This enables Secure Boot to verify that no malicious code has tampered with the system before the OS loads.
Hardware-Rooted Identity: By initializing the TPM in the BIOS, the system establishes a unique, hardware-bound identity. This is used for attestation, allowing remote management tools to prove that a node is genuine and has not been altered.
Key Management: In NVIDIA environments, the TPM is frequently used to store encryption keys for Self-Encrypting Drives (SEDs). For example, the nv-disk-encrypt tool on DGX systems uses the TPM to securely store the "vault" keys, ensuring data remains protected even if the system is reimaged or drives are removed.
Analysis of Incorrect Options
Option A: (Incorrect) The Error: A TPM is a low-speed cryptoprocessor designed for small-scale key management and hash storage. It does not have the computational throughput to offload high-performance encryption tasks from a GPU. Modern NVIDIA GPUs handle their own internal encryption (e.g., through AES-GCM hardware engines) without needing the TPM for data-plane performance.
Option B: (Incorrect) The Error: GPU overclocking and thermal management are handled by the GPU BIOS (VBIOS) and the NVIDIA driver/SMI suite. While the TPM stores security parameters, it is not involved in the real-time performance or frequency scaling of the GPU cores.
Option C: (Incorrect) The Error: The TPM is a security component, not a networking or firewall-bypass tool. The BMC (Baseboard Management Controller) manages the OOB interface independently. While the BMC may use the TPM for its own secure boot process, it does not use it to circumvent network security policies or firewalls.
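A quick post-deployment sanity check follows from this: once the OS is installed, a TPM that was initialized in BIOS/UEFI should be visible to Linux. A minimal sketch, assuming a standard kernel sysfs layout and the tpm2-tools package (both assumptions to verify on the target image):

```shell
# Confirm the kernel enumerated the TPM that was enabled in BIOS/UEFI:
ls /sys/class/tpm/                          # expect a tpm0 entry
cat /sys/class/tpm/tpm0/tpm_version_major   # expect 2 for TPM 2.0
# Read the PCR banks that hold the firmware/bootloader measurements
# used by Secure Boot verification and remote attestation:
tpm2_pcrread sha256:0,7
```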
Question 49 of 60
49. Question
An administrator is installing Base Command Manager to orchestrate a new AI cluster. During the setup of the head node, they must configure High Availability. What is the primary mechanism BCM uses to ensure the cluster remains operational if the primary head node suffers a catastrophic hardware failure?
Correct
Correct: A. BCM configures a secondary head node that synchronizes its database and configuration files with the primary and uses a heartbeat mechanism to trigger failover.
This is correct because the NCP-AII certification blueprint explicitly includes "Install the Base Command™ Manager (BCM), configure and verify the HA" as a core task within the Control Plane Installation and Configuration domain, which comprises 19% of the examination.
During BCM installation, the license request specifically prompts for HA configuration (the "HA" field, set to Y when HA is in use), indicating that HA setup requires designating a secondary head node during the licensing phase.
The license request process also requires specifying "the MAC address of the first NIC of the secondary head node so that it can also serve the BCM licenses in the event of a failover," confirming that a second head node is prepared with shared access to licensing.
The heartbeat network is implied through the requirement that both head nodes have connectivity to serve licenses during failover, and cluster database and configuration synchronization is managed through BCM's internal mechanisms when HA is properly configured.
This architecture ensures that if the primary head node suffers a catastrophic hardware failure, the secondary node can detect the loss of heartbeat and take over management of the cluster.
Incorrect: B. BCM uses a round-robin DNS strategy to distribute Slurm job requests to all compute nodes simultaneously, bypassing the need for a management node.
This is incorrect because round-robin DNS is a simple load-balancing technique for distributing traffic across multiple servers, not a High Availability solution for cluster management. BCM's HA architecture requires a dedicated secondary head node, not DNS-based distribution. Compute nodes cannot bypass the management node entirely, as they require provisioning, monitoring, and job scheduling from the head node.
C. BCM utilizes the GPU's NVLink interconnect to mirror the entire operating system of the head node onto the first compute node in the cluster.
This is incorrect because NVLink is a high-speed GPU-to-GPU interconnect technology, not a mechanism for OS mirroring or cluster-management failover. NVLink operates at the GPU level for peer-to-peer communication and has no role in head node redundancy or operating system replication.
D. BCM requires the administrator to manually copy the Slurm configuration to a USB drive and plug it into a different server whenever the primary fails.
This is incorrect because manual intervention contradicts the purpose of High Availability, which is designed to provide automatic failover without human intervention. The certification emphasizes automated HA configuration through BCM's built-in mechanisms, not manual recovery procedures.
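As an illustrative verification step once HA has been set up, BCM's Bright-derived tooling provides a dedicated HA utility. Command availability and exact output depend on the BCM release, so treat this as a sketch rather than a definitive procedure:

```shell
# On a head node: report which head is active, whether the passive head's
# database is in sync, and the health of the failover (heartbeat) checks:
cmha status
# From the passive head node, a controlled failover can be exercised to
# validate the HA pair before putting the cluster into production:
cmha makeactive
```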
Question 50 of 60
50. Question
An infrastructure engineer is validating the cabling for a large-scale AI cluster using InfiniBand NDR transceivers and Twinax copper cables. During the signal quality verification phase, several links report high Bit Error Rates (BER). Which action is the most appropriate according to NVIDIA validation standards to ensure physical layer stability before proceeding to the software control plane installation?
Correct
Option A: Distance Limitations for NDR: In the NVIDIA LinkX portfolio, passive DAC (Direct Attach Copper) cables for NDR/400G are strictly limited by physics. While earlier generations (EDR/HDR) supported longer copper runs, at 400Gb/s the maximum reliable length for a passive DAC is typically 2.5 to 3 meters.
High BER and Bend Radius: Bit Error Rate (BER) is the primary metric for signal quality. If a cable longer than 3 meters is used, or if a cable is bent beyond its specified bend radius (which can pinch the internal twinax shielding), signal integrity will degrade, causing high BER.
The Solution: For distances beyond 3 meters, NVIDIA validation standards require AOCs (Active Optical Cables) or transceivers with fiber. AOCs use optical signaling, which is immune to the high-frequency attenuation and electromagnetic interference (EMI) that plague long copper runs at 400G speeds.
Analysis of Incorrect Options
Option B: (Incorrect) The Error: This is a dangerous misconception in high-performance computing (HPC). While some management layers can handle packet retransmission, a high BER at the physical layer causes constant "CRC errors" and "Symbol errors." This leads to massive performance jitter and link instability. The Subnet Manager (SM) manages the fabric logic, but it cannot "fix" a physically degraded signal. NVIDIA standards require the physical layer to be clean (virtually zero BER) before moving to software configuration.
Option C: (Incorrect) The Error: The NGC CLI is used for managing GPU-optimized containers and models; it is not a hardware firmware utility. While flint or mlxup (from the MFT suite) can update firmware, firmware resets rarely fix physical signal issues caused by incorrect cabling. The problem is electrical/optical, not logical.
Option D: (Incorrect) The Error: Forcing a link to a lower speed (e.g., dropping NDR 400G to HDR 200G) is a "band-aid" that defeats the purpose of an AI Factory design. AI workloads like Large Language Model (LLM) training are extremely sensitive to bandwidth. If the design calls for NDR, the technician must resolve the physical cabling issue rather than down-clocking the hardware, which would create a permanent bottleneck in the cluster.
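To make the verification step concrete: per-link BER counters and cable details are typically read with mlxlink from the MFT suite, and a measured effective BER can be compared against a pass threshold in a one-liner. The MST device path, the 1e-7 reading, and the 1e-12 threshold below are illustrative values for the sketch, not NVIDIA-mandated numbers:

```shell
# Hardware-dependent: prints link state, raw/effective BER counters (-c),
# module/cable info such as part number and length (-m), and eye data (-e):
# mlxlink -d /dev/mst/mt4129_pciconf0 -m -c -e

# Threshold check on a measured effective BER (example value):
ber="1e-7"
if awk -v b="$ber" 'BEGIN { exit !(b+0 > 1e-12) }'; then
  echo "high BER: reseat or shorten the DAC, or switch to an AOC"
fi
```

A link that fails this kind of check should be re-cabled before control-plane installation, in line with the clean-physical-layer requirement discussed above.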
Question 51 of 60
51. Question
A data scientist reports that their distributed training job is running 50 percent slower than usual. The administrator uses mlnx_perf and nvidia-smi to troubleshoot. They see high retransmission rates on the network and Power Brake events on the GPUs. What is the most likely root cause?
Correct
Correct: A. A failing Power Supply Unit is causing the GPUs to throttle and the network switch to drop packets due to unstable voltage.
This is correct because the combination of two distinct symptoms, high network retransmission rates (detected by mlnx_perf) and GPU "Power Brake" events (detected by nvidia-smi), points to a common root cause affecting multiple subsystems.
According to the NCP-AII certification blueprint, "Identify faulty cards, GPUs, and power supplies" is a core task within the Troubleshoot and Optimize domain, which comprises 12% of the examination.
A failing or unstable Power Supply Unit (PSU) can cause voltage fluctuations that simultaneously impact both the GPUs and the network infrastructure.
Dell support documentation specifically confirms that when nvidia-smi -q output shows "HW Power Brake Slowdown" as "Active," this indicates a hardware power-brake condition that throttles GPU performance.
In documented field cases, users experienced poor GPU performance with "HW Power Brake Slowdown" active, and the solution involved updating system firmware (BIOS, CPLD, iDRAC) to resolve the underlying power-delivery issue.
Unstable voltage from a failing PSU would also affect network switches and NICs, causing intermittent link issues, packet loss, and high retransmission rates observable in mlnx_perf.
Incorrect: B. The GPUs are waiting for a software update from the Windows Update service, which is blocking the InfiniBand fabric.
This is incorrect for multiple reasons. First, AI clusters running H100 GPUs use Linux-based operating systems, not Windows. Second, Windows Update does not run on Linux servers and cannot block InfiniBand fabrics. Third, software updates do not manifest as "Power Brake" events or network retransmissions in the manner described.
C. The Slurm scheduler has been set to slow mode by the administrator to save on the cluster's monthly electricity bill.
This is incorrect because Slurm does not have a configurable "slow mode" for power saving. Slurm is a workload manager for job scheduling, not a power management tool. While power saving can be configured through other mechanisms, the symptoms of GPU Power Brake events and network retransmissions indicate hardware-level issues, not scheduler configuration changes.
D. The users are using the wrong font in their Jupyter notebooks, which is causing the GPU to work harder to render the text.
This is incorrect because Jupyter notebook font selection has no impact on GPU compute performance or network retransmission rates. GPUs are designed for compute workloads, not text rendering in web interfaces. This option completely misunderstands GPU functionality and has no basis in NVIDIA diagnostic methodology.
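To illustrate, the throttle state can be read programmatically from nvidia-smi -q output. The snippet below is a minimal, hypothetical parser; the sample text mimics the "Clocks Event Reasons" section, whose exact layout and indentation vary by driver version:

```python
import re

def power_brake_active(nvidia_smi_q_output: str) -> bool:
    """Return True if any GPU reports 'HW Power Brake Slowdown : Active'."""
    pattern = re.compile(r"HW Power Brake Slowdown\s*:\s*Active", re.IGNORECASE)
    return bool(pattern.search(nvidia_smi_q_output))

# Illustrative excerpt of `nvidia-smi -q` output (layout varies by driver):
sample = """
    Clocks Event Reasons
        HW Slowdown                       : Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Active
"""
print(power_brake_active(sample))  # True
```

A check like this can be run fleet-wide to correlate power-brake events with PSU telemetry from the BMC.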
Question 52 of 60
52. Question
During a NeMo burn-in test, a specific node repeatedly fails with a 'Bus Error' when accessing the GPUs. All other nodes pass the test. What should the administrator check first to validate the physical hardware of the failing node?
Correct
Option C: The Logic: In a cluster where only one node fails while others succeed, the problem is local to that specific hardware. A "Bus Error" often results from a GPU "falling off the bus," which can be caused by physical instability.
nvidia-smi: This tool is used to check whether the OS still "sees" all GPUs and to look for XID errors (e.g., XID 79, "GPU has fallen off the bus"), which are specific indicators of PCIe or memory-related hardware faults.
BMC Logs: The Baseboard Management Controller records low-level hardware events, such as power spikes, thermal trips, or PCIe link-training failures, providing evidence of a physical seating or cabling issue that the OS might not fully capture.
NVLink Integrity: Since NeMo workloads rely heavily on high-speed GPU-to-GPU communication, verifying the NVLink bridge or the HGX baseboard's integrity is a primary step.
Analysis of Incorrect Options
Option A: (Incorrect) The Error: While a typo in a Slurm configuration (like an incorrect GRES count) might prevent a job from starting or cause it to crash with a "Resource not found" error, it will not trigger a low-level Bus Error. Bus errors occur at the kernel/hardware level, far below the abstraction of a workload scheduler's description field.
Option B: (Incorrect) The Error: The NGC CLI is a tool for downloading models and containers; it is not part of the active compute or memory-access path during a workload's execution. Updating it to a "beta version" would not resolve a recurring hardware interrupt or memory-access failure on a single node.
Option D: (Incorrect) The Error: This is a "destructive" and technically impossible troubleshooting step for the problem described. Cat5e is limited to 1 Gb/s Ethernet and cannot physically interface with InfiniBand (QSFP/OSFP) ports. Furthermore, a Bus Error refers to the internal PCIe/NVLink bus of the server, not the external cluster network. Swapping network cables would not fix a failure occurring inside the GPU-to-CPU communication path.
Question 53 of 60
53. Question
To ensure the reliability of the East/West (E/W) fabric in a large AI cluster, an administrator uses the NVIDIA ClusterKit to perform a multifaceted node assessment. What is the specific purpose of running the NCCL burn-in test as part of this assessment, and what does it reveal about the health of the cluster infrastructure?
Correct
Correct: D. It stresses the GPU-to-GPU communication over the InfiniBand fabric for an extended period to identify intermittent cable failures, transceiver overheating, or unstable links.
This is correct because the NCP-AII certification blueprint explicitly includes "Run NCCL to verify E/W fabric bandwidth" and "Perform NCCL burn-in" as core tasks within the Cluster Test and Verification domain, which comprises 33% of the examination.
NCCL (NVIDIA Collective Communications Library) tests are specifically designed to validate network communication between GPUs, both within a single node and across multiple nodes over the InfiniBand fabric.
The purpose of running an NCCL burn-in for an extended period is to identify intermittent issues that may not appear during short tests. As documented in real-world cluster validation experience, "To ensure being able to capture 'glitchy' InfiniBand links, we ran the tests for 15 minutes continuously."
Extended NCCL burn-in testing reveals:
Intermittent cable failures that only manifest under sustained load
Transceiver overheating issues that develop after thermal buildup
Unstable links that may drop packets or degrade performance over time
The Together.ai documentation confirms that NCCL tests are used to "validate the health of your GPU nodes and underlying hardware," specifically for "identifying intermittent cable failures, transceiver overheating, or unstable links."
This comprehensive validation of the East/West fabric ensures the cluster can handle the demanding communication patterns of distributed AI training workloads.
Incorrect: A. It validates that the NVIDIA Container Toolkit is correctly installed by running a simple 'hello world' container on every node in the cluster simultaneously.
This is incorrect because NCCL tests validate GPU-to-GPU communication over the high-speed fabric, not container toolkit installation. The NVIDIA Container Toolkit is validated through different methods, such as running docker run --rm --gpus all nvidia/cuda nvidia-smi.
B. It tests the read and write speeds of the local NVMe drives to ensure that data loading for AI models will not be a bottleneck during training.
This is incorrect because storage performance testing is a separate verification task explicitly listed in the exam blueprint under "Test storage". NCCL specifically tests network communication between GPUs, not local storage I/O performance.
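As a rough worked example of how nccl-tests reports results, the bus bandwidth (busbw) for an all-reduce is derived from the algorithm bandwidth (algbw) using the 2*(n-1)/n correction factor described in the nccl-tests performance notes. The sketch below assumes that convention; the example sizes and timings are illustrative:

```python
def allreduce_bus_bw(size_bytes: float, time_sec: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for an all-reduce, per the nccl-tests convention:
    busbw = algbw * 2*(n-1)/n, where algbw = size / time."""
    algbw = size_bytes / time_sec / 1e9        # algorithm bandwidth, GB/s
    return algbw * 2 * (n_ranks - 1) / n_ranks

# Illustrative: 8 ranks reducing 1 GiB in 25 ms.
print(round(allreduce_bus_bw(1 << 30, 0.025, 8), 1))  # 75.2
```

During a burn-in, a sustained drop in busbw on a subset of node pairs is the signature of the flaky links the test is designed to catch.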
Question 54 of 60
54. Question
Before putting a new AI cluster into production, an engineer must perform a multifaceted node assessment using NVIDIA ClusterKit. The engineer is specifically looking for a way to validate that the InfiniBand fabric is providing the expected East-West bandwidth between nodes. Which benchmark or test is most commonly used for this purpose?
Correct
Option B: NCCL Tests: The NVIDIA Collective Communications Library (NCCL) is the primary engine for multi-GPU communication. The official nccl-tests suite (including all_reduce_perf, all_gather_perf, and sendrecv_perf) is the industry standard for measuring actual usable bandwidth and latency.
Fabric Validation: While ClusterKit is a multifaceted tool, it utilizes these specific NCCL benchmarks to verify that the InfiniBand or RoCE fabric is achieving its theoretical "line rate" (e.g., 400 Gb/s for NDR) during the collective operations used in real AI training.
NVIDIA ClusterKit: According to the NCP-AII blueprint, ClusterKit is used to perform a "multifaceted node assessment," but it specifically relies on NCCL to verify the performance of the high-speed network fabric across nodes.
Analysis of Incorrect Options
Option A: (Incorrect) The Error: This is a host-level CPU benchmark. While integer performance is important for general system health, it does not measure the high-speed interconnect between nodes. A node could have a perfect UnixBench score while having a completely failed or misconfigured InfiniBand link.
Option C: (Incorrect) The Error: Physical verification is a "Day 1" task (cabling/transceiver check), but it is not a benchmark. Matching labels and colors ensures the cables are plugged in, but it does not validate signal integrity, firmware compatibility, or actual data throughput. A cable can be visually correct but have a high Bit Error Rate (BER) or be running at a lower speed (e.g., 100 Gb/s instead of 400 Gb/s).
Option D: (Incorrect) The Error: This tests the North-South (management) network. Management networks (typically 1GbE or 10GbE) are for SSH and monitoring. They are not designed for the massive East-West traffic required for AI model gradients. Using a web server to download a text file is a functional test for internet access, not a performance validation for an AI factory fabric.
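As a sketch of how such validation might be automated, measured per-node bus bandwidth can be compared against the NDR line rate. The 90% threshold and the node names here are assumptions for illustration, not an NVIDIA-defined limit:

```python
# NDR line rate: 400 Gb/s = 50 GB/s.
LINE_RATE_GBPS = 400 / 8

def flag_slow_nodes(results: dict, threshold: float = 0.9):
    """results maps node name -> measured NCCL busbw in GB/s.
    Returns the nodes falling below threshold * line rate (assumed cutoff)."""
    floor = LINE_RATE_GBPS * threshold
    return sorted(node for node, bw in results.items() if bw < floor)

# Hypothetical measurements from a 3-node sweep:
print(flag_slow_nodes({"node01": 48.7, "node02": 31.2, "node03": 49.1}))
# ['node02']
```

A node flagged this way would then be checked for a degraded link speed or a high-BER cable, per Option C's discussion.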
Question 55 of 60
55. Question
A system administrator needs to optimize an NVIDIA BlueField network platform to handle intensive data movement for an AI cluster. Which configuration step is necessary to enable the DPU to perform offloaded hardware acceleration for InfiniBand or Ethernet traffic in a production environment?
Correct
Correct: A. Configure the DPU in DPU-Mode (rather than Separated-Mode) and ensure the correct DOCA runtime environment is provisioned to manage the acceleration engines.
This is correct because DPU Mode, also known as embedded CPU function ownership (ECPF) mode, is the default and required mode for BlueField DPUs to enable offloading and hardware acceleration.
In DPU Mode, "the NIC resources and functionality are owned and controlled by the embedded Arm subsystem," and "all network communication to the host flows through a virtual switch control plane hosted on the Arm cores."
The DOCA (Data Center-on-a-Chip Architecture) software framework is specifically designed to "offload infrastructure workloads from the host CPU" and "accelerate execution through the BlueField DPU."
DOCA provides "a runtime and development environment, including libraries and drivers for device management and programmability" that enables access to the DPU's hardware acceleration engines for networking, storage, and security.
The NCP-AII certification blueprint explicitly includes "Configure and manage a BlueField® network platform" and "NVIDIA DOCA driver installation and updates" as core tasks within the Physical Layer Management and Control Plane Installation domains.
With DOCA properly provisioned on the DPU in DPU Mode, administrators can leverage hardware accelerators for "network, cloud, storage, encryption, streaming, and time synchronization" functions.
Incorrect: B. Utilize the NVIDIA SMI tool to flash the BlueField firmware directly onto the HGX baseboard to unify the management of the network and compute layers.
This is incorrect because nvidia-smi is a tool for GPU management and monitoring, not for BlueField DPU firmware flashing. DPU firmware is managed through dedicated tools like mlxconfig, bfb-install, or the DOCA SDK, not through nvidia-smi. Additionally, firmware is flashed to the DPU itself, not onto the HGX baseboard.
C. Set the MIG profile to 1g.10gb on the BlueField DPU to ensure that the network traffic is partitioned into small, manageable virtual streams for the GPU.
This is incorrect because MIG (Multi-Instance GPU) is a GPU partitioning technology, not a DPU configuration feature. MIG profiles apply to NVIDIA GPUs (like H100 or A100) for partitioning GPU resources, not to BlueField DPUs for network traffic management. The DPU handles network acceleration through its own hardware engines, independent of GPU MIG configurations.
D. Disable the internal ARM cores on the BlueField DPU to allow the host CPU to take over the network steering logic for better AI workload synchronization.
This is incorrect because disabling the Arm cores would put the DPU into NIC Mode, where "the Arm cores of BlueField are inactive, and the device functions as an NVIDIA® ConnectX® network adapter". This would prevent any offloading of network tasks to the DPU, forcing all networking to be handled by the host CPU, which directly contradicts the goal of using the DPU for hardware acceleration.
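The active mode can be queried with mlxconfig via the INTERNAL_CPU_MODEL parameter, where EMBEDDED_CPU(1) corresponds to DPU mode. The sketch below parses such query output; the exact column layout shown is an illustrative assumption, as mlxconfig formatting varies by version:

```python
import re

def is_dpu_mode(mlxconfig_output: str) -> bool:
    """Return True if mlxconfig query output reports INTERNAL_CPU_MODEL
    with value 1 (embedded/DPU mode). Output layout is illustrative."""
    m = re.search(r"INTERNAL_CPU_MODEL\s+\S*\((\d)\)", mlxconfig_output)
    return bool(m) and m.group(1) == "1"

# Illustrative line as it might appear in `mlxconfig -d <device> q` output:
sample = "INTERNAL_CPU_MODEL                  EMBEDDED_CPU(1)\n"
print(is_dpu_mode(sample))  # True
```

A provisioning script could run this check per node before installing the DOCA runtime, failing fast on any DPU left in a non-embedded mode.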
Incorrect
Correct: A. Configure the DPU in DPU-Mode (rather than Separated-Mode) and ensure the correct DOCA runtime environment is provisioned to manage the acceleration engines.
This is correct because DPU Mode, also known as embedded CPU function ownership (ECPF) mode, is the default and required mode for BlueField DPUs to enable offloading and hardware acceleration .
In DPU Mode, “the NIC resources and functionality are owned and controlled by the embedded Arm subsystem,“ and “all network communication to the host flows through a virtual switch control plane hosted on the Arm cores“ .
Correct: A. Configure the DPU in DPU-Mode (rather than Separated-Mode) and ensure the correct DOCA runtime environment is provisioned to manage the acceleration engines.
This is correct because DPU Mode, also known as embedded CPU function ownership (ECPF) mode, is the default and required mode for BlueField DPUs to enable offloading and hardware acceleration.
In DPU Mode, “the NIC resources and functionality are owned and controlled by the embedded Arm subsystem,” and “all network communication to the host flows through a virtual switch control plane hosted on the Arm cores.”
The DOCA (Data Center Infrastructure-on-a-Chip Architecture) software framework is specifically designed to “offload infrastructure workloads from the host CPU” and “accelerate execution through the BlueField DPU.”
DOCA provides “a runtime and development environment, including libraries and drivers for device management and programmability” that enables access to the DPU's hardware acceleration engines for networking, storage, and security.
The NCP-AII certification blueprint explicitly includes “Configure and manage a BlueField® network platform” and “NVIDIA DOCA driver installation and updates” as core tasks within the Physical Layer Management and Control Plane Installation domains.
With DOCA properly provisioned on the DPU in DPU Mode, administrators can leverage hardware accelerators for “network, cloud, storage, encryption, streaming, and time synchronization” functions.
Incorrect: B. Utilize the NVIDIA SMI tool to flash the BlueField firmware directly onto the HGX baseboard to unify the management of the network and compute layers.
This is incorrect because NVIDIA SMI is a tool for GPU management and monitoring, not for BlueField DPU firmware flashing. DPU firmware is managed through dedicated tools such as mlxconfig, bfb-install, or the DOCA SDK, not through nvidia-smi. Additionally, firmware is flashed to the DPU itself, not onto the HGX baseboard.
C. Set the MIG profile to 1g.10gb on the BlueField DPU to ensure that the network traffic is partitioned into small, manageable virtual streams for the GPU.
This is incorrect because MIG (Multi-Instance GPU) is a GPU partitioning technology, not a DPU configuration feature. MIG profiles apply to NVIDIA GPUs (such as the H100 or A100) for partitioning GPU resources, not to BlueField DPUs for network traffic management. The DPU handles network acceleration through its own hardware engines, independent of GPU MIG configurations.
D. Disable the internal ARM cores on the BlueField DPU to allow the host CPU to take over the network steering logic for better AI workload synchronization.
This is incorrect because disabling the ARM cores would put the DPU into NIC Mode, where “the Arm cores of BlueField are inactive, and the device functions as an NVIDIA® ConnectX® network adapter.” This would prevent any offloading of network tasks to the DPU, forcing all networking to be handled by the host CPU, which directly contradicts the goal of using the DPU for hardware acceleration.
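The mode distinction above can be checked from the host before provisioning DOCA. A minimal sketch, assuming the `INTERNAL_CPU_MODEL` parameter reported by `mlxconfig`; the device path and the sample output string below are illustrative, not taken from a live system:

```shell
# Sketch: decide whether a BlueField card is in DPU (embedded CPU) mode by
# parsing `mlxconfig -d <device> query` output. Here a sample line stands in
# for the real query so the logic is self-contained.
sample_output='INTERNAL_CPU_MODEL                          EMBEDDED_CPU(1)'
mode=$(printf '%s\n' "$sample_output" | awk '/INTERNAL_CPU_MODEL/ {print $2}')

if [ "$mode" = "EMBEDDED_CPU(1)" ]; then
  echo "DPU mode: Arm subsystem owns the NIC; DOCA offload available"
else
  # Hypothetical remediation command; requires a firmware reset to take effect.
  echo "Not in DPU mode; consider: mlxconfig -d <device> set INTERNAL_CPU_MODEL=1"
fi
```

On a real node, the `sample_output` assignment would be replaced with the actual `mlxconfig` query, and a power cycle is generally needed after changing the setting.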
Question 56 of 60
56. Question
A system administrator receives an alert regarding a GPU hardware fault on a node in the AI factory. The ‘nvidia-smi’ output shows ‘Unknown Error’ for GPU 0, and the system logs report a PCIe AER (Advanced Error Reporting) fatal error. After attempting a software reset without success, what is the next step in the troubleshooting and optimization process?
Correct: Option B. The Logic: A PCIe AER (Advanced Error Reporting) fatal error indicates an uncorrectable hardware event in the transaction layer. Once a software reset (nvidia-smi -r) fails to recover the device, it signifies that the GPU's internal PCIe controller or core logic is no longer communicating with the host.
FRU (Field Replaceable Unit): In the NVIDIA service model (especially for DGX and HGX systems), components like individual GPUs, fans, and power supplies are classified as FRUs. The standard operating procedure is to identify the specific faulty module (GPU 0 in this case), power down the system to prevent further electrical issues, and replace the hardware using the official service manual's anti-static and torque specifications.
Analysis of Incorrect Options
Option A: (Incorrect) The Error: While cooling is vital for performance, it cannot repair a fatal PCIe link failure. AER fatal errors are electrical or logical disconnects at the hardware level. Increasing fan speed is a preventative measure for thermal throttling, but it is not a curative measure for a GPU that is no longer detected by the PCIe root complex.
Option C: (Incorrect) The Error: Modifying a scheduler (like Slurm) to ignore a “Fatal” hardware error is highly dangerous. A GPU in an “Unknown Error” state can cause system-wide instability, kernel panics, or corrupted data results for other users on the same node. NVIDIA best practices require a faulty node to be drained and cordoned until the hardware is repaired.
Option D: (Incorrect) The Error: This is “troubleshooting by superstition.” Since the error is reported as a PCIe AER fatal error in the system logs (dmesg/journalctl), the issue exists at the hardware/firmware interface. Reinstalling drivers will not fix a physical communication break. If the OS cannot see the device's PCIe registers, the driver will simply fail to load again.
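The triage logic can be sketched as a small log check: a fatal (uncorrected) AER event after a failed reset points at the FRU path, while correctable events only warrant monitoring. The kernel log line below is a representative sample, not output from a real fault:

```shell
# Sketch: classify a PCIe AER event from kernel logs. On a live node, the
# sample string would come from `dmesg` or `journalctl -k` instead.
sample_log='pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:01:00.0'

if printf '%s\n' "$sample_log" | grep -q 'AER: Uncorrected (Fatal)'; then
  # Software reset already failed, so the next step is hardware service.
  action="drain-and-replace-FRU"
else
  # Correctable errors are retried in hardware; log and watch the error rate.
  action="monitor"
fi
echo "$action"
```

The point of the sketch is the decision split, not the exact log format, which varies between kernel versions.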
Question 57 of 60
57. Question
A technician identifies that a fan in an NVIDIA-Certified System has failed, causing one of the GPUs to overheat. What is the appropriate procedure for replacing this faulty component in a typical high-performance AI server?
Correct: D. Shut down the system, follow proper ESD procedures, replace the faulty fan module with an identical spare, and verify the repair by checking the fan speed in the BMC.
This is correct because the NCP-AII certification blueprint explicitly includes “Identify and troubleshoot hardware faults (e.g., GPU, fan, network card)” and “Replace faulty cards, GPUs, and power supplies” as core tasks within the Troubleshoot and Optimize domain, which comprises 12% of the examination.
The practice test materials specifically confirm that during system bring-up, “physical seating/cabling issues are common” and that proper procedure involves validating connections and hardware before making software changes.
The procedure described follows standard hardware replacement methodology for high-performance AI servers:
Shut down the system: Required for safe component replacement in systems whose fan modules are not designed to be hot-swapped, particularly when a GPU is already overheating
Follow proper ESD procedures: Electrostatic discharge protection is critical when handling server components to prevent damage
Replace with identical spare: Using identical, validated spare parts ensures compatibility and maintains system thermal specifications
Verify the repair by checking fan speed in the BMC: The Baseboard Management Controller provides monitoring of fan RPM and system health, confirming successful repair
After replacement, proper airflow and cooling are restored, preventing GPU thermal throttling and ensuring the system meets performance requirements for AI workloads.
Incorrect: A. Remove all other working fans to ensure that the air pressure remains balanced across the entire motherboard, preventing any turbulent airflow.
This is incorrect because removing working fans would severely degrade cooling capacity and likely cause immediate overheating of all components. Air pressure balance is achieved through proper fan configuration, not by removing functional cooling units.
B. Pour cold water into the server to lower the temperature while waiting for a replacement fan to arrive from the manufacturer next week.
This is incorrect because introducing liquids into server hardware would cause catastrophic short circuits and permanent damage to all electronic components. This action demonstrates a fundamental misunderstanding of data center safety practices.
C. Keep the system running to avoid downtime, open the chassis, and manually hold the new fan in place with adhesive tape until the next scheduled maintenance.
This is incorrect because operating a server with an open chassis and improperly secured components violates safety protocols and creates risk of electrical shock, component damage, and fire hazard. Adhesive tape cannot provide reliable, long-term mechanical or electrical connections. Proper maintenance requires scheduled downtime for safe component replacement.
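The final verification step (checking fan speed in the BMC) can be sketched by parsing sensor data of the kind `ipmitool sensor` reports. The sensor name, field layout, and RPM threshold below are assumptions for illustration:

```shell
# Sketch: confirm a replaced fan is spinning and healthy from BMC sensor data.
# The sample line imitates one row of `ipmitool sensor` output:
#   name | reading | unit | status
sample='FAN1_SPEED       | 8400.000   | RPM  | ok'

rpm=$(printf '%s\n' "$sample" | awk -F'|' '{gsub(/ /,"",$2); print int($2)}')
status=$(printf '%s\n' "$sample" | awk -F'|' '{gsub(/ /,"",$4); print $4}')

# A plausible sanity floor; real thresholds come from the platform's sensor table.
if [ "$status" = "ok" ] && [ "$rpm" -gt 1000 ]; then
  echo "fan healthy: ${rpm} RPM"
else
  echo "fan still faulty or below threshold: ${rpm} RPM (${status})"
fi
```

On a real server this would iterate over all fan sensors rather than a single sample line, and the lower RPM bound would be taken from the vendor's thermal specification.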
Question 58 of 60
58. Question
To verify the health of the InfiniBand fabric, an engineer runs an NCCL (NVIDIA Collective Communications Library) test. The results show high bandwidth for intra-node communication but very low bandwidth for inter-node communication. Which troubleshooting step should be prioritized?
Correct: Option D. The Logic: If intra-node (GPU-to-GPU within the same server) bandwidth is high, the internal NVLink/PCIe paths and local software are functional. If inter-node (GPU-on-Node-A to GPU-on-Node-B) bandwidth is low, the bottleneck is external to the server.
Prioritizing the Physical Layer: NCP-AII guidelines follow a “bottom-up” troubleshooting model. The most common causes of poor inter-node performance at 400G (NDR) or 200G (HDR) speeds are physical:
Signal Quality: Marginal cable runs or high Bit Error Rates (BER) can cause packet drops.
Transceiver Seating: An incorrectly seated or dirty optical transceiver can cause a link to negotiate at a lower speed or exhibit high latency.
Cabling: Checking for exceeded bend radii or cable damage is a standard validation step.
Analysis of Incorrect Options
Option A: (Incorrect) The Error: MIG (Multi-Instance GPU) is used to partition a single GPU for multi-tenancy. Increasing the number of instances divides a single GPU's resources; it does not provide “parallel paths” for the network fabric. In fact, MIG is typically irrelevant to raw inter-node bandwidth troubleshooting, which is handled at the NIC (Network Interface Card) and fabric level.
Option B: (Incorrect) The Error: The OOB (Out-of-Band) management network (typically 1GbE) and the high-speed data/compute network (InfiniBand/400GbE) operate on entirely separate physical ports and air-gapped networks. They do not “interfere” with each other. Disabling management would only make it harder to monitor the system via the BMC.
Option C: (Incorrect) The Error: The NVIDIA Container Toolkit manages how containers access GPU hardware. While a misconfiguration here could prevent a test from running, it would not cause a “high intra-node / low inter-node” performance split. If the test is successfully producing intra-node results, the toolkit and libraries are already correctly symlinked and functional.
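A first physical-layer check is to confirm each HCA port actually negotiated the expected rate, since a reseated or marginal transceiver often links up at a lower speed. The sketch below works on port/rate pairs of the kind `ibstat` reports; the port names, rates, and the 400 Gb/s expectation are sample assumptions:

```shell
# Sketch: flag InfiniBand ports that negotiated below the expected NDR rate.
# The two sample lines stand in for parsed `ibstat` output ("port rate_Gbps").
want_rate=400
sample_rates='mlx5_0 400
mlx5_1 100'

degraded=$(printf '%s\n' "$sample_rates" | awk -v want="$want_rate" '$2 < want {print $1}')
echo "degraded ports: ${degraded:-none}"
```

Any port flagged this way is a candidate for reseating or swapping the cable/transceiver and re-checking the error counters (e.g., with mlxlink or ibdiagnet) before suspecting software.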
Question 59 of 60
59. Question
A technician identifies a faulty GPU in an HGX system that is causing the entire node to hang during boot. After replacing the physical GPU and confirming it is seated correctly, what is the next logical step to restore the node to the cluster in a fully optimized state?
Correct: Option B. Firmware Matching: In high-density HGX and DGX systems, all GPUs on the baseboard must run the same firmware version to ensure the NVLink fabric and thermal management policies operate predictably. If a replacement GPU has a different firmware version than the existing seven, it can lead to erratic performance or system crashes.
Persistence Mode (nvidia-smi -pm 1): In headless Linux/HPC environments, the NVIDIA driver normally de-initializes the GPU when no application is using it. This causes a delay (ECC scrubbing) and potential loss of state when a new job starts. Enabling Persistence Mode keeps the driver loaded and the GPU initialized, which is a mandatory optimization for AI clusters to ensure low job-start latency and consistent hardware state.
Analysis of Incorrect Options
Option A: (Incorrect) The Error: DOCA (Data Center Infrastructure-on-a-Chip Architecture) is the stack for the BlueField DPU, not the GPU. While the DPU and GPU work together, installing a different DOCA version on a single node to fix a GPU issue is a logical mismatch. Furthermore, “version-drifting” a single node in a cluster violates the principle of homogeneity required for stable AI infrastructure.
Option C: (Incorrect) The Error: NVIDIA-certified GPUs undergo rigorous factory burn-in. There is no requirement or professional standard that mandates 48 hours of “crypto-mining” for a replacement GPU. In fact, running unauthorized software like mining scripts in a production AI factory would typically be a violation of security and usage policies.
Option D: (Incorrect) The Error: GPU hardware faults do not “corrupt” the storage array in a way that requires a full reformat. AI training data is typically stored on shared storage (Lustre, Weka, or NFS) or dedicated local NVMe drives. Reformatting the entire array is an extreme, unnecessary measure that results in massive data loss and downtime.
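The firmware-matching check can be sketched by comparing the VBIOS version each GPU reports. The CSV below is a fabricated sample standing in for `nvidia-smi --query-gpu=index,vbios_version --format=csv,noheader`; the version strings are illustrative:

```shell
# Sketch: verify all GPUs on the baseboard report an identical VBIOS version
# after a swap. Sample data mimics nvidia-smi CSV output ("index, vbios").
sample='0, 96.00.74.00.1A
1, 96.00.74.00.1A
2, 96.00.61.00.0C'

unique=$(printf '%s\n' "$sample" | awk -F', ' '{print $2}' | sort -u | wc -l)

if [ "$unique" -eq 1 ]; then
  # All versions match; safe to re-enable persistence mode, e.g. nvidia-smi -pm 1
  echo "firmware uniform across GPUs"
else
  echo "firmware mismatch: align versions before returning the node to the cluster"
fi
```

In the sample, GPU 2 plays the role of a replacement with stale firmware, so the check reports a mismatch; on a healthy node the count of unique versions would be 1 and persistence mode could then be enabled.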
Question 60 of 60
60. Question
When updating NVIDIA GPU drivers on a production cluster managed by Base Command Manager, what is the recommended procedure to ensure the new drivers are correctly applied to all compute nodes without causing job failures or system inconsistency?
Correct: A. Update the software image or category in BCM, then use the node update command to synchronize the nodes after draining them.
This is correct because the NCP-AII certification blueprint explicitly includes managing software images and node updates within the Control Plane Installation and Configuration domain, which comprises 19% of the examination.
BCM (Base Command Manager) uses a centralized image-based management model where node software is defined through software images and categories.
The official BCM documentation for upgrading BaseOS components confirms that before installing new software or upgrading existing components, “you should update the system image with the latest versions.”
The recommended workflow follows a systematic process:
First, modify the software image (by cloning or directly updating) to include the new GPU drivers
Second, apply the updated image to nodes or categories in BCM
Third, drain nodes to prevent job interruptions during the update
Finally, provision nodes with the new image using BCM‘s node update commands
The VAST Data installation guide demonstrates this exact pattern: clone the software image, make modifications, assign the new image to nodes/categories, and provision the nodes .
This approach ensures all nodes receive identical, validated driver versions while maintaining cluster consistency and avoiding job failures through proper node draining .
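The workflow above can be sketched as an illustrative cmsh session. This is a hedged outline only: the image name `gpu-image`, the source image `default-image`, and the category name `default` are placeholders, and exact command options vary by BCM version, so verify each step against the BCM administrator manual for your release before running it.

```shell
# Illustrative BCM workflow sketch -- names and options are assumptions,
# not a verified procedure for any specific BCM version.

# 1. Clone the current software image so the running image stays untouched
cmsh -c "softwareimage; clone default-image gpu-image; commit"

# 2. Chroot into the cloned image and install the new GPU driver there
cm-chroot-sw-img /cm/images/gpu-image
# ... inside the chroot: install the driver package, then exit ...

# 3. Point the compute-node category at the updated image
cmsh -c "category use default; set softwareimage gpu-image; commit"

# 4. Drain the nodes so running jobs finish before any change is applied

# 5. Once drained, synchronize the nodes with the updated image
#    (imageupdate pushes image changes to running nodes)
cmsh -c "device; imageupdate -c default -w"

# 6. Undrain the nodes to return them to service
```

The key design point is that the driver change is made once, in the image, and then propagated identically to every node, rather than being installed node by node.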
Incorrect: B. Directly run the .run installer on the head node and let it propagate via SSH to all active nodes automatically.
This is incorrect because BCM does not automatically propagate driver installations from the head node to compute nodes via SSH. BCM uses a centralized, image-based management system in which changes are made to software images, which are then deployed to nodes during provisioning. Direct SSH propagation would bypass BCM’s configuration management and could leave the cluster in an inconsistent state.
Incorrect: C. Uninstall the current drivers using apt-get purge while the GPUs are under 100% load to test driver hot-swapping.
This is incorrect because GPU drivers cannot be hot-swapped while the GPUs are under load. Attempting to purge the drivers during active GPU workloads would cause immediate job failures and potential system instability. Proper procedure requires draining the nodes to remove workloads before updating drivers.
Incorrect: D. Use the NGC CLI to push the new driver as a container image that runs in the background of every user job.
This is incorrect because the NGC CLI is used for downloading containers and managing NGC resources, not for pushing driver updates to nodes. GPU drivers are kernel-level components that must be installed on the host OS, not run as containerized background processes alongside user jobs. Drivers cannot be “pushed” via the NGC CLI or executed within user job containers.