NVIDIA NCP-AII Practice Test 6
Question 1 of 60
A system administrator is configuring a cluster where specific nodes require high-throughput storage access for large datasets. They decide to use BlueField DPUs to implement NVMe-over-Fabrics (NVMe-oF) storage acceleration. Which step is essential for configuring the BlueField network platform to support this specific offload capability?
Correct:
B. Configure the DPU in Embedded Function mode and use DOCA drivers to expose virtual NVMe controllers to the host OS. For a BlueField DPU to perform storage offloading (specifically using NVIDIA DOCA SNAP), it must be in Embedded Function Mode (also known as DPU Mode). In this state, the DPU's ARM cores run their own OS and management stack. Using the DOCA framework, the DPU emulates a local physical NVMe drive on the host's PCIe bus. The host OS "sees" a standard NVMe controller and interacts with it using native drivers, while the DPU transparently handles the complex NVMe-over-Fabrics translation and network data movement in the background.
Incorrect:
A. Disable the SNAP (Storage, Network, and Analytics Performance) service on the DPU to allow the host to handle storage interrupts. NVIDIA SNAP is the very technology required to enable storage acceleration. Disabling it would prevent the DPU from intercepting and accelerating storage traffic. The goal of the DPU is to offload these interrupts from the host CPU, not to pass them back.
C. Connect the DPU to the BMC via a serial cable to allow the BMC to manage the NVMe flash translation layer. The Baseboard Management Controller (BMC) is for out-of-band server management (power, thermals, firmware updates) and does not have the computational power or architectural path to manage high-speed NVMe flash translation or storage protocols. These tasks are handled by the DPU's ARM cores and specialized hardware accelerators.
D. Install the CUDA Toolkit directly on the BlueField ARM cores to process storage encryption using the integrated Tensor Cores. While BlueField DPUs have ARM cores and hardware acceleration engines, they do not contain Tensor Cores (which are specific to NVIDIA GPUs). Storage encryption on a BlueField DPU is typically handled by dedicated hardware crypto-engines via DOCA libraries, not by running CUDA kernels on the ARM cores.
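For reference, a minimal sketch of how the DPU mode described above can be checked and set from the host with the NVIDIA MFT tools (the MST device path below is an assumption and varies by BlueField model and PCIe slot):

# Start the Mellanox Software Tools service and query the current configuration
mst start
mlxconfig -d /dev/mst/mt41692_pciconf0 q
# Look for INTERNAL_CPU_MODEL in the output; a value of 1 selects Embedded Function (DPU) mode
mlxconfig -d /dev/mst/mt41692_pciconf0 s INTERNAL_CPU_MODEL=1
# A host power cycle is required for the mode change to take effect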
Question 2 of 60
During the physical installation of GPU-based servers, a technician must validate that the cooling parameters meet the requirements for NVIDIA H100 GPUs. If the BMC reports that the GPU inlet temperature is nearing the thermal throttle limit despite low ambient room temperatures, what is the most likely physical configuration error within the server rack?
Correct: A. The server is missing blanking panels in the rack, causing hot air recirculation from the hot aisle back into the cold aisle. The NCP-AII curriculum emphasizes "Data Center Hygiene" as a prerequisite for stable AI infrastructure. Blanking panels are essential for maintaining the Hot-Aisle/Cold-Aisle containment model. Without them, the high-pressure hot air exhausted from the rear of the servers can leak back through empty rack spaces into the front (cold) aisle. This "recirculation" increases the temperature of the air entering the GPU inlets. Even if the room's ambient temperature is low, the localized air at the server face becomes hot enough to trigger the H100's thermal protection mechanisms, leading to reduced clock speeds (throttling).
Incorrect: B. The TPM is not initialized correctly, which prevents the motherboard fans from reaching their maximum RPM setpoints during high workloads. The Trusted Platform Module (TPM) is a security chip used for hardware-based root of trust and encryption. It has no functional link to the server's thermal management system or the Pulse Width Modulation (PWM) signals that control fan speed. Fan profiles are typically managed by the BMC (Baseboard Management Controller) based on temperature sensors, not security module states.
C. The storage array is connected via SAS instead of NVMe, significantly increasing the heat density of the server chassis and blocking airflow. While the type of storage affects data throughput, it is not a primary driver of the "thermal throttle limit" for GPUs in an NVIDIA-Certified System. Furthermore, modern NVMe drives often generate more heat than traditional SAS drives due to their higher performance. Regardless, the choice of storage interface does not fundamentally block the massive airflow required by H100 GPUs unless the physical cabling is so poorly managed that it violates the server's internal airflow design.
D. The GPU-based servers are configured with the wrong IP addresses in the OOB management network preventing proper fan speed control. The Out-of-Band (OOB) management network (BMC) handles fan speed control internally via its own firmware and onboard sensors. While a wrong IP address would prevent the administrator from remotely logging into the BMC to view logs or change settings, it does not stop the BMC from automatically increasing fan speeds when it detects rising GPU temperatures. The fan control logic is autonomous to the server and does not rely on network connectivity.
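As a quick check of the symptom described above, standard in-band and out-of-band queries can confirm inlet-driven throttling (a minimal sketch; the query fields are standard nvidia-smi and ipmitool options, and no site-specific values are assumed):

# Report GPU temperature and whether a hardware/software thermal slowdown is currently active
nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.sw_thermal_slowdown --format=csv
# Cross-check the chassis inlet temperature sensors via the BMC
ipmitool sdr type Temperature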
Question 3 of 60
An administrator is configuring the Trusted Platform Module (TPM) and Out-of-Band (OOB) management for a new cluster of NVIDIA-Certified servers. What is the primary security benefit of enabling and initializing the TPM 2.0 module during the system bring-up phase, and how does it relate to the integrity of the AI infrastructure software stack?
Correct: A. The TPM provides a hardware root of trust that allows the system to perform measured boots, ensuring that the bootloader and OS kernel have not been tampered with before the NVIDIA drivers are loaded. The NCP-AII curriculum identifies TPM 2.0 as the cornerstone of "Measured Boot" and "Secure Boot" processes. By initializing the TPM, the system creates a cryptographic log of each component in the boot chain, from the UEFI firmware to the bootloader and the Linux kernel. If any of these components are altered by unauthorized parties, the measurements will not match the expected values, and the system can be prevented from booting or accessing sensitive keys. This integrity check ensures that the NVIDIA GPU drivers and the DOCA framework are running on a "clean" and verified foundation, protecting the AI stack from rootkits and persistent threats.
Incorrect: B. Enabling the TPM automatically encrypts all data stored on third-party storage arrays using the BMC management network for key exchange. This is incorrect because the TPM is a local security module for the host server. While a TPM can be used to store local disk encryption keys (like LUKS), it does not automatically manage or encrypt external third-party storage arrays (such as DDN or NetApp). Furthermore, BMC networks are typically used for management and telemetry, not as the primary medium for high-scale storage encryption key exchanges in a production AI environment.
C. The TPM acts as a high-speed cache for GPU kernels, allowing the NVIDIA SMI tool to store frequently used math functions in a secure hardware enclave. The TPM is a low-speed cryptographic processor designed for security operations like RSA/ECC key generation and digital signatures. It is not designed for data throughput and has no role in caching GPU kernels or math functions. nvidia-smi is a management tool, and GPU kernels are cached in the GPU's own memory (VRAM) or the system RAM, not in a security module.
D. The TPM is required to bypass the license check for NVIDIA AI Enterprise software when the node is operating in a disconnected or air-gapped environment. The TPM is an integrity and security tool, not a licensing bypass mechanism. NVIDIA AI Enterprise licensing in air-gapped environments is typically managed through a CLS (Cloud License Service) or DLS (Delegated License Service) instance hosted locally on the network. The TPM does not grant the administrator the ability to circumvent these software licensing requirements.
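To illustrate the measured-boot chain described above, the boot measurements held by the TPM can be inspected from Linux (a minimal sketch assuming the open-source tpm2-tools package is installed on the node):

# Confirm the TPM 2.0 character devices are exposed to the OS
ls -l /dev/tpm0 /dev/tpmrm0
# Read the Platform Configuration Registers that record firmware, bootloader, and Secure Boot measurements
tpm2_pcrread sha256:0,1,2,3,4,7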
Question 4 of 60
A Linux administrator is installing the NVIDIA Container Toolkit on a fresh Ubuntu installation to support Docker-based AI workloads. After installing the package, what is the mandatory next step to ensure the Docker daemon can utilize the NVIDIA GPU runtime?
Correct: C. Edit the /etc/docker/daemon.json file to set the 'default-runtime' to 'nvidia' and restart the Docker service. The NCP-AII curriculum specifies that simply installing the nvidia-container-toolkit package is insufficient. The Docker engine must be explicitly instructed to use the NVIDIA Container Runtime (a thin wrapper around runc). By modifying the daemon.json configuration file, the administrator registers the nvidia runtime. Setting it as the default-runtime ensures that any container started by the daemon has access to the GPU libraries and binaries (like nvidia-smi) without needing to manually pass the --runtime=nvidia flag every time. A restart of the Docker service is mandatory for these configuration changes to take effect.
Incorrect: A. Recompile the Linux kernel with the CUDA_SUPPORT=yes flag and reboot the machine. The NVIDIA driver and Container Toolkit are designed to work with standard, distribution-provided Linux kernels (such as the generic Ubuntu kernel). There is no "CUDA_SUPPORT" flag in the Linux kernel source, and recompiling the kernel is not part of the NVIDIA-Certified installation workflow. The NCP-AII exam focuses on the installation of the NVIDIA Driver kernel modules, which are built against the existing kernel, rather than replacing the kernel itself.
B. Run the 'nvidia-smi --factory-reset' command to clear the GPU state for container consumption. The factory-reset command is a troubleshooting tool used to revert GPU settings (like power limits or clock offsets) to their original state. It has no functional role in the "Software Stack" installation process and does not enable Docker to communicate with the GPU. Clearing the GPU state does not solve the configuration requirement between the Docker daemon and the NVIDIA runtime.
D. Install the DOCA SDK on the host and map the GPU via a virtual PCIe switch to the Docker container. This option confuses DPU (Data Processing Unit) management with GPU containerization. The DOCA SDK is used for programming BlueField DPUs. While DPUs can use PCIe switches to manage traffic, standard Docker-based AI workloads on a host GPU do not require "virtual PCIe mapping." They rely on the NVIDIA Container Runtime to mount the necessary character devices (like /dev/nvidia0) into the container's namespace.
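A minimal sketch of the daemon.json change described above (the heredoc simply overwrites the file for illustration; on a host with existing Docker settings the runtime entry should be merged in instead, for example with the toolkit's nvidia-ctk runtime configure --runtime=docker helper):

sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
# Restart the daemon so the new runtime configuration takes effect
sudo systemctl restart docker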
Question 5 of 60
When troubleshooting storage performance for an AI factory, an administrator notices that the GPU utilization is low during training and the iowait metric on the compute nodes is high. What is the most effective optimization to resolve this storage bottleneck?
Correct: B. Implement NVIDIA GPUDirect Storage (GDS) to enable a direct data path between the storage and the GPU memory, bypassing the CPU. In the standard I/O path, data must be copied from storage into a "bounce buffer" in CPU system memory before being copied again to the GPU. This consumes CPU cycles and increases latency, leading to the iowait spikes seen in the scenario. GPUDirect Storage (GDS) is the definitive optimization taught in the NCP-AII track for this problem. It uses Direct Memory Access (DMA) to move data directly from the storage interface (like NVMe or NVMe-oF) to the GPU memory. This bypasses the CPU completely, drastically reducing iowait, lowering latency, and allowing the GPUs to reach maximum utilization.
Incorrect: A. Change the training algorithm from a parallel approach to a sequential approach to reduce the number of simultaneous read requests. Switching to a sequential approach is antithetical to AI infrastructure goals. Parallelism is the fundamental strength of NVIDIA GPUs. Reducing the number of simultaneous requests might lower the iowait metric, but it would do so by making the training process significantly slower, which is a failure of optimization in an AI Factory context.
C. Reduce the resolution of the training images so that the storage system has less data to read from the disks during each epoch. While reducing data volume technically lessens the load on storage, it is a workload compromise, not an infrastructure optimization. In the NCP-AII framework, the goal is to build an infrastructure capable of handling the researcher's requirements. Changing the science (reducing image resolution) to fit a poorly configured system is not the correct administrative response to a hardware/software bottleneck.
D. Add more GPUs to each node to increase the total amount of compute power available to process the slow-moving data. Adding more GPUs to a system already suffering from a storage bottleneck will actually worsen the problem. More GPUs create even more demand for data, which would increase the pressure on the already struggling storage path and CPU. The NCP-AII curriculum teaches that you must solve the "starvation" issue at the source (I/O) before scaling compute.
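Once GDS is installed, the direct path described above can be validated with the tools that ship with it (a minimal sketch; the tool path assumes a default CUDA toolkit layout with the GDS components installed):

# Report whether the cuFile/GDS stack is loaded and which mounted filesystems support a direct data path
/usr/local/cuda/gds/tools/gdscheck -p
# The gdsio utility in the same directory can then benchmark storage-to-GPU-memory throughput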
Question 6 of 60
A 'burn-in' test is being conducted on a new AI cluster using NVIDIA NeMo. Why is a model-specific burn-in test like NeMo preferred over a simple synthetic stress test when validating an AI factory for production use?
Correct:
D. It stresses the specific communication patterns and memory access behaviors typical of real-world Large Language Model (LLM) training workloads. An AI Factory designed for production must be validated against the actual workloads it will run. Synthetic stress tests often only push power consumption or local GPU compute. A NeMo-based burn-in utilizes the NVIDIA Collective Communications Library (NCCL) to perform "All-Reduce" and "All-to-All" operations. This stresses the NVLink and InfiniBand/Ethernet fabrics, GPUDirect RDMA, and high-bandwidth memory (HBM) in ways that synthetic tests cannot, ensuring the cluster is stable for distributed LLM training.
Incorrect:
A. It automatically repairs any physical layer cable faults by using the BlueField-3's ARM cores to re-route traffic through the NVLink fabric. This is technically inaccurate. While BlueField-3 DPUs have ARM cores for offloading infrastructure tasks, they do not "repair" physical cable faults. Furthermore, NVLink and the external cluster fabric (InfiniBand/Ethernet) are distinct; traffic cannot simply be re-routed from one to the other to bypass a broken physical cable.
B. It is required to activate the permanent hardware warranty on the H100 GPUs by registering the burn-in results with the NVIDIA SMI registry. Hardware warranties are associated with the purchase and registration of the physical units, not the performance of a specific software burn-in test. While nvidia-smi is used to monitor health, there is no requirement to "register results" with a registry to activate a warranty.
C. It is the only way to verify that the BMC can successfully communicate with the NVIDIA GPU Cloud to download the latest firmware updates. The Baseboard Management Controller (BMC) handles out-of-band management and firmware updates independently of high-level AI frameworks like NeMo. Verifying BMC connectivity is a basic networking step that does not require a model-specific stress test.
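The collective patterns mentioned above can also be exercised in isolation with the open-source nccl-tests suite as a complement to a NeMo burn-in (a minimal sketch; the two hostnames are placeholders, and CUDA, NCCL, and MPI are assumed to be installed on the nodes):

# Build the NCCL performance benchmarks
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1
# Run an all-reduce sweep across 2 nodes x 8 GPUs, from 8 bytes up to 8 GB per message
mpirun -np 16 -H node01:8,node02:8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1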
Question 7 of 60
A system administrator is using NVIDIA Base Command Manager (BCM) to deploy an OS image across a new cluster of 64 nodes. The administrator needs to ensure that the Slurm scheduler is properly integrated and that the Enroot and Pyxis plugins are installed. What is the specific function of the Pyxis plugin in this AI infrastructure environment?
Correct: D. It enables Slurm to launch containerized workloads using Enroot. In a high-performance AI cluster managed by NVIDIA Base Command Manager (BCM), containers are the standard for reproducibility. Enroot is NVIDIA's tool for turning container images into unprivileged sandboxes, while Pyxis is the specific Slurm SPANK plugin that allows users to run these containers directly via Slurm commands (e.g., srun --container-image=…). Without Pyxis, Slurm would not have the native awareness to invoke Enroot to set up the container environment for a job.
Incorrect: A. It acts as a distributed file system for storing large datasets. Pyxis is a scheduler plugin, not a storage solution. In an NVIDIA AI Factory, distributed file systems are typically handled by technologies like Lustre, IBM Spectrum Scale (GPFS), or WekaIO, which provide the high-throughput data access required for training.
B. It provides a graphical user interface for monitoring GPU temperatures. Monitoring and telemetry in an NVIDIA cluster are handled by the NVIDIA Data Center GPU Manager (DCGM) and visualized through tools like Grafana or the Base Command Manager dashboard. Pyxis operates at the command-line/scheduler level and does not provide a GUI.
C. It manages the power cycling of the GPU nodes via the BMC. Power management and hardware orchestration are functions of the Base Command Manager (BCM) itself, communicating with the Baseboard Management Controller (BMC) via protocols like IPMI or Redfish. Pyxis is strictly focused on container integration within the job scheduler.
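For illustration, a minimal sketch of a Pyxis-enabled job submission (the container image tag, mount path, and training script name are placeholders for the example):

# Pyxis adds the --container-* flags to srun; Enroot imports and runs the image unprivileged
srun --nodes=2 --ntasks-per-node=8 --gpus-per-node=8 \
     --container-image=nvcr.io/nvidia/pytorch:24.01-py3 \
     --container-mounts=/data:/data \
     python train.py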
Question 8 of 60
An administrator is performing a NeMo burn-in test on a newly configured cluster. During the test, several nodes reboot spontaneously. After checking the logs, the administrator finds 'Power Supply Input Lost' and 'Critical Over Temperature' events. What is the primary purpose of the burn-in test in this context, and what does this failure indicate?
Correct: B. The test is designed to stress the physical infrastructure; the failure indicates that the data center's power and cooling capacity cannot handle the peak load of the cluster. A NeMo burn-in test is a "real-world" stress test. Unlike idle or low-utilization states, training Large Language Models (LLMs) pushes H100/H200 GPUs to their peak Thermal Design Power (TDP). If a cluster experiences "Power Supply Input Lost" or "Critical Over Temperature" during this test, it confirms that the data center's electrical circuits are tripping under load or the CRAC (Computer Room Air Conditioning) units cannot dissipate the heat fast enough. This is exactly what a burn-in is meant to uncover before the system enters production.
Incorrect: A. The test is used to check the speed of the Linux boot process; the failure means the OS is rebooting too slowly. NeMo is an AI framework for model training, not a boot-time benchmarking tool. A "Power Supply Input Lost" event is a hardware-level failure, not a software performance metric related to the Linux kernel or systemd boot sequences.
C. The test is intended to verify the Slurm license; the failure indicates that the license has expired and forced a system shutdown. Slurm is an open-source workload manager (though commercially supported distributions exist). However, a license expiration would typically result in job submission failures or service start errors, not physical hardware events like "Critical Over Temperature" or power loss.
D. The test is meant to train a chatbot; the failure indicates that the chatbot is too complex for the current GPU memory. While NeMo can be used to train chatbots, an "Out of Memory" (OOM) error due to model complexity would result in a software crash (a CUDA out-of-memory error), not a spontaneous node reboot with power and thermal hardware logs.
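The events described above are recorded in the BMC's System Event Log and can be pulled over the OOB network for root-cause analysis (a minimal sketch; the BMC address and credentials are placeholders):

# List the System Event Log entries, including PSU and thermal events, out-of-band
ipmitool -I lanplus -H 10.0.0.50 -U admin -P '<password>' sel elist
# Check live PSU and temperature sensor readings on the affected node
ipmitool -I lanplus -H 10.0.0.50 -U admin -P '<password>' sdr elist | grep -Ei 'PSU|Temp'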
Question 9 of 60
To enable containerized AI workloads on a freshly installed cluster, an engineer must install the NVIDIA Container Toolkit. Which command-line utility is used after the installation to verify that the Docker daemon can successfully communicate with the NVIDIA driver and that the GPUs are visible within a container environment?
Correct:
B. docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.0-base nvidia-smi This command performs a full end-to-end validation. It pulls a lightweight CUDA base image, starts a container using the NVIDIA Container Runtime, passes all available GPUs into that container, and executes nvidia-smi from within the isolated environment. If the output displays the GPU table, it proves that the NVIDIA Container Toolkit has correctly mapped the kernel drivers into the container namespace.
Incorrect:
A. systemctl status doca-agent The doca-agent is related to NVIDIA DOCA (Data Center Infrastructure-on-a-Chip Architecture), which is used for programming BlueField DPUs. While important for network and storage offloading, checking its status does not verify if the Docker daemon can access GPUs for AI workloads.
C. slurm scontrol show nodes This is a Slurm command used to view the state and configuration of nodes within a cluster (e.g., CPU count, memory, state). While it might show "GRES" (Generic Resources) if configured, it only shows what the scheduler thinks is there; it does not test the actual functional path between Docker and the GPU hardware.
D. ngc config set This command is used to configure the NVIDIA GPU Cloud (NGC) CLI tool with API keys and organization settings. It handles authentication for downloading models and containers but has no role in verifying the local hardware-to-container communication path.
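Before running the end-to-end container test above, it can also be confirmed that the runtime was registered with the daemon at all (a minimal sketch using standard Docker commands):

# The daemon should list "nvidia" among its runtimes (and as Default Runtime if configured that way)
docker info | grep -i runtime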
Question 10 of 60
10. Question
During the final verification phase of an AI factory deployment, the team executes a High-Performance Linpack (HPL) test. The results show a significant Rmax value drop compared to the Rpeak theoretical performance. Which cluster-level assessment tool is best suited for identifying if the issue is a specific limping node or a general network congestion issue?
Correct:
C. ClusterKit; it performs multifaceted node assessments and identifies outliers in performance across the entire cluster. When a large-scale test like High-Performance Linpack (HPL) underperforms, the cause is often a “limping node“—a single node that is technically functional but running slower than its peers (due to thermal throttling, memory errors, or PCIe issues). ClusterKit is the primary tool used in NVIDIA AI Factory deployments to run sub-tests (like bandwidth and compute benchmarks) across all nodes simultaneously. It automatically aggregates results to highlight which specific nodes are statistical outliers, allowing administrators to isolate hardware issues from general network congestion.
Incorrect:
A. The DOCA Benchmarking tool; it isolates the DPU performance from the GPU performance to check for CPU bottlenecks. While NVIDIA DOCA is used for DPU (Data Processing Unit) acceleration and management, the DOCA benchmarking suite is focused on network offload and storage performance. It is not the primary tool for diagnosing GPU-heavy computational drops in an HPL test, which primarily stresses the Tensor Cores and NVLink fabric.
B. The Slurm squeue command; it identifies which jobs are pending and allows the administrator to prioritize the HPL task. The squeue command is a basic job management utility that shows the status of the job queue. It provides zero insight into the performance metrics or hardware health of the nodes running the HPL task. It can tell you that a job is running, but not how well it is performing.
D. The ping utility; it checks for basic ICMP connectivity between the head node and the compute nodes. Ping only verifies that a node is reachable at the network layer (Layer 3). HPL performance issues are usually related to high-bandwidth interconnects (InfiniBand/NVLink) or floating-point compute efficiency. A node can “ping“ perfectly fine while still having a faulty GPU or a degraded 200Gbps network link that is causing a massive drop in Rmax.
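Once ClusterKit has flagged an outlier, the suspect node can be spot-checked in place; a hedged sketch using standard NVIDIA tooling (the diagnostic run level is illustrative):
dcgmi diag -r 3                # DCGM stress-level diagnostics on the local node (GPU, memory, PCIe)
nvidia-smi -q -d PERFORMANCE   # inspect clock throttle reasons (thermal, power cap) that would explain a limping node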
Question 11 of 60
11. Question
A system administrator needs to optimize an NVIDIA BlueField network platform to handle intensive data movement for a large-scale AI cluster. Which configuration step is necessary to enable the DPU to perform offloaded hardware acceleration for InfiniBand or Ethernet traffic in a production environment?
Correct:
D. Configure the DPU in DPU-Mode (rather than Separated-Mode) and ensure the correct DOCA runtime environment is provisioned to manage the acceleration engines. In an AI Factory, the BlueField DPU must be set to DPU-Mode (also known as Embedded Function mode) to act as an independent compute node that manages its own network stack and security policies. In this mode, the DPU's ARM cores run an OS (typically Ubuntu) and use the NVIDIA DOCA (Data Center Infrastructure-on-a-Chip Architecture) framework to offload tasks like data encryption, storage virtualization, and network telemetry from the host CPU and GPU.
Incorrect:
A. Utilize the NVIDIA SMI tool to flash the BlueField firmware directly onto the HGX baseboard to unify the management of the network and compute layers. The NVIDIA SMI (nvidia-smi) tool is primarily for GPU management. BlueField DPUs have their own firmware and management tools (like mstflint or bfcfg). Furthermore, DPU firmware is flashed to the DPU's own flash memory, not the HGX baseboard, which is a separate physical component for GPU interconnectivity.
B. Set the MIG profile to 1g.10gb on the BlueField DPU to ensure that the network traffic is partitioned into small, manageable virtual streams for the GPU. MIG (Multi-Instance GPU) is a feature specific to NVIDIA GPUs (like the A100 or H100) that allows a single physical GPU to be partitioned into multiple hardware-isolated instances. It does not apply to BlueField DPUs. DPU traffic isolation is typically handled through SR-IOV (Single Root I/O Virtualization) or VirtIO, not MIG profiles.
C. Disable the internal ARM cores on the BlueField DPU to allow the host CPU to take over the network steering logic for better AI workload synchronization. The entire purpose of a DPU is to offload processing from the host CPU. Disabling the ARM cores would effectively turn the DPU into a standard NIC (Network Interface Card) or “dumb“ adapter, defeating the purpose of the BlueField platform in a high-scale AI environment where host CPU cycles are needed for other management tasks.
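A minimal sketch of how the mode is typically selected from the host with the NVIDIA firmware tools (the MST device path is illustrative, and INTERNAL_CPU_MODEL=1 corresponds to embedded/DPU mode on current BlueField firmware; confirm against the DOCA documentation for your release):
sudo mst start
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 set INTERNAL_CPU_MODEL=1   # embedded (DPU) mode; takes effect after a full power cycle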
Question 12 of 60
12. Question
A system administrator receives an alert that an NVIDIA H100 GPU in a cluster node has entered a fallen off the bus state. The nvidia-smi command shows the GPU is missing, and dmesg reports a PCIe bus error. After a warm reboot fails to resolve the issue, what is the next best troubleshooting step to identify if the fault is with the GPU or the PCIe slot?
Correct:
D. Perform a cold boot of the system, and if the issue persists, move the GPU to a known-working PCIe slot to see if the error follows the card. A “GPU has fallen off the bus“ error usually indicates a hardware communication failure between the GPU and the motherboard. A cold boot (completely removing power) is necessary to reset the PCIe training state and the GPU‘s internal firmware. If a cold boot fails, swapping the GPU to a different slot is the definitive way to isolate the fault: if the error moves with the card, the GPU is likely defective (requiring an RMA); if the error stays with the original slot, the motherboard or PCIe riser is the point of failure.
Incorrect:
A. Update the Slurm configuration to ignore PCIe errors, allowing the rest of the GPUs on the node to continue running workloads without interruption. Ignoring a PCIe bus error is dangerous in a production AI Factory. A GPU that has fallen off the bus can cause kernel panics, system instability, or data corruption. Slurm should be used to drain the node so it can be repaired, not to mask critical hardware failures.
B. Use a multimeter to measure the voltage on the NVLink bridge pins while the system is running a stress test to check for power fluctuations. This is highly impractical and unsafe in a data center. High-density servers like the NVIDIA HGX H100 use sophisticated internal power distribution boards. Attempting to manual-probe pins during a stress test risks short-circuiting the hardware and does not address the PCIe communication error reported by dmesg.
C. Reinstall the NVIDIA driver using the --force flag to overwrite any corrupted PCIe training sequences stored in the kernel memory of the OS. PCIe training happens at the BIOS/UEFI and hardware level during the boot sequence, well before the OS driver is loaded. If the hardware is not visible to the bus (as indicated by nvidia-smi showing it is missing), reinstalling the driver will not fix the underlying physical or link-layer communication issue.
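Before and after the cold boot, the bus-level view can be confirmed from the OS; a minimal sketch (the PCI address is illustrative):
dmesg | grep -i 'fallen off the bus'            # confirm the PCIe/Xid error in the kernel log
lspci | grep -i nvidia                          # check whether the GPU still enumerates on the bus
lspci -vvv -s 0000:17:00.0 | grep -i 'LnkSta'   # inspect negotiated link speed/width on the suspect slot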
Question 13 of 60
13. Question
An administrator is installing Base Command Manager (BCM) to orchestrate a new AI cluster. During the setup of the head node, they must configure High Availability (HA). What is the primary mechanism BCM uses to ensure the cluster remains operational if the primary head node suffers a catastrophic hardware failure?
Correct:
B. BCM configures a secondary head node that synchronizes its database and configuration files with the primary; it uses a heartbeat mechanism to trigger an automatic failover. High Availability in Base Command Manager (BCM) is achieved by deploying a pair of head nodes. The primary head node continuously synchronizes its internal database, configuration files, and software repositories with the secondary node. A heartbeat (typically over a dedicated management network) monitors the health of the primary. If the heartbeat is lost, the secondary node automatically assumes the “active“ role, taking over the cluster‘s virtual IP (VIP) and management services to ensure zero-to-minimal downtime for the AI factory.
Incorrect:
A. BCM requires the administrator to manually copy the Slurm configuration to a USB drive and plug it into a different server whenever the primary node fails. This describes a manual recovery process, not a High Availability mechanism. Modern AI infrastructure requires automated failover. BCM is designed to handle synchronization and state transitions programmatically without physical media intervention.
C. BCM uses a round-robin DNS strategy to distribute Slurm job requests to all compute nodes simultaneously, bypassing the need for a management node. Round-robin DNS is a load-balancing technique for web traffic, not a cluster management HA strategy. A management node (Head Node) is essential in a Slurm environment to act as the central controller (slurmctld). Without a head node or a functional HA pair, the scheduler cannot manage resource allocations or job queues.
D. BCM utilizes the GPU‘s NVLink interconnect to mirror the entire operating system of the head node onto the first compute node in the cluster. NVLink is a high-speed, point-to-point interconnect designed for GPU-to-GPU data transfers during model training; it is not used for operating system mirroring or cluster management tasks. Additionally, head nodes are typically CPU-heavy management servers and often do not even contain the GPUs necessary for an NVLink fabric connection.
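For reference, BCM exposes the state of the HA pair through its cmha utility; a hedged sketch (exact sub-commands can vary between BCM releases):
cmha status    # show primary/secondary roles, heartbeat health, and ownership of the shared virtual IP
cmha-setup     # interactive tool used when first configuring the secondary head node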
Question 14 of 60
14. Question
A network card in an AI node is frequently dropping packets, causing distributed training jobs to time out. The administrator has already replaced the cable and the transceiver. Which troubleshooting tool or method should be used next to determine if the issue is a faulty network card (NIC) or a configuration mismatch on the switch port?
Correct:
B. Checking the switch port counters for ‘FCS Errors‘ or ‘Runts‘ and comparing them with the NIC‘s internal error counters via ‘ethtool -S‘ or ‘mlnx_perf‘. In an AI cluster, high-speed interconnects (like InfiniBand or 100/200/400GbE) are sensitive to configuration mismatches (e.g., MTU size, Auto-Negotiation, or FEC settings). Frame Check Sequence (FCS) errors usually indicate signal integrity issues or hardware faults, while Runts or specific discard counters can indicate MTU mismatches. By using ethtool -S or NVIDIA's mlnx_perf utility, an administrator can see real-time hardware counters on the NIC. Comparing these to the switch‘s logs allows the admin to see if the errors are being generated at the source (NIC) or if the switch is dropping the traffic due to a port configuration error.
Incorrect:
A. Using the ‘nvidia-smi -q‘ command to check if the GPU‘s encoders are causing interference with the network card‘s electrical signals. nvidia-smi -q provides detailed hardware and software information about the GPU (clocks, power, memory). It does not provide network telemetry. While electrical interference (EMI) is a theoretical possibility in high-density racks, there is no “interference check“ feature in nvidia-smi to diagnose NIC packet loss.
C. Running a ‘Hello World‘ program in Python to see if the operating system‘s kernel is capable of processing basic arithmetic during packet loss. This test is irrelevant to network troubleshooting. Basic CPU arithmetic and kernel processing of local code are unrelated to the network interface card‘s ability to transmit or receive frames over a high-speed fabric.
D. Installing a second operating system on the node to see if the network card works better under a different brand of Linux. While driver compatibility is important, “distro hopping“ is not an efficient troubleshooting step in a production AI Factory environment. Most NVIDIA-certified systems use specific versions of Ubuntu or Red Hat Enterprise Linux (RHEL). The issue is much more likely to be found at the physical or data-link layer (Layer 1 or 2) using standard diagnostic tools rather than the OS brand.
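A minimal sketch of reading the NIC-side counters to compare against the switch port (the interface name is illustrative):
ethtool -S enp1s0f0 | grep -Ei 'err|drop|discard|fcs'   # NIC hardware counters for errors and discards
ip -s link show enp1s0f0                                # kernel-level RX/TX error and drop totals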
Question 15 of 60
15. Question
When performing a NeMo burn-in test on a large-scale cluster intended for Large Language Model training, what is the engineer specifically trying to validate regarding the overall system health?
Correct:
B. The ability of the cluster to maintain sustained throughput during training. A NeMo burn-in test is more than just a hardware check; it is a system-level validation. For Large Language Model (LLM) training, the cluster must maintain high Model FLOPS Utilization (MFU) and sustained data throughput across the fabric. This test ensures that the GPUs, the NVLink/NVSwitch fabric, and the InfiniBand/Ethernet networking can handle the constant, high-bandwidth communication (collective operations) without overheating, dropping packets, or causing the system to throttle over an extended period.
Incorrect:
A. The resolution of the monitor connected to the head node VGA port. AI cluster head nodes are managed remotely via SSH, Base Command Manager (BCM), or out-of-band management (BMC). The local VGA resolution is irrelevant to the performance, stability, or health of the AI infrastructure.
C. The speed at which the BIOS can perform a Power-On Self-Test. The Power-On Self-Test (POST) is a preliminary check performed at boot time. While a fast POST is convenient, it does not validate how the system behaves under the intense thermal and electrical loads generated by LLM training workloads, which is the primary purpose of a NeMo burn-in.
D. The compatibility of the cluster with legacy 32-bit applications. Modern AI Factories are built on 64-bit architectures (x86_64 or ARM64) and specialized software stacks (CUDA, NCCL). Legacy 32-bit compatibility is not a design requirement or a validation goal for a high-performance AI cluster intended for modern model training.
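During such a burn-in, sustained behavior can be watched live on each node; a minimal sketch using the nvidia-smi device monitor (the sampling interval is illustrative):
nvidia-smi dmon -s pucm -d 5   # stream power/temperature, utilization, clocks, and memory use every 5 seconds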
Question 16 of 60
16. Question
When installing Base Command Manager (BCM) as the control plane for an AI cluster, the administrator must configure High Availability (HA) for the head node. What is the primary reason for establishing a secondary head node in a BCM environment, and how does the system typically handle a failure of the primary node?
Correct: A. The secondary node provides redundancy for the cluster management database and services, using a heartbeat mechanism to trigger an automatic failover. The NCP-AII blueprint specifies that an HA configuration involves two head nodes: a Primary and a Secondary. BCM uses a heartbeat mechanism to constantly monitor the health of the primary node. If the primary node‘s management daemon (cmdaemon) or the physical hardware fails, the secondary node detects the loss of the heartbeat and automatically takes over the cluster‘s Virtual IP (VIP) and management services. This ensures that the MariaDB/MySQL management database remains synchronized and that compute nodes can continue to communicate with a controller without manual intervention.
Incorrect: B. The secondary node acts as a backup storage server that only turns on when the primary node runs out of disk space for user home directories. In an NVIDIA-Certified system, user data and “home“ directories are typically stored on high-performance third-party storage (like DDN, NetApp, or VAST) or a dedicated storage tier, not on the head node‘s local disks. The head node is a control plane device, and the HA secondary node is kept in a “hot-standby“ or active state, not a “storage-triggered“ power-on state.
C. HA is used to allow the administrator to run two different versions of the operating system simultaneously for testing purposes without affecting users. High Availability requires configuration symmetry. For a failover to be successful, both the primary and secondary head nodes must run identical versions of the BCM software and the underlying operating system. Running mismatched versions would lead to database corruption or service incompatibility during a failover event, which is the opposite of the “Reliability“ goal taught in the NCP-AII course.
D. The secondary node is used to double the compute power of the cluster by sharing the scheduling load with the primary node during peak hours. This confuses High Availability with “Load Balancing.“ While some BCM components can be distributed, the primary purpose of the HA secondary node is Redundancy, not performance scaling. The secondary node does not actively schedule jobs alongside the primary; it waits to take over the primary‘s duties only in the event of a failure to ensure cluster persistence.
Question 17 of 60
17. Question
As part of the multifaceted node assessment, an engineer runs a NeMo burn-in test. What is the primary purpose of running this specific workload as a verification step for an NVIDIA-Certified AI Infrastructure?
Correct:
A. To simulate a real-world, large-scale Large Language Model (LLM) training scenario to verify that the compute, networking, and storage stacks work together under heavy load. NVIDIA-Certified AI Infrastructure must be validated as a cohesive unit. While individual benchmarks (like memory tests or network pings) check components in isolation, a NeMo burn-in simulates the actual “pressure“ of an LLM workload. It forces the GPUs to run at peak power, the NVLink and InfiniBand fabrics to handle complex collective communications (like All-Reduce), and the storage system to manage massive data checkpoints. This ensures that the entire stack is stable and integrated correctly before production use.
Incorrect:
B. To test the user interface of the NeMo framework and ensure the font colors are accessible for all researchers in the organization. The NeMo framework is primarily used via Python APIs, command-line interfaces, and Jupyter notebooks. UI accessibility (such as font colors) is a front-end design concern and has no bearing on the infrastructure verification or hardware stability of an AI Factory.
C. To verify that the server can play high-definition video files using the GPU video encoders for a corporate presentation. While NVIDIA GPUs do contain hardware video encoders/decoders (NVENC/NVDEC), these are not the focus of an AI Infrastructure certification. A NeMo burn-in specifically targets Tensor Core compute and high-speed interconnect performance, which is vastly different from simple video playback.
D. To generate a small amount of heat to verify that the server internal light sensors are working correctly in the dark data center. Burn-in tests generate a massive amount of heat to test thermal management and cooling systems, not light sensors. Furthermore, enterprise servers rely on thermal and electrical sensors for health monitoring; “light sensors“ are not a standard component for verifying AI workload stability.
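The collective-communication path that a NeMo burn-in exercises can also be isolated with the NCCL tests; a hedged sketch assuming nccl-tests has been built on the node (the message-size sweep and GPU count are illustrative):
./all_reduce_perf -b 8M -e 8G -f 2 -g 8   # sweep All-Reduce message sizes across 8 local GPUs and report bus bandwidth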
Question 18 of 60
18. Question
During the installation of DOCA drivers on a BlueField-3 DPU, the administrator encounters a version mismatch between the host‘s OFED (OpenFabrics Enterprise Distribution) and the DPU‘s internal software. Which procedure should be followed to ensure the control plane is correctly configured for both the host and the DPU?
Correct:
C. Use the DOCA Host package to synchronize the drivers on the host and then use the BFB (BlueField Binary) image to update the DPU OS to a matching version. NVIDIA strictly requires that the DOCA-Host software (which includes the RShim driver and the bfb-install utility) be updated on the host server before updating the DPU itself. The BFB (BlueField Binary) is a comprehensive image that contains the Ubuntu OS for the DPU‘s ARM cores, the DPU‘s firmware, and the specialized DOCA drivers. By aligning the Host package version with the BFB version, you ensure that the control plane communication (via the RShim interface) is stable and that the version-sensitive OFED components are compatible across the PCIe bus.
Incorrect:
A. Disable the DOCA drivers on the host and use the standard inbox Linux drivers to avoid conflicts with the DPU‘s specialized hardware accelerators. Inbox Linux drivers do not support the advanced hardware offloads (like storage encryption or line-rate packet processing) required for an AI Factory. DOCA-specific drivers are necessary to unlock the BlueField-3's acceleration engines. Disabling them would relegate the DPU to a standard network card (NIC) mode, wasting the platform‘s capabilities.
B. Uninstall all NVIDIA drivers from the host and allow the DPU to push its own drivers to the host kernel during the next PXE boot cycle. The DPU does not “push“ drivers to the host kernel. While the DPU can act as a PXE boot server for compute nodes, the relationship between a host and its local DPU requires the host to have the DOCA-Host software pre-installed to manage and communicate with the card.
D. Flash the DPU firmware using a standard USB drive connected to the server‘s front panel, then reboot the server into BIOS recovery mode. BlueField DPUs are updated via the host OS using the bfb-install tool over the internal RShim (PCIe-to-USB) interface or via the DPU‘s dedicated management port. They are not updated via front-panel USB ports on the server chassis. BIOS recovery mode is for the server‘s motherboard and does not address driver or firmware mismatches on an add-in DPU.
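For reference, a minimal sketch of the host-then-DPU update sequence (the package name and BFB file name are illustrative, and the rshim device index depends on the slot):
sudo apt install doca-host                            # install the matching DOCA-Host bundle for your release (package name may differ)
sudo bfb-install --bfb bf-bundle.bfb --rshim rshim0   # push the matching BFB image to the DPU over the RShim interface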
Question 19 of 60
19. Question
A researcher needs to partition an NVIDIA A100 GPU using Multi-Instance GPU (MIG) to support multiple users with guaranteed Quality of Service (QoS). The researcher requires that each user has their own dedicated compute and memory resources. What is a key architectural characteristic of MIG that distinguishes it from traditional temporal GPU sharing?
Correct: • A. MIG provides hardware-level isolation by partitioning the GPU into separate instances, each with its own dedicated GPCs, memory controllers, and cache. Unlike traditional virtualization or time-slicing, Multi-Instance GPU (MIG) performs a spatial partitioning of the hardware. On an A100 or H100, the GPU is physically divided into up to seven instances. Each instance is assigned its own dedicated GPU Processing Clusters (GPCs), specific slices of the L2 cache, and its own memory controllers/bandwidth. This architectural isolation ensures that a heavy workload in one instance cannot "starve" another instance of memory bandwidth or cause latency spikes, providing a guaranteed Quality of Service (QoS) and fault isolation.
Incorrect: • B. MIG is only compatible with Windows-based AI workstations and cannot be used in Linux-based data centers or AI factories due to driver limitations. This is incorrect. MIG is a data center-grade feature primarily designed for Linux-based AI Factories, cloud service providers, and high-performance computing (HPC) environments. It is fully supported on Linux and integrates deeply with enterprise tools like Docker, Kubernetes, and Slurm.
• C. MIG allows the GPU to exceed its physical memory limit by automatically swapping data to the host system RAM when multiple instances are active. MIG does not enable memory oversubscription or "swapping" to host RAM. In fact, one of its core strengths is the strict memory boundary. If a MIG instance is allocated 10GB of HBM, it has access to exactly that amount. This prevents one user from over-allocating memory and crashing the entire physical GPU, a common issue in non-MIG environments.
• D. MIG uses software-based time-slicing to rotate tasks on the GPU cores, allowing for thousands of concurrent users on a single A100 module. This describes temporal sharing (standard multitasking or NVIDIA vGPU time-slicing), not MIG. Temporal sharing rotates different tasks on the same hardware over time, which can lead to jitter and performance interference. MIG uses spatial sharing, where tasks run simultaneously on physically separate parts of the silicon. Furthermore, MIG is limited to a maximum of seven instances per A100, not thousands.
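As a minimal sketch of the partitioning workflow on an A100, assuming GPU index 0 and noting that the numeric profile IDs vary by driver release (list them first rather than trusting the ID shown here):
# Enable MIG mode on GPU 0 (may require a GPU reset or reboot depending on the platform)
$ sudo nvidia-smi -i 0 -mig 1
# List the GPU instance profiles this driver/GPU combination supports
$ sudo nvidia-smi mig -lgip
# Create two 3g.20gb instances and their default compute instances (profile ID 9 is an assumption; confirm with -lgip)
$ sudo nvidia-smi mig -i 0 -cgi 9,9 -C
# Verify the resulting MIG devices and their dedicated memory
$ nvidia-smi -L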
Question 20 of 60
20. Question
A cluster administrator is investigating a performance drop in an AI training job. Upon checking the system, they find that the GPUs are running at lower than expected clock speeds despite being under heavy load. Which of the following is a common cause of this behavior that an administrator should investigate first?
Correct:
C. The GPU is being throttled due to reaching its thermal limit or its power limit, which can be verified using the command nvidia-smi -q -d PERFORMANCE. In an AI Factory, GPUs like the H100 operate at very high power densities. If the data center's cooling is insufficient or the power delivery to the server is capped, the GPU firmware will automatically lower clock speeds to protect the hardware from damage. This is known as throttling. The nvidia-smi -q -d PERFORMANCE command is the specific tool used to query "Clocks Throttle Reasons," which will explicitly state if the slowdown is due to "HW Thermal Slowdown," "Power Brake," or "Sync Boost."
Incorrect:
A. The Linux operating system has decided to use the GPU to mine cryptocurrency in the background, leaving no cycles for the AI training job. Standard enterprise Linux distributions (like Ubuntu or RHEL) used in AI clusters do not include background cryptocurrency miners. While a security breach (malware) could theoretically cause this, it is not a "common cause" or a hardware-level behavior an administrator should investigate first when diagnosing clock speed issues in a controlled AI infrastructure.
B. The user has accidentally set the GPU into a 2D mode which is only meant for displaying basic desktop graphics and cannot be changed without a hardware jumper. Modern data center GPUs (the HGX/SXM or PCIe "headless" models) do not have a "2D mode" meant for desktops, nor do they utilize hardware jumpers for power state management. Power states (P-states) are managed dynamically by the driver and firmware based on computational demand.
D. The InfiniBand switch has detected that the GPU is working too hard and has sent a PAUSE frame to the GPU PCIe controller to slow it down. While InfiniBand and Ethernet fabrics use congestion control mechanisms (like Priority Flow Control/PFC or ECN), these operate at the network layer to manage packet flow. A switch cannot directly control the internal core clock speeds of a GPU or send "PAUSE frames" to a PCIe controller to slow down the silicon's arithmetic logic units.
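A quick way to confirm or rule out throttling, sketched below (the section and field names in the nvidia-smi output differ slightly between driver versions):
# Dump the performance section and read the active entries under Clocks Throttle/Event Reasons
$ nvidia-smi -q -d PERFORMANCE
# Watch power draw, temperature, and SM/memory clocks live while the training job runs
$ nvidia-smi dmon -s puc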
Question 21 of 60
21. Question
To facilitate seamless workload orchestration in an AI factory, an administrator is configuring a Slurm cluster with Enroot and Pyxis. What is the specific purpose of the Pyxis plugin in this NVIDIA-based AI infrastructure software stack for the researchers?
Correct:
D. Pyxis is a Slurm plugin that allows users to run unprivileged containers using the Enroot runtime via standard Slurm job scripts. In an NVIDIA AI Factory, Enroot is the runtime that handles the isolation and creation of containers (turning Docker images into simple, unprivileged sandboxes). However, Enroot itself does not natively talk to Slurm. Pyxis is a SPANK (Slurm Plug-in Architecture for Node and task [K]ontrol) plugin that adds container-specific arguments (like --container-image and --container-mounts) to the srun and sbatch commands. This allows researchers to submit containerized AI workloads as if they were native binaries, with Pyxis automatically handling the image pulling and Enroot environment setup.
Incorrect:
A. Pyxis is a kernel module that enables hardware-level encryption for the InfiniBand fabric between nodes to secure data in transit. Encryption for data in transit over InfiniBand or Ethernet is typically handled at the hardware level by BlueField DPUs (using DOCA) or via IPsec/TLS offloads. Pyxis is a user-space scheduler plugin and has no role in managing network fabric encryption or kernel-level security modules.
B. Pyxis manages the power distribution to the GPUs and shuts them down when no Slurm jobs are in the pending queue to save energy. Power management and telemetry (like idling GPUs) are functions of the NVIDIA Data Center GPU Manager (DCGM) and the cluster management software, such as NVIDIA Base Command Manager (BCM). Pyxis is strictly focused on the container lifecycle within the scheduler.
C. Pyxis is a storage driver that connects the Slurm head node to the NGC model registry for automatic data downloading of AI models. While Pyxis can pull container images from the NVIDIA GPU Cloud (NGC) to set up a job, it is not a "storage driver." It does not manage persistent datasets or provide a filesystem for model weights. Storage connectivity is handled by specialized AI storage solutions (like Lustre or Weka) using drivers like GPUDirect Storage.
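For illustration, a typical researcher-facing submission with Pyxis might look like the sketch below; the image tag, mount paths, and script name are placeholders:
# Run a containerized training script on 8 GPUs; Pyxis pulls the image and Enroot unpacks it unprivileged
$ srun --nodes=1 --gpus=8 \
      --container-image=nvcr.io#nvidia/pytorch:24.05-py3 \
      --container-mounts=/datasets:/datasets \
      python train.py --epochs 10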
Question 22 of 60
22. Question
An administrator is managing a cluster of NVIDIA Grace Hopper Superchips. They need to verify the physical interconnect between the Grace CPU and the Hopper GPU. Which technology represents the high-speed, coherent interface that must be validated to ensure the CPU and GPU share a unified memory space effectively?
Correct:
B. NVLink-C2C (Chip-to-Chip) The NVIDIA Grace Hopper Superchip architecture relies on NVLink-C2C to provide a unified, coherent memory space. This interface allows the Grace CPU to access the Hopper GPU's HBM3 memory and the Hopper GPU to access the CPU's LPDDR5X memory with high bandwidth and low latency. This "memory-coherent" link is what distinguishes the Superchip from traditional discrete CPU-GPU setups connected via PCIe, as it enables the CPU and GPU to work on the same data structures without the overhead of explicit data copies.
Incorrect:
A. USB 4.0 Type-C USB 4.0 is a peripheral connectivity standard used for external devices like monitors, storage drives, and docking stations. It does not provide the bandwidth, low latency, or cache coherency required for internal processor-to-accelerator communication in an AI supercomputer.
C. SATA Express Interconnect SATA Express is a legacy storage interface designed to connect SSDs to a motherboard. It has been largely superseded by NVMe and does not have the architectural capability to facilitate memory coherency or high-speed data exchange between a CPU and a GPU.
D. Standard Ethernet 10GbE 10GbE is a networking standard used for node-to-node communication or management traffic. While 10GbE (or much faster speeds like 400GbE) is used for the cluster fabric, it is an external network interface and cannot act as the internal, on-chip coherent interconnect between a CPU and a GPU within a single Superchip module.
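A rough validation sketch on a Grace Hopper node, assuming the NVIDIA driver stack is installed (output formats and labels vary by driver and system firmware):
# Show how the GPU attaches to the CPU and NICs; on a Superchip the CPU-GPU path is NVLink-C2C, not a plain PCIe-only hop
$ nvidia-smi topo -m
# With coherent memory enabled, the Hopper GPU's HBM is typically exposed to the Grace CPU as an additional NUMA node
$ numactl --hardware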
Question 23 of 60
23. Question
In the context of Control Plane Installation, an administrator is configuring Categories within Base Command Manager. What is the significance of using Categories when managing a large cluster with a mix of different hardware types, such as some nodes with A100 GPUs and others with H100 GPUs?
Correct:
A. Categories allow the administrator to define distinct software images and configuration parameters for groups of nodes, ensuring correct drivers are applied. In a heterogeneous cluster containing different GPU architectures (e.g., Ampere-based A100s and Hopper-based H100s), you cannot use a single, generic software image if you want optimal performance. Categories act as a template or "blueprint." By creating a category for A100 nodes and another for H100 nodes, an administrator can assign specific software images (containing the correct driver versions, such as DGX OS 5 vs. DGX OS 6), kernel parameters, and post-install scripts to each group. This ensures that when a node is provisioned or updated, it automatically receives the configuration tuned for its specific hardware.
Incorrect:
B. Each node must belong to every Category simultaneously to ensure that it has access to all available software licenses in the cluster. In BCM, a node typically belongs to one primary category. Belonging to "every" category would create massive configuration conflicts, as each category might specify different (and incompatible) software images or kernel settings. Licensing in BCM is managed at the cluster or GPU level, not through category membership.
C. Categories are a legacy feature only used for managing storage arrays and are not relevant for GPU or compute node configuration. Categories are a core, modern feature of BCM (and its predecessor, Bright Cluster Manager). They are fundamental to the orchestration of compute nodes and GPUs. While storage nodes can be organized into categories, the feature is most critical for ensuring the compute plane remains consistent and manageable at scale.
D. Categories are used to sort the physical location of the nodes in the data center map but have no impact on the software or drivers installed on the nodes. While you might choose to name categories based on physical locations (e.g., "Rack-01"), their primary function is logical configuration. Unlike simple metadata tags, changing a node's category in BCM can trigger a re-provisioning of the operating system, change the installed driver stack, or modify the scheduler roles (e.g., Slurm compute vs. login).
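A minimal sketch of what this looks like in the BCM shell (cmsh); the category, image, and node names are assumptions, and property names can differ between BCM releases:
$ cmsh
% category
% clone default a100-nodes            # new category based on the default template
% set softwareimage a100-image        # image built with the Ampere-appropriate driver stack
% commit
% device
% foreach -n node001..node032 (set category a100-nodes)
% commit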
Question 24 of 60
24. Question
After the physical installation and software configuration, an engineer runs the High-Performance Linpack (HPL) benchmark on a single node. What is the primary objective of running HPL during the 'Cluster Test and Verification' phase for an NVIDIA AI infrastructure?
Correct:
D. To stress the GPUs and CPU to verify thermal stability and peak floating-point performance. In the "Cluster Test and Verification" phase of an AI Factory deployment, HPL is used as a foundational stress test. It solves a dense system of linear equations, which is extremely computationally intensive. For a single node, this validates that the GPU Tensor Cores and the host CPUs can operate at their maximum theoretical floating-point performance (Rpeak) without hitting thermal limits that would cause clock throttling. It is a critical "gate" to pass before moving to multi-node tests, ensuring that the cooling system and power delivery for each individual server are fully functional under the highest possible load.
Incorrect:
A. To test the latency of the OOB management network. The Out-of-Band (OOB) management network (used for BMC, IPMI, and Redfish) is a low-bandwidth administrative network. Testing its latency is typically done with simple ping tests or IPMI responsiveness checks. HPL is a heavy computational benchmark and has no relationship with the performance of the management network.
B. To measure the maximum theoretical bandwidth of the storage array. While storage is a vital component of an AI cluster, HPL is a compute-bound benchmark that primarily stresses the processor and memory. To measure storage bandwidth, an engineer would use tools like fio, ior, or specialized storage benchmarks. HPL's data resides in RAM/HBM during the test and does not reflect the performance of the external storage array.
C. To verify the installation of the NGC CLI. The NGC CLI is a management tool used for downloading containers and models. Its verification is a simple software check (running ngc --version). Using a massive mathematical benchmark like HPL to verify a command-line utility is unnecessary and technically unrelated.
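As an illustration of this single-node gate, one common pattern is to run the GPU-accelerated HPL shipped in NVIDIA's NGC hpc-benchmarks container and watch for throttling while it runs; the image tag, launch script, and input file below are assumptions to be checked against the current NGC documentation:
# Launch HPL on one node under Slurm/Pyxis (tag, script name, and HPL.dat are assumed)
$ srun --nodes=1 --gpus=8 --container-image=nvcr.io#nvidia/hpc-benchmarks:24.03 \
      ./hpl.sh --dat ./HPL.dat
# In a second shell, confirm clocks hold at boost and no thermal/power violations are reported
$ nvidia-smi dmon -s pucv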
Question 25 of 60
25. Question
During a performance audit of an AI factory, it is discovered that the InfiniBand fabric is experiencing high levels of congestion discard packets. Which optimization strategy should the network administrator apply at the switch level to resolve this and improve performance?
Correct:
A. Enable Adaptive Routing and configure Congestion Control (CC) parameters on the InfiniBand switches and DPUs. In a high-scale AI factory, traffic patterns are often bursty (e.g., during "All-Reduce" operations), leading to "elephant flows" that can overwhelm a specific path. Adaptive Routing (AR) allows the switch ASIC to dynamically select the least-loaded path for packets, spreading the load across the entire fabric. Furthermore, enabling Congestion Control allows the fabric to signal source nodes (HCAs/DPUs) to throttle their transmission rates before buffers overflow, preventing "congestion spread" and eliminating the discards that occur when buffers are exhausted.
Incorrect:
B. Reduce the MTU size to 1500 bytes to ensure the packets are small enough to pass through the switch buffers without queuing. AI workloads rely on RDMA (Remote Direct Memory Access) and high-throughput data movement. Reducing the MTU to 1500 (a standard Ethernet size) would significantly increase the overhead for every byte of data transmitted, leading to a massive decrease in effective bandwidth and higher CPU utilization. High-performance AI fabrics typically use a 4096-byte MTU for InfiniBand to maximize efficiency.
C. Disable the Subnet Manager on all switches to prevent it from recalculating routes during heavy traffic loads. The Subnet Manager (SM) is the "brain" of the InfiniBand fabric. Disabling it would prevent the fabric from functioning at all, as it is responsible for discovering the topology, assigning LIDs (Local Identifiers), and calculating the initial routing tables. While an SM can be stressed, the solution is to optimize its configuration or use UFM (Unified Fabric Manager), not to disable it.
D. Physically disconnect half of the compute nodes to reduce the total amount of traffic entering the fabric spine. This is a "reductio ad absurdum" option. While it would technically reduce congestion, it destroys the utility of the AI cluster. The goal of an AI infrastructure engineer is to maximize resource utilization, not to solve a networking problem by removing the expensive compute resources the network was built to support.
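A hedged sketch of where these knobs live when the fabric is managed directly by OpenSM rather than UFM; routing-engine names and congestion-control settings vary by OpenSM/UFM release, and congestion control itself is usually enabled through UFM or firmware tooling, so the values below are illustrative only:
# /etc/opensm/opensm.conf (excerpt; engine name assumed for this release)
routing_engine ar_ftree          # adaptive-routing-aware fat-tree engine
# After restarting OpenSM, re-run fabric diagnostics and re-check discard counters on suspect switch ports
$ ibdiagnet
$ perfquery <LID> <PORT>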
Question 26 of 60
26. Question
In a large AI cluster, a storage optimization task involves reducing the latency for small-file metadata operations which are slowing down the initial phase of a training job. Which of the following strategies would provide the most significant performance improvement for this specific bottleneck?
Correct: B. Implement a distributed metadata cache or use an All-Flash storage tier specifically for the metadata and small-file components of the dataset. AI training jobs often start by scanning millions of small files (like images or text snippets), creating a "metadata storm" that can overwhelm traditional spinning-disk (HDD) or even poorly configured SSD arrays. In an NVIDIA-Certified AI Factory, best practices involve using a high-performance All-Flash tier (NVMe) or a distributed metadata service (like those found in Lustre, Weka, or NetApp ONTAP). This ensures that small-file I/O and metadata lookups, which are latency-sensitive, are handled by the fastest storage medium, preventing the GPUs from idling while waiting for the filesystem to list files.
Incorrect: A. Enable software-based compression on the storage volume to reduce the total amount of data that needs to be read from the disks. Compression is a throughput optimization, not a latency optimization. While it can reduce the footprint of large files, the computational overhead of decompressing millions of small files can actually increase latency and CPU load, potentially worsening the metadata bottleneck rather than solving it.
C. Upgrade the InfiniBand switches from HDR (200Gb/s) to NDR (400Gb/s) to increase the total bandwidth available for data transfers. Upgrading bandwidth is like widening a highway to solve a stoplight problem. High-latency metadata operations are "small-packet" exchanges where the speed of light and the storage controller's response time matter more than the total pipe size. Until the storage backend can serve metadata faster, the extra bandwidth of NDR InfiniBand will remain underutilized.
D. Increase the number of GPUs in each node to allow the training job to process more data in parallel, thereby hiding the storage latency. Adding more GPUs increases the demand on the storage system. If the metadata layer is already a bottleneck, adding more parallel workers will likely lead to "lock contention" or even higher latencies as more processes fight to access the same directory structures, resulting in lower overall GPU utilization.
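To quantify the bottleneck before and after such a change, a small-file read test with fio is a reasonable sketch; the directory, file count, and sizes are placeholders meant to mimic an image-style dataset:
# Random reads across many small files to stress the metadata and small-I/O path
$ fio --name=smallfile-scan --directory=/mnt/dataset \
      --nrfiles=20000 --filesize=64k --rw=randread \
      --bs=64k --numjobs=16 --direct=1 --group_reporting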
Question 27 of 60
27. Question
During the physical validation phase of an AI factory deployment involving multiple NVIDIA DGX nodes, an administrator observes that several links are failing to negotiate at the expected 400Gbps speed despite using Direct Attach Copper (DAC) cables. The design utilizes a Fat-Tree topology. Which physical layer check should be prioritized to validate that the cable types and transceivers are sufficient for the required East-West traffic bandwidth?
Correct: C. Checking the cable length against DAC maximum reach specifications. In 400Gbps (NDR) environments, Direct Attach Copper (DAC) cables have very strict physical length limitations due to signal attenuation. For NDR (400Gbps), passive DAC cables are typically limited to 1.5 to 2 meters. If an administrator attempts to use a longer DAC cable to reach a top-of-rack switch in a tall rack or a neighboring rack in a Fat-Tree topology, the link will either fail to come up or down-train to a lower speed (like 200Gbps or 100Gbps) to maintain stability. Verifying that the physical cable length matches the manufacturer's specification for 400Gbps transmission is the first step in physical layer (L1) troubleshooting.
Incorrect: A. Verifying the OSFP to QSFP adapter compatibility. While modern DGX systems (like the DGX H100) use OSFP ports, and some switches might use QSFP, adapters are typically used to bridge different form factors. However, if the link is failing to negotiate speed while using a DAC, the primary bottleneck is the physical medium's ability to carry the high-frequency signal over a specific distance, not the mechanical adapter itself (provided the adapter is rated for the correct generation).
B. Confirming the TPM is enabled in the UEFI settings. The Trusted Platform Module (TPM) is a security component used for cryptographic keys, measured boot, and system integrity. It has no functional relationship with the physical link negotiation or bandwidth of the InfiniBand/Ethernet network interfaces.
D. Validating the BMC firmware version on the storage array. The Baseboard Management Controller (BMC) manages the health and power of the storage server. While the storage array's performance is important for the AI cluster, its BMC firmware version will not cause a physical link negotiation failure on the network interfaces of the DGX compute nodes.
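As a rough sketch of this L1 check (assuming the NVIDIA MFT tools and infiniband-diags are installed; the device name mlx5_0 is a placeholder):
ibstat mlx5_0                      # confirm the negotiated rate (e.g., 400 versus a down-trained 200/100)
mlxlink -d mlx5_0 --show_module    # report cable vendor, part number, type (passive copper) and length
If the reported cable length exceeds the passive DAC reach for NDR, replacing it with a shorter DAC or an active/optical cable is the usual remediation.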
Question 28 of 60
28. Question
A network card in an AI server is identified as faulty after showing intermittent link drops. The administrator needs to replace the card. Which step is critical to ensure that the new card is recognized and functions with the same performance characteristics as the rest of the cluster?
Correct
Correct: B. The administrator must verify and update the firmware of the new NIC to match the specific version used across the cluster as defined in the BCM category. In high-scale AI environments, subtle differences in firmware can lead to significant performance variations, specifically in how the InfiniBand or RoCE stacks handle congestion control and RDMA verbs. NVIDIA Base Command Manager (BCM) uses Categories to enforce a "Golden Image" or specific version baseline. When a hardware component like a NIC is replaced, it must be flashed to the cluster-wide standard to ensure it integrates correctly with the network fabric and behaves identically to its peers during collective communication tasks (e.g., All-Reduce). (A hedged firmware-alignment sketch follows the answer options below.)
Incorrect: A. No steps are needed; all network cards from the same manufacturer have identical performance and firmware regardless of when they were produced. This is a common misconception. Hardware revisions and manufacturing dates often ship with different firmware versions. In an AI cluster, even a minor version mismatch can cause "jitter" or synchronization delays in distributed training jobs, leading to a performance drop for the entire workload.
C. The administrator should swap the card while the server is running a training job to test if the hot-swap software can detect it automatically. High-performance network cards (ConnectX-6/7 or BlueField DPUs) are typically not treated as hot-swappable components in the same way a SATA drive might be. Attempting to swap a NIC during an active job would crash the training process and could potentially cause a kernel panic or electrical damage to the server's PCIe bus.
D. The administrator must change the MAC address of the new card to match the old card exactly using a Sharpie on the physical PCB. MAC addresses are burned into the hardware at the factory and are globally unique. While software-based MAC spoofing exists, it is never a standard requirement for hardware replacement in an AI cluster. Physical labeling with a marker has no effect on the digital identity or functional logic of the network card.
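A hedged sketch of aligning the replacement NIC with the cluster baseline (option B), assuming the NVIDIA Firmware Tools (MFT) are installed; the device path and the image name fw-baseline.bin are placeholders for the image mandated by the BCM category:
mlxfwmanager --query                      # show the PSID and currently flashed firmware of the new NIC
flint -d /dev/mst/mt4129_pciconf0 query   # alternative per-device query (example MST device path)
mlxfwmanager -u -i fw-baseline.bin        # flash the cluster-standard firmware image
After flashing and a reboot of the node, it can be reprovisioned into its BCM category so it inherits the same software image as its peers.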
Question 29 of 60
29. Question
During the configuration of a High Availability (HA) control plane in Base Command Manager, the administrator must verify that the failover mechanism is working correctly. Which component is responsible for maintaining the cluster database and ensuring that the standby head node can take over if the primary head node fails?
Correct
Correct: C. A shared heartbeat mechanism and synchronized database across head nodes. In an NVIDIA-certified AI cluster, the Base Command Manager control plane is often configured in an HA pair (Primary and Standby). To ensure a seamless failover, the system uses a heartbeat mechanism (often via a dedicated heartbeat network) to monitor the health of the active head node. Simultaneously, the cluster's configuration database, software images, and monitoring data are continuously synchronized between the nodes. If the heartbeat is lost, the standby node promotes itself to primary, takes over the virtual IP address, and uses the synchronized database to continue managing the cluster without data loss. (A hedged verification sketch follows the answer options below.)
Incorrect: A. The NVIDIA Container Toolkit running on the compute nodes. The NVIDIA Container Toolkit (which includes nvidia-container-runtime) is responsible for allowing containers to access GPU hardware on individual compute nodes. It has no role in managing the cluster's control plane database or the failover logic of the management head nodes.
B. The LDAP service used for user authentication and authorization. While LDAP (or Active Directory) is often used within an AI Factory to manage user identities, it is an external or auxiliary service. LDAP does not manage the internal cluster configuration database of the Base Command Manager, nor does it control the failover state of the management hardware.
D. The Slurm database daemon (slurmdbd) running on the login node. slurmdbd is responsible for recording job accounting information for the Slurm scheduler. While it is an important part of the workload management stack, it is not the component that manages the overall cluster infrastructure database or the HA state of the management software itself. Furthermore, in a standard BCM deployment, the management database is handled by the BCM engine, not the scheduler's accounting daemon.
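A hedged sketch of checking failover readiness from the active head node; the cmha utility shown here is the HA tool inherited from Bright Cluster Manager, and its availability and exact output depend on the BCM version deployed:
cmha status    # show primary/standby roles, shared (virtual) IP ownership, and heartbeat/database sync health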
Question 30 of 60
30. Question
A system administrator is troubleshooting a performance issue on an AMD-based AI server. They suspect that the GPU-to-CPU affinity is not optimized. Which tool or command should be used to identify the NUMA (Non-Uniform Memory Access) topology and ensure that the GPUs are pinned to the correct CPU socket for maximum throughput?
Correct
Correct: A. The 'lscpu' and 'nvidia-smi topo -m' commands to view the processor affinity and the physical layout of the GPUs relative to the CPU cores. In AMD EPYC-based servers, which often feature multiple NUMA nodes per socket (NPS settings), ensuring that a GPU communicates with the "local" CPU cores is vital for performance. The lscpu command provides a detailed breakdown of the CPU's NUMA architecture, showing which logical cores belong to which NUMA node. The nvidia-smi topo -m command displays the Topology Matrix, which explicitly lists the CPU Affinity (the specific range of CPU cores closest to each GPU) and the NUMA Affinity. By cross-referencing these, an administrator can ensure that data-loading processes or MPI ranks are pinned to the specific CPU cores that have a direct PCIe path to the GPU, minimizing high-latency cross-socket or cross-NUMA traffic. (A hedged affinity-check sketch follows the answer options below.)
Incorrect: B. The 'fdisk -l' command to check if the GPU memory has been partitioned as a virtual swap disk for the AMD processor. The fdisk utility is used for managing disk partitions on block storage devices (HDD/SSD). GPU memory (VRAM/HBM) is managed by the NVIDIA driver and cannot be "partitioned" into a system swap disk using standard Linux disk tools. Even if it were possible, using high-speed HBM as swap for a CPU would not be a standard optimization or troubleshooting step for affinity issues.
C. The 'ping' command to measure the latency between the GPU and the local hard drive on the AMD motherboard. The ping command is a network utility used to test reachability and latency over an IP network. It cannot measure internal bus latency between a GPU and a local storage device. To measure internal data transfer performance, one would use the NVIDIA Bandwidth Test or storage benchmarks like fio.
D. The 'apt-get upgrade' command to automatically download the latest AMD-to-NVIDIA optimization patch for the Linux kernel. apt-get upgrade is a general package management command that updates installed software to the latest versions available in the repositories. While it might update drivers, it is not a "topology identification tool." Optimization in an AI cluster requires manual configuration of affinity, BIOS settings, and environment variables, rather than relying on a generic "optimization patch."
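A minimal sketch of the affinity check and a pinned launch (the NUMA node number and the script name train.py are assumptions for illustration):
lscpu | grep -i numa    # list NUMA nodes and which CPU cores belong to each
nvidia-smi topo -m      # topology matrix showing interconnects plus CPU/NUMA affinity per GPU
# Pin a data-loading or training process to the cores and memory local to GPU 0 (node 0 assumed here):
numactl --cpunodebind=0 --membind=0 python train.py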
Question 31 of 60
31. Question
A cluster administrator is setting up a specialized software environment for an AI research team. They need to install Slurm along with Enroot and Pyxis. What is the primary reason for integrating Enroot and Pyxis with the Slurm workload manager in an NVIDIA-based AI infrastructure?
Correct
A. To automate the firmware updates for BlueField-3 DPUs
Incorrect: Firmware updates for NVIDIA BlueField-3 DPUs are typically managed using the NVIDIA BlueField-3 Software (via the DPU OS) or tools like mlxfwmanager and flint from the host. While some management software like Base Command Manager (BCM) might orchestrate these, Slurm/Enroot/Pyxis are focused on workload execution, not physical hardware maintenance.
B. To allow Slurm to run containerized workloads as if they were native processes
Correct: This is the primary function of the Pyxis and Enroot stack in an NVIDIA environment.
Enroot: A lightweight, unprivileged container runtime specifically optimized for HPC and AI. It converts Docker/OCI images into a simple filesystem (SquashFS) and runs them without the overhead of a persistent daemon (like Dockerd), making it faster and more stable for large-scale training.
Pyxis: A Slurm SPANK plugin that enables the srun and sbatch commands to accept container-specific flags (e.g., --container-image). This allows users to launch jobs directly into a containerized environment seamlessly, ensuring portability and reproducibility across the cluster without changing their Slurm workflow. (A hedged srun example follows the answer options below.)
C. To provide hardware-level encryption for the NVLink Fabric
Incorrect: NVLink encryption (where available, such as in Confidential Computing modes on H100) is a feature of the GPU hardware and the NVIDIA driver. Slurm, Pyxis, and Enroot operate at the application orchestration layer and do not manage the physical signaling or encryption protocols of the NVLink fabric.
D. To synchronize the BIOS settings across all compute nodes
Incorrect: BIOS synchronization and "Golden Image" configuration are tasks handled by infrastructure management tools like NVIDIA Base Command Manager or Redfish-based BMC scripts. Container runtimes and scheduler plugins are not involved in low-level motherboard or firmware settings.
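A hedged example of the resulting user workflow once Pyxis and Enroot are installed (the container tag and script name are illustrative only):
# Pyxis adds container flags to srun; Enroot pulls and unpacks the image without a persistent daemon.
srun --nodes=2 --gpus-per-node=8 --container-image=nvcr.io#nvidia/pytorch:24.05-py3 --container-mounts=/data:/data python train.py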
Question 32 of 60
32. Question
During the initial system bring-up of an NVIDIA HGX H100 system, an administrator notices that the Baseboard Management Controller (BMC) reports a power capping event despite the rack PDUs operating within their limits. Considering the critical power requirements for AI factories, which action should be the priority to ensure the server meets the high-performance demands of AI workloads without hardware-induced throttling?
Correct
Correct: D. Verify the Power Supply Unit (PSU) redundancy policy in the BMC and ensure all power cables are seated and connected to independent circuits.
This option is correct because a power capping event reported by the Baseboard Management Controller (BMC), despite rack PDUs operating within limits, points to a power delivery issue within the server itself rather than a facility-level problem. The HGX H100 system has high power demands: a single DGX H100 server can draw approximately 10-11 kW under full load. Several power-related factors must be checked at the server level:
PSU Redundancy Policy: When a system operates in an N+N redundant configuration, power capping must be enabled to limit system power consumption and ensure safe operation. If the PSU redundancy policy is misconfigured or if one power feed is lost, the BMC may enforce power capping to protect the hardware, even if the remaining PSUs could theoretically handle the load.
Power Cable Integrity: Loose or improperly seated power cables can cause intermittent power delivery issues. During initial system bring-up, verifying all power connections are fully seated is critical. This includes checking that power cables are connected to independent circuits as required for redundancy.
Power Budget Configuration: The BMC manages power capping through Redfish APIs and IPMI commands, setting conservative policies when needed. If the system is operating with fewer than the required number of power supplies, the BMC may enforce power caps to prevent overloading the remaining units.
Ensuring proper PSU configuration and physical power connections addresses the most likely root cause of BMC-reported power capping when facility power appears adequate. (A hedged BMC power-query sketch follows the answer options below.)
Incorrect:
A. Update the TPM firmware to version 2.0 to allow for higher power draw authorization from the motherboard components during peak loads.
This is incorrect. TPM (Trusted Platform Module) is a security component used for cryptographic operations, secure boot, and platform integrity verification. It has no role in power delivery authorization or power capping management. TPM firmware updates address security vulnerabilities and compatibility, not power draw capabilities. Power management is handled by the BMC, PSUs, and voltage regulators, not the TPM.
B. Reinstall the NVIDIA Container Toolkit to recalibrate the power sensing logic of the underlying operating system and driver stack.
This is incorrect. The NVIDIA Container Toolkit enables GPU acceleration in containers but does not perform power sensing or calibration. Power monitoring and capping are handled by the BMC hardware, GPU firmware (VBIOS), and NVIDIA drivers through tools like nvidia-smi. Reinstalling the Container Toolkit would not affect power sensing logic or resolve BMC-reported power capping events. The Container Toolkit is for container integration, not power management.
C. Decrease the GPU clock frequency via nvidia-smi to manually stay under the current power threshold until more power is available.
This is incorrect as a troubleshooting priority. While manually reducing GPU clocks can lower power consumption, it works around the symptom rather than addressing the root cause during system bring-up. The priority should be resolving the underlying power delivery issue, such as PSU redundancy misconfiguration or loose cables, to allow the system to operate at its designed performance levels. Thermal throttling or power capping should first be addressed by verifying environmental factors and power integrity before manually tuning power limits. This approach ensures the system meets high-performance demands without artificial throttling.
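A hedged sketch of the server-level checks (BMC hostname, credentials, and the exact Redfish chassis path vary by vendor and are placeholders here):
ipmitool sdr type "Power Supply"    # PSU presence, input-lost and failure sensors
ipmitool dcmi power reading         # instantaneous draw versus the enforced power limit
curl -k -u admin:password https://bmc.example.com/redfish/v1/Chassis/1/Power    # redundancy policy and limits via Redfish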
Question 33 of 60
33. Question
An administrator is performing a NeMo burn-in test on a newly configured cluster. During the test, several nodes reboot spontaneously. After checking logs, the administrator finds Power Supply Input Lost and Critical Over Temperature events. What is the primary purpose of the burn-in test in this context, and what does this failure indicate?
Correct
A. The test is meant to train a chatbot; the failure indicates complexity.
Incorrect: While NVIDIA NeMo is used for LLM development (including chatbots), the primary goal of a burn-in test in an infrastructure context is not to successfully train a model but to validate that the hardware can handle the sustained stress of doing so. A failure during this phase is a hardware or facility signal, not a model architecture issue.
B. The test checks the speed of the Linux boot process.
Incorrect: Boot speed is irrelevant to a NeMo burn-in. Burn-in tests are performed after the system has already booted and is running heavy compute operations. Spontaneous reboots during these tests are symptoms of hardware or environmental instability, not "slow" software.
C. The test verifies the Slurm license.
Incorrect: Slurm is an open-source workload manager; it does not utilize a license that would force a physical system shutdown or cause "Critical Over Temperature" events.
D. The test stresses physical infrastructure; the failure indicates power and cooling issues.
Correct: According to NCP-AII standards, burn-in tests (HPL, NCCL, and NeMo) are designed to simulate peak production loads.
Power Supply Input Lost: This event typically occurs when the GPUs draw more power than the Data Center's PDU (Power Distribution Unit) or the server's PSUs can provide, leading to a trip or voltage drop.
Critical Over Temperature: This indicates that the facility's cooling (CRAC units, airflow, or liquid cooling loops) is insufficient to dissipate the heat generated by the GPUs and CPUs running at 100% load.
Failure Indication: The failure of a burn-in test means the "AI Factory" is not yet job-ready and requires facility-level remediation before production training can begin.
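As a small illustration, the BMC System Event Log on each affected node can be correlated with the reboot times to confirm the facility-level nature of the failure (output format varies by BMC vendor):
ipmitool sel elist | grep -Ei 'power supply|temperature'    # look for "Input Lost" and "Over Temperature" entries around the reboots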
Question 34 of 60
34. Question
A cluster is experiencing intermittent network performance drops during large-scale NCCL all-reduce operations. The administrator suspects a faulty transceiver. Which tool or method should be used to verify the signal quality and identify if a specific cable or transceiver is failing within the AI fabric?
Correct
A. Using 'mlnx_qos' to check for traffic priority
Incorrect: mlnx_qos is used to configure and view Quality of Service (QoS) parameters, such as Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS). While misconfigured QoS can cause performance drops, it is a logical configuration check, not a tool for diagnosing physical signal quality or hardware degradation in a specific cable.
B. Changing the Slurm partition name to 'debug'
Incorrect: Slurm partitions are organizational labels used to group nodes for specific workloads. Renaming a partition to "debug" has no functional effect on the hardware, network drivers, or physical layer diagnostics.
C. Reviewing the 'ethtool -S' counters or using 'cable-test' diagnostics on the switch
Correct: This method directly addresses the physical layer health as defined in the NCP-AII curriculum.
ethtool -S (Host Side): For Ethernet/RoCE fabrics, this command displays detailed hardware-level counters. An administrator looks for rx_crc_errors, rx_symbol_errors, or phy_symbol_errors, which are definitive signs of a failing transceiver or a dirty fiber connection.
cable-test (Switch Side): NVIDIA Spectrum and Quantum switches provide built-in diagnostics (like TDR – Time Domain Reflectometry) to verify cable integrity. On InfiniBand fabrics, tools like mlxlink or ibdiagnet are also used to check the Bit Error Rate (BER) and signal eye quality.
D. Reinstalling the NVIDIA Container Toolkit on the head node
Incorrect: The NVIDIA Container Toolkit (nvidia-container-runtime) manages GPU visibility and drivers within Docker/Enroot. It does not manage the network fabric. Reinstalling it on the head node would have no impact on intermittent hardware-level performance drops occurring on the compute nodes' high-speed interfaces.
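A hedged sketch of the physical-layer checks behind option C (the interface and device names are examples; ibdiagnet is part of the InfiniBand diagnostics suite and mlxlink is part of MFT):
ethtool -S enp65s0f0 | grep -Ei 'crc|symbol|discard'    # rising CRC/symbol error counters point to a failing cable or transceiver
mlxlink -d mlx5_0 --show_module                         # transceiver vendor, type, temperature, and cable length
ibdiagnet                                               # fabric-wide scan reporting per-link errors and BER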
Question 35 of 60
35. Question
An administrator needs to partition a physical NVIDIA H100 GPU into multiple instances to support diverse workloads ranging from small-scale inference to moderate development tasks. Which feature should be configured to ensure that each instance has dedicated hardware resources including compute units and memory, providing strict isolation and predictable performance for multiple users on the same GPU?
Correct
A. CUDA Multi-Process Service (MPS)
Incorrect: MPS is a software-based solution designed to increase throughput by allowing multiple processes (typically ranks of a single MPI job) to share the GPU's compute resources concurrently. While it allows for fine-grained compute limits, it does not provide hardware-level isolation. Processes share the same memory address space and hardware error containment, meaning a crash in one process can impact others. It is not suitable for diverse, independent workloads where strict isolation is required.
B. Multi-Instance GPU (MIG)
Correct: MIG is the definitive solution for hardware-level partitioning on NVIDIA H100 and A100 GPUs.
Strict Isolation: It physically partitions the GPU into up to seven independent GPU Instances. Each instance has its own dedicated path to specific hardware resources, including Streaming Multiprocessors (SMs), L2 cache, and High Bandwidth Memory (HBM).
Predictable Performance: Because resources are not shared or time-sliced, one user's workload cannot "starve" another's memory bandwidth or compute cycles, ensuring guaranteed QoS.
Fault Isolation: A failure or "GPU reset" in one MIG instance does not affect the others, making it ideal for multi-tenant AI factories. (A hedged MIG configuration sketch follows the answer options below.)
C. NVIDIA vGPU (Virtual GPU)
Incorrect: While vGPU is used for virtualization, the standard "time-sliced" vGPU (C-Series) uses temporal partitioning, where VMs take turns using the whole GPU. This can introduce latency and does not provide the "dedicated hardware resources" for parallel execution described in the question. Note: While NVIDIA does support "MIG-backed vGPU," the foundational technology that provides the dedicated compute and memory hardware is MIG itself.
D. Docker containers with the NVIDIA Container Toolkit
Incorrect: The NVIDIA Container Toolkit (and the --gpus flag in Docker) provides logical isolation. It limits which GPUs a container can see, but it does not partition a single GPU's internal hardware. Without MIG or MPS, multiple containers pointed at the same physical GPU will compete for the same SMs and memory bandwidth, leading to jitter and unpredictable performance.
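A minimal sketch of the MIG workflow on one GPU (profile names such as 1g.10gb depend on the exact H100 SKU, so treat them as examples; the commands typically require root privileges):
nvidia-smi -i 0 -mig 1                 # enable MIG mode on GPU 0 (may require a GPU reset to take effect)
nvidia-smi mig -i 0 -cgi 1g.10gb -C    # create a GPU instance with that profile plus its compute instance
nvidia-smi -L                          # list the resulting MIG devices (UUIDs usable by containers or Slurm)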
Question 36 of 60
36. Question
During the Cluster Test and Verification phase, an administrator runs ClusterKit to perform a multifaceted node assessment. The report indicates a Mismatched Firmware warning for the BlueField-3 DPUs and the InfiniBand transceivers across the cluster. Why is it critical to ensure that all transceivers and DPUs have consistent firmware versions before moving to production?
Correct
A. Different firmware versions cause light to oscillate at different frequencies…
Incorrect: This is a humorous fabrication. Fiber optic communication uses specific, standardized wavelengths (e.g., 850nm for multi-mode). Firmware does not change the physical properties of light or cause “color bleeding“ that affects digital model weights.
B. Consistent firmware is only required for aesthetic reasons…
Incorrect: Firmware on DPUs and transceivers controls critical low-level functions such as link negotiation, error correction (Forward Error Correction or FEC), and power management. Mismatches frequently lead to performance degradation or total link failure, which are significant functional issues rather than aesthetic ones.
C. Mismatched firmware can lead to intermittent link flaps and sub-optimal signal quality…
Correct: This follows the NCP-AII Cluster Test and Verification standards.
Link Flaps: Inconsistent firmware between a DPU and a transceiver can cause the link to repeatedly go up and down (flap) because the handshake protocols for 400G signaling are not perfectly aligned.
Signal Quality: Firmware defines the “tuning“ for PAM4 signaling. Mismatches can increase the Bit Error Rate (BER), forcing the hardware to perform retries that latency-sensitive AI workloads cannot tolerate.
LLDP Issues: Link Layer Discovery Protocol (LLDP) and other fabric management protocols rely on firmware compatibility to correctly identify topology. Failures here can prevent the Subnet Manager (SM) from optimizing the fabric path.
D. The NVIDIA license manager will refuse to activate the GPUs…
Incorrect: While NVIDIA does have software licensing (such as for AI Enterprise or vGPU), there is no hardware-level “lockout“ that prevents GPU activation based on the firmware version of a network transceiver. These are separate hardware subsystems.
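As a hedged illustration, firmware drift across ConnectX NICs and BlueField DPUs is commonly checked with the Mellanox Firmware Tools (MFT); the device path below is a placeholder that varies per host:
# Start the MST service and list detected devices
$ sudo mst start && sudo mst status
# Compare installed vs. available firmware for all detected adapters/DPUs
$ sudo mlxfwmanager --query
# Query a single device directly (device path is a placeholder)
$ sudo flint -d /dev/mst/mt41692_pciconf0 query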
Question 37 of 60
37. Question
An AI cluster is experiencing intermittent GPU dropouts during heavy training jobs. The system logs indicate a ‘GPU Fallen Off the Bus‘ error. Which of the following troubleshooting steps should be taken first to identify if the issue is hardware-related or thermal-related?
Correct
A. Upgrade the Slurm scheduler to the latest experimental beta version
Incorrect: Slurm is a workload manager and does not have direct control over the physical PCIe bus or GPU thermal management. Using “experimental beta“ software in a production AI cluster is against NVIDIA best practices, as it introduces instability without addressing the root hardware/thermal cause.
B. Replace the high-speed InfiniBand switches with standard 1GbE switches
Incorrect: This would be a massive downgrade in performance. While InfiniBand is the “data plane,“ it is unrelated to a local “Fallen Off Bus“ error, which occurs on the internal PCIe/NVLink bus of the server. 1GbE switches cannot handle AI training traffic and would not help diagnose a GPU hardware fault.
C. Check the DCGM-exporter metrics for high temperature and power spikes
Correct: According to the NCP-AII curriculum, NVIDIA DCGM (Data Center GPU Manager) is the primary tool for health monitoring.
Thermal-related: If the metrics show temperatures exceeding the thermal slowdown threshold (typically 80°C – 85°C), the GPU may have shut down to prevent damage.
Hardware-related: DCGM can track XID errors. If the error occurred without a temperature or power spike, it points toward a physical seating issue, a faulty PCIe bridge, or a defective GPU.
Power Spikes: Monitoring power draw helps identify if the PSU is failing to meet the peak demand of an H100 or A100 during training.
D. Delete all user data from the shared storage
Incorrect: Storage capacity issues might cause a job to fail with a “Disk Quota Exceeded“ or “No space left on device“ error, but they will never cause a GPU to physically fall off the PCIe bus. This action is destructive and irrelevant to hardware troubleshooting.
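A non-invasive first pass typically pairs the kernel log with live GPU telemetry, for example:
# Look for Xid events and the fall-off-the-bus message in the kernel log
$ dmesg -T | grep -Ei 'xid|fallen off the bus'
# Check current temperature, throttle state, and power draw on the affected GPU
$ nvidia-smi -q -d TEMPERATURE,POWER -i 0
# Run a quick, non-destructive DCGM diagnostic on the node
$ dcgmi diag -r 1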
Question 38 of 60
38. Question
A storage optimization task is underway for an AI cluster where data loading is identified as the primary bottleneck. The administrator decides to implement NVIDIA GPUDirect Storage (GDS). What is the primary requirement for the network cards (NICs) to support this feature effectively?
Correct
A. The NICs must have integrated RGB lighting
Incorrect: RGB lighting is purely aesthetic and has no impact on the functional performance, data transfer logic, or telemetry of an AI cluster. Professional data center hardware (like ConnectX-7 or BlueField-3) prioritizes thermal efficiency over lighting.
B. The NICs must support RDMA and be positioned on the same PCIe root complex
Correct: This is a fundamental hardware requirement for GDS according to the NCP-AII curriculum.
RDMA (Remote Direct Memory Access): GDS relies on RDMA (InfiniBand or RoCE) to move data directly from the network card into the GPU‘s memory. This is what allows the “zero-copy“ transfer that avoids the CPU.
PCIe Root Complex/Switch: For the most efficient direct memory access (DMA), the NIC and the GPU should ideally be connected to the same PCIe Switch or Root Complex. If the data must cross between different CPU sockets (via UPI or QPI), the latency increases and the bandwidth can be halved, defeating the purpose of GDS.
C. The NICs must use the TCP/IP stack exclusively
Incorrect: The standard TCP/IP stack involves significant CPU overhead and data copying within the kernel. GDS specifically aims to move away from this model. While GDS can operate in “compatibility mode“ using standard I/O, its primary performance benefits are only realized when using RDMA-enabled protocols.
D. The NICs must be connected directly to the BMC management port
Incorrect: The BMC (Baseboard Management Controller) is used for Out-of-Band (OOB) management (power control, hardware health). Its bandwidth is typically limited to 1GbE, which is thousands of times slower than what is needed for AI training. GDS data traffic happens on the high-speed data plane, not the management plane.
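Whether a NIC and a GPU actually share a PCIe switch or root complex can be confirmed from the host before enabling GDS:
# Print the GPU/NIC topology matrix; PIX or PXB indicates a shared PCIe switch/bridge,
# while SYS means traffic crosses the CPU interconnect (UPI/QPI) and GDS efficiency drops
$ nvidia-smi topo -m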
Question 39 of 60
39. Question
A user wants to run a multi-node AI training job using the srun command with the Pyxis and Enroot stack. The command fails with an error indicating that the container image cannot be found. Which component of the control plane is responsible for pulling the image and converting it into a runtime format?
Correct
A. The Slurm database stores all container images in an internal SQL table
Incorrect: The Slurm database (slurmdbd) is used for accounting, job history, and user/limit associations. It does not store binary blobs like container images or squashfs files, as this would cause significant performance degradation and storage bloat within the database engine.
B. The DOCA driver uses DPU hardware acceleration to decrypt the image
Incorrect: While NVIDIA BlueField DPUs and DOCA can be used to accelerate networking and security, they are not responsible for the standard container pull-and-unpack cycle in a Pyxis/Enroot setup. Container image management in this stack is handled at the OS/runtime level, not by network driver decryption.
C. The NGC CLI must be manually run on every node
Incorrect: The NGC CLI is a powerful tool for manually downloading assets, but one of the primary benefits of the Pyxis/Enroot integration is automation. When a user submits a job with --container-image, the system is designed to handle the retrieval and conversion automatically on the allocated nodes, rather than requiring the administrator to pre-stage images manually across the entire cluster.
D. The Enroot runtime, triggered by the Pyxis plugin
Correct: This describes the standard control plane workflow for containerized jobs in an NVIDIA AI infrastructure:
Pyxis: This is a Slurm SPANK plugin. When srun --container-image=… is executed, Pyxis intercepts the command and sends the request to the compute nodes.
Enroot: This is the container runtime itself. On each local compute node, Enroot is triggered to:
Pull: Download the layers from a registry (like NVIDIA NGC or Docker Hub).
Unpack/Convert: Convert the traditional Docker/OCI layers into a SquashFS file or an unpacked directory tree.
Execute: Start the unprivileged container for the job.
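A typical containerized job submission in this stack looks like the sketch below; the container tag, node counts, and mount paths are placeholders, and the nvcr.io#... form follows the Enroot URI convention that Pyxis accepts:
# Pyxis intercepts --container-image; Enroot pulls and unpacks the image on each allocated node
$ srun -N 2 --ntasks-per-node=8 \
      --container-image=nvcr.io#nvidia/pytorch:24.05-py3 \
      --container-mounts=/mnt/data:/data \
      python train.py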
Question 40 of 60
40. Question
A storage optimization task is underway for an AI cluster where data loading is identified as the bottleneck. The administrator decides to implement NVIDIA GPUDirect Storage (GDS). What is the primary requirement for the network cards (NICs) to support this feature effectively?
Correct
A. The NICs must be connected directly to the BMC‘s management port.
Incorrect: The Baseboard Management Controller (BMC) is used for out-of-band management (powering on/off, monitoring health, remote console). It typically operates at 1GbE speeds, which is far too slow for AI data loading. GDS operates on the high-speed data fabric (InfiniBand or 400GbE), not the management network.
B. The NICs must be configured to use the TCP/IP stack exclusively.
Incorrect: The standard TCP/IP stack is highly CPU-intensive and involves multiple data copies in system memory. GDS specifically avoids this by using RDMA. While GDS can run in a “compatibility mode“ over TCP, the primary performance benefits (low latency and high throughput) are lost.
C. The NICs must have integrated RGB lighting.
Incorrect: RGB lighting is a consumer aesthetic feature and has no functional role in data center performance, GDS logic, or infrastructure certification.
D. The NICs must support RDMA and be positioned on the same PCIe root complex.
Correct: This is a fundamental architectural requirement for GDS in the NCP-AII curriculum.
RDMA (Remote Direct Memory Access): GDS uses RDMA (InfiniBand or RoCE) to move data directly from the network card to the GPU memory without involving the host CPU.
PCIe Root Complex/Switch: For the data transfer to be “direct,“ the NIC and the GPU should ideally reside on the same PCIe Switch or Root Complex. If data has to cross between different CPU sockets (via UPI/QPI links), it introduces latency and reduces bandwidth, which significantly diminishes the efficiency of GDS.
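Once the hardware layout meets these requirements, the GDS installation itself can be sanity-checked with the gdscheck utility that ships with CUDA/GDS (the install path varies by CUDA version):
# Report GDS platform support, driver status, and which filesystems/interconnects are usable
$ /usr/local/cuda/gds/tools/gdscheck -p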
Question 41 of 60
41. Question
An infrastructure engineer needs to optimize an NVIDIA H100 GPU for a multi-tenant environment where several small AI inference jobs must run simultaneously with strict hardware isolation. Which technology should be configured to ensure that each job has its own dedicated high-bandwidth memory (HBM) and compute resources at the hardware level, preventing interference between tenants?
Correct
A. Multi-Instance GPU (MIG)
Correct: According to the latest NCP-AII standards, MIG is the only technology that provides spatial (hardware) partitioning for GPUs like the H100 and A100.
Hardware Isolation: It carves the GPU into up to seven independent instances. Each instance has its own dedicated compute slices (Streaming Multiprocessors), L2 cache, and High-Bandwidth Memory (HBM).
Predictable Performance: Because resources are not shared, one tenant's workload cannot "interfere" with another's (no "noisy neighbor" effect).
Fault Isolation: If a process in one MIG instance crashes or causes a GPU error, the other instances remain operational, which is critical for multi-tenant production environments.
B. CUDA Multi-Process Service (MPS)
Incorrect: MPS is a software-based solution that allows multiple processes to share a single GPU context concurrently. While it improves utilization and allows for some resource limits, it does not provide hardware isolation. Processes still share the same underlying memory and cache, and a fatal error in one process can crash the entire GPU, making it unsuitable for strict multi-tenancy.
C. NVIDIA Virtual GPU (vGPU) software
Incorrect: While vGPU is used for virtualization, standard "time-sliced" vGPU (like the C-series) rotates access to the GPU cores over time (temporal sharing). While it provides memory isolation via the hypervisor, it does not provide the parallel, dedicated hardware compute resources described. Note: You can run vGPU on top of MIG (MIG-backed vGPU), but the underlying partitioning technology is still MIG.
D. Time-Slicing Scheduler
Incorrect: Time-slicing is the default behavior when multiple processes use a single GPU. The GPU driver switches between tasks sequentially. This results in variable latency and no dedicated hardware resources, as each task "owns" the entire GPU for a brief window before being swapped out.
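To see which instance sizes a given H100 supports and to pin a tenant to one instance, something like the following is commonly used; the MIG UUID and serve.py are placeholders standing in for the tenant's actual workload:
# List the MIG profiles this GPU can be partitioned into
$ nvidia-smi mig -lgip
# List MIG devices and their UUIDs once instances exist
$ nvidia-smi -L
# Pin one tenant's inference job to a single MIG instance (UUID is a placeholder)
$ CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx python serve.py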
Question 42 of 60
42. Question
To verify the health of the inter-node East-West fabric, the administrator runs an NCCL (NVIDIA Collective Communications Library) ‘all_reduce‘ test across 16 nodes. The results show a significant bandwidth bottleneck. Which specific check should the administrator perform on the InfiniBand switches and BlueField-3 DPUs to troubleshoot this network performance issue?
Correct
A. Check if the ‘gcc‘ compiler is installed on the switch
Incorrect: InfiniBand and Ethernet switches are high-performance networking appliances that run specialized network operating systems (like NVIDIA Onyx or Cumulus Linux). They do not recompile NCCL kernels; NCCL kernels are compiled on the compute nodes (the servers) and run on the GPUs. The switch‘s role is purely data-plane packet forwarding.
B. Verify that the GPUs are in MIG mode
Incorrect: Multi-Instance GPU (MIG) is used to partition a single GPU for multi-tenancy. While NCCL can run on MIG instances, enabling MIG is not a requirement for the network fabric to work effectively. In fact, for large-scale training jobs that require maximum bandwidth, administrators typically use the full GPU (Non-MIG) to ensure all available NVLink and network bandwidth is dedicated to a single task.
C. Confirm that the Adaptive Routing (AR) and Congestion Control (CC) settings are correctly configured
Correct: This is a core troubleshooting step in the NCP-AII curriculum for resolving performance bottlenecks in the high-speed fabric.
Adaptive Routing (AR): In a large-scale AI factory with a Fat-Tree topology, AR allows the InfiniBand/Spectrum-X switches to dynamically route packets across multiple available paths to avoid localized congestion. If AR is disabled or misconfigured, traffic may “hotspot“ on a single link, causing the bottleneck observed.
Congestion Control (CC): Technologies like DCQCN (for RoCE) or hardware congestion control (for InfiniBand) prevent a “slow receiver“ from backing up traffic and affecting the entire fabric (Head-of-Line blocking). Consistent settings across the BlueField-3 DPUs and the Quantum/Spectrum switches are mandatory to ensure the fabric remains lossless and high-throughput.
D. Ensure that the network cables are painted with a non-conductive coating
Incorrect: This is a purely fictional concept. High-speed networking uses photons in fiber optic cables or electrical signals in copper DACs. Neither “static electricity on photons“ nor “non-conductive paint“ are real physical factors in data center networking performance.
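As a rough, hedged check, congestion caused by missing AR/CC tuning often shows up as climbing transmit-wait and discard counters on specific switch ports; the standard InfiniBand diagnostics can sample these (LID and port values are placeholders):
# Sweep the fabric for ports reporting non-zero error or congestion-related counters
$ sudo ibqueryerrors
# Read extended counters (including PortXmitWait) on a suspect port
$ sudo perfquery -x <LID> <PORT>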
Question 43 of 60
43. Question
An AI cluster is experiencing intermittent GPU dropouts during heavy training jobs. The system logs indicate a GPU Fallen Off the Bus error. Which of the following troubleshooting steps should be taken first to identify if the issue is hardware-related or thermal-related?
Correct
Correct: A. Check the DCGM-exporter metrics for high temperature and power spikes
This option is correct because DCGM (Data Center GPU Manager) is the primary tool for collecting comprehensive GPU telemetry, and checking its metrics for temperature and power anomalies is the logical first step to differentiate between hardware and thermal causes of a "GPU Fallen Off the Bus" error.
The "GPU Fallen Off the Bus" error in system logs (dmesg) indicates that the GPU has become unresponsive or disconnected from the PCIe bus. This can be caused by either hardware issues (PCIe slot contact problems, power supply instability) or thermal issues (overheating triggering protective shutdown). Before physically inspecting hardware, checking DCGM metrics allows the administrator to gather diagnostic data non-invasively.
DCGM-exporter provides critical metrics that help identify the root cause:
Temperature metrics (DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_MEMORY_TEMP): If temperatures were approaching or exceeding thermal limits (typically 85°C+ for GPUs) before the dropout, this points to thermal throttling or thermal shutdown
Power metrics (DCGM_FI_DEV_POWER_USAGE): Power spikes exceeding the GPU‘s TDP or power limit could indicate power delivery issues that might cause the GPU to fall off the bus
Clock throttle reasons (DCGM_FI_DEV_CLOCK_THROTTLE_REASONS): This bitmask shows if thermal or power constraints were causing throttling before failure
The DCGM-exporter is designed specifically for this type of monitoring: it exposes real-time metrics that can be scraped by Prometheus and visualized in Grafana, enabling administrators to spot trends and anomalies before complete failure occurs. For intermittent issues like GPU dropouts during heavy training jobs, historical DCGM data is invaluable for correlating failures with temperature or power events.
This approach aligns with the NCP-AII certification's focus on troubleshooting hardware faults and using NVIDIA tools for infrastructure monitoring.
Incorrect:
B. Upgrade the Slurm scheduler to the latest experimental beta version
This is incorrect because Slurm is a workload manager for job scheduling, not a tool for diagnosing GPU hardware issues. Upgrading to an experimental beta version would introduce instability and is completely unrelated to troubleshooting a “GPU Fallen Off the Bus“ error. This action does not help differentiate between hardware and thermal causes and violates best practices of using stable software in production environments.
C. Delete all user data from the shared storage to free up disk space
This is incorrect and destructive. Storage space has no direct relationship to GPU PCIe bus connectivity issues. Deleting user data would not help diagnose whether a GPU dropout is caused by hardware or thermal problems, and it would cause significant data loss. Storage troubleshooting is separate from GPU hardware diagnostics.
D. Replace the high-speed InfiniBand switches with standard 1GbE switches
This is incorrect because network switches (InfiniBand or Ethernet) are unrelated to GPU PCIe connectivity issues. Replacing them would disrupt the entire cluster fabric without addressing the GPU dropout problem. The “GPU Fallen Off the Bus“ error is specific to the GPU‘s connection to the host system via PCIe, not network communication between nodes.
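The fields named above can also be streamed live with dcgmi; field IDs 150 and 155 are commonly the GPU temperature and power-usage fields, but the list should be confirmed against the installed DCGM version:
# List the field IDs available for monitoring on this DCGM version
$ dcgmi dmon -l
# Stream GPU temperature (150) and power usage (155) for GPU 0 once per second
$ dcgmi dmon -i 0 -e 150,155 -d 1000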
Question 44 of 60
44. Question
During the NCCL tests, an engineer notices that the East-West fabric bandwidth is significantly lower than expected for a 400Gb/s InfiniBand network. Which tool or diagnostic should be run to specifically isolate whether the issue is with the NVLink Switch inside the node or the external fabric cabling?
Correct
A. Checking the server BIOS to see if hyper-threading is enabled
Incorrect: Hyper-threading (Intel) or SMT (AMD) primarily affects CPU core utilization for general-purpose compute. While it can impact system performance, it has no direct relationship with the physical bandwidth of the NVLink Switch or the InfiniBand fabric. It cannot isolate hardware-level interconnect bottlenecks.
B. Running the nccl-tests with the NVLINK_DISABLE=1 variable
Correct: This is a standard diagnostic method taught in the NCP-AII curriculum for isolating communication paths.
The Logic: By default, NCCL uses the fastest available path (NVLink for intra-node, InfiniBand for inter-node). By setting the environment variable NVLINK_DISABLE=1, you force NCCL to bypass the internal NVLink mesh and use the next available path (typically PCIe or the network fabric) even for local communication.
The Result: If performance is still poor with NVLink disabled, the issue is likely rooted in the external fabric cabling or the network configuration. If the network-only run meets its expected baseline, meaning the slowness disappears once NVLink is out of the path, the internal NVLink Switch or its driver state was the bottleneck.
C. Executing a burn-in test on the local NVMe storage drives
Incorrect: NVMe burn-in tests stress the storage subsystem‘s IOPS and throughput. While storage is a critical part of an AI cluster, it is logically separate from the high-speed GPU-to-GPU fabric. A storage test will not provide diagnostic data regarding NCCL bandwidth or NVLink integrity.
D. Using the ibstat command to check the port state on the BMC
Incorrect: ibstat is a powerful tool for checking InfiniBand port status, but it is run from the host OS, not the BMC. Furthermore, while it can show if a link is “Active“ or “Down,“ it cannot differentiate between a bottleneck caused by internal NVLink issues versus external cabling quality during an active collective operation.
Question 45 of 60
45. Question
An administrator needs to install the NVIDIA Enroot and Pyxis tools on a Slurm cluster to support containerized AI workloads. What is the primary reason for using these specific tools instead of a standard Docker daemon for running distributed AI training jobs?
Correct
Correct: C. Enroot and Pyxis allow for unprivileged container execution, which improves security and integrates more natively with the Slurm resource manager.
This option is correct because it accurately describes the fundamental reasons why Enroot and Pyxis are the preferred container solution for Slurm-managed AI clusters, as documented in NVIDIA‘s reference architectures for systems like the DGX SuperPOD.
Unprivileged Execution and Security: Standard Docker environments require elevated privileges (root access) and run a persistent daemon, which presents security concerns on shared HPC resources. In contrast, Enroot is designed to run containers entirely in user space with no privileged daemons or root access required. This unprivileged model is critical for multi-tenant environments where users must not have the ability to escalate privileges or interfere with other users‘ workloads. Containers on NVIDIA DGX systems run as the submitting user, with file permissions matching that user rather than root.
Native Slurm Integration via Pyxis: Pyxis is a SPANK plugin for Slurm that extends Slurm‘s functionality to launch jobs directly into containers using familiar Slurm commands like srun and sbatch. This integration allows containerized jobs to be treated as native Slurm jobs, supporting MPI parallelism, multi-node execution, and GPU allocation through standard Slurm directives (e.g., --gpus-per-task, --ntasks-per-node); a short usage sketch follows this list of points.
Simplified GPU Access: Unlike Docker, which requires additional runtime flags (--runtime=nvidia or --gpus all) for GPU passthrough, Enroot provides built-in GPU support using libnvidia-container, automatically configuring containers to leverage the underlying NVIDIA GPU hardware. This seamless integration with NVIDIA hardware is essential for AI workloads.
Lightweight and HPC-Optimized: Enroot is specifically optimized for HPC environments, using squashfs images that can be shared across nodes and started quickly without the overhead of a container daemon.
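As a brief, hedged usage sketch of that integration (it assumes Pyxis is installed as a Slurm SPANK plugin, the node can pull from nvcr.io, and the image tag shown is purely illustrative):
# Pyxis adds container flags to srun; the job runs unprivileged as the submitting user
srun -N2 --ntasks-per-node=8 --gpus-per-task=1 \
     --container-image=nvcr.io#nvidia/pytorch:24.01-py3 \
     --container-mounts=/datasets:/datasets \
     python train.py
# Enroot can also import, unpack, and start the same image entirely in user space
enroot import docker://nvcr.io#nvidia/pytorch:24.01-py3
enroot create nvidia+pytorch+24.01-py3.sqsh     # file name follows enroot's usual naming; adjust to whatever import actually produces
enroot start nvidia+pytorch+24.01-py3 nvidia-smi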
Incorrect:
A. Enroot is a specialized operating system that must be installed on the GPU itself, replacing the need for a host Linux distribution.
This is incorrect. Enroot is a container runtime, not an operating system. It runs on the host Linux distribution and creates isolated user-space environments for applications. GPUs do not run operating systems directly; they are hardware devices managed by the host through drivers. Enroot does not “replace“ the host OS; it runs as a userspace tool alongside it.
B. Docker is not capable of accessing more than one GPU at a time, whereas Enroot can aggregate all GPUs in a cluster into a single virtual device.
This is incorrect on multiple counts. Docker is fully capable of accessing multiple GPUs simultaneously using the --gpus option or the NVIDIA Container Toolkit. Enroot does not “aggregate“ GPUs into a single virtual device; it preserves direct access to individual GPUs as allocated by Slurm. Both tools can access multiple GPUs, but Enroot‘s advantage is its unprivileged architecture and HPC integration, not any special GPU aggregation capability.
D. Standard Docker is too fast for AI workloads, causing the GPUs to overheat, while Enroot adds necessary latency to protect the hardware.
This is completely incorrect and has no basis in reality. Container runtimes do not affect GPU processing speed or thermal characteristics; GPU workloads determine utilization and heat generation. Enroot‘s startup time may actually be slightly longer than Docker‘s due to image unpacking, but this is infrastructure overhead, not intentional “latency to protect hardware.“ Thermal management is handled by GPU firmware, power capping, and cooling systems, not container runtimes.
Question 46 of 60
46. Question
A cloud service provider is using NVIDIA BlueField-3 DPUs to manage the physical layer of their AI infrastructure. They want to offload networking and security tasks from the host CPU to the DPU. Which configuration is required to allow the BlueField-3 to operate in a mode where it can manage its own embedded ARM operating system and hardware accelerators?
Correct
A. The DOCA drivers must be installed on the BMC
Incorrect: While the host BMC (Baseboard Management Controller) and the DPU‘s internal BMC can communicate for power control and monitoring, DOCA (Data Center-on-a-Chip Architecture) is a software framework that runs on the host CPU (to provide drivers) and the DPU ARM cores (to run applications). It does not “bridge“ the management network through the server‘s BMC in the manner described.
B. The DPU must be configured in DPU-Mode
Correct: This is the standard architectural requirement for an AI Factory.
DPU-Mode (or ECPF Mode): In this mode, the BlueField-3 operates as a fully independent “computer-in-front-of-a-computer.“ The internal ARM cores are active and boot their own Linux-based operating system (typically Ubuntu).
Ownership: The DPU owns the network resources. All traffic to/from the host must pass through the DPU‘s internal virtual switch (eSwitch) managed by the ARM cores. This allows for the offloading of networking (OVS/OVN), security (firewalls/encryption), and storage (NVMe-oF) tasks from the host CPU.
Configuration: This mode is enabled via mlxconfig by setting INTERNAL_CPU_OFFLOAD_ENGINE=0 (which, counter-intuitively in firmware logic, enables the DPU-mode embedded functionality).
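A hedged example of how this looks from the host, assuming the Mellanox firmware tools (MFT) are installed and that the device path below matches your BlueField-3 (device names vary; confirm with mst status):
# Start the MST service and query the current internal-CPU settings
mst start
mlxconfig -d /dev/mst/mt41692_pciconf0 q | grep INTERNAL_CPU
# Set embedded (DPU) mode; the change takes effect only after a firmware reset or cold reboot
mlxconfig -d /dev/mst/mt41692_pciconf0 s INTERNAL_CPU_MODEL=1 INTERNAL_CPU_OFFLOAD_ENGINE=0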
C. The administrator must enable MIG on the BlueField-3 DPU
Incorrect: MIG (Multi-Instance GPU) is a technology exclusive to NVIDIA GPUs (like the A100 and H100) to partition compute and memory. It is not applicable to DPUs. While the DPU can create multiple virtual network interfaces (vNICs) using SR-IOV or Scalable Functions (SFs), this is not called MIG.
D. The DPU must be set to NIC-Only mode
Incorrect: In NIC-Only mode, the BlueField-3 behaves exactly like a standard ConnectX network adapter. The internal ARM cores are deactivated (or non-functional), and the host CPU is responsible for all network processing. This mode is the opposite of what is required to offload tasks to the DPU‘s internal OS.
Question 47 of 60
47. Question
When configuring Multi-Instance GPU (MIG) for a diverse set of AI workloads, an administrator needs to understand the limitations of the partitioning profiles. If an H100 GPU is partitioned into several 1g.10gb instances, which of the following statements accurately describes the resource allocation for those specific instances?
Correct
A. MIG instances are virtualized via a hypervisor.
Incorrect: MIG is a hardware-level partitioning technology, not a software virtualization feature. While MIG can be used within virtual machines (via NVIDIA vGPU), the partitioning itself happens at the GPU hardware level. It does not incur a “20% overhead“ due to hypervisor context switching or memory transaction translation.
B. Each instance has access to the full 80GB of GPU memory.
Incorrect: This describes the opposite of MIG‘s architecture. In MIG mode, the memory is physically partitioned. If you are using a 1g.10gb profile, that instance is strictly limited to its assigned 10GB slice. It cannot “burst“ into the rest of the 80GB, ensuring that one tenant‘s memory usage never impacts another‘s available capacity.
C. Each instance is allocated a fixed slice of the GPU‘s hardware.
Correct: According to the NCP-AII curriculum, MIG provides Spatial Partitioning.
Fixed Resources: For an H100 80GB, a 1g.10gb profile allocates 1/7th of the GPU‘s compute resources (Streaming Multiprocessors) and 1/8th of the total memory (approximately 10GB of HBM3).
Isolation: Each slice includes its own dedicated hardware paths through the memory controllers, crossbar ports, and L2 cache banks.
Predictable Performance: Because the compute and memory channels are physically separated, the instance behaves like a standalone, smaller GPU with deterministic latency and throughput.
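For reference, a minimal nvidia-smi sequence for carving an H100 into 1g.10gb instances (profile names should be confirmed against the nvidia-smi mig -lgip output for your driver, and enabling MIG mode may require draining workloads and resetting the GPU):
# Enable MIG mode on GPU 0
nvidia-smi -i 0 -mig 1
# List the GPU instance profiles this GPU and driver expose
nvidia-smi mig -lgip
# Create two 1g.10gb GPU instances and their default compute instances
nvidia-smi mig -i 0 -cgi 1g.10gb,1g.10gb -C
# Verify the resulting MIG devices
nvidia-smi -L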
D. The instances share a single pool of memory.
Incorrect: This statement describes CUDA Multi-Process Service (MPS) or standard time-slicing, where multiple processes share the same memory address space and hardware units. MIG specifically exists to move away from this “shared pool“ model to provide fault isolation and guaranteed resource availability.
Question 48 of 60
48. Question
An administrator is using the NVIDIA ClusterKit to perform a multifaceted node assessment on a newly deployed AI factory. The tool reports a failure in the ‘NCCL All-Reduce‘ test on one specific node within the 32-node cluster. What should be the next logical step in the verification process to isolate the fault between the physical layer and the software stack?
Correct
Correct: A. Run a single-node NCCL test on the failing node to determine if the issue is with the internal NVLink fabric or the external InfiniBand E/W fabric.
This option is correct because it represents a logical, step-by-step approach to fault isolation when an NCCL all_reduce test fails on a specific node during a cluster-wide assessment with NVIDIA ClusterKit. ClusterKit is explicitly designed to conduct GPU communication tests, including NCCL bandwidth and latency evaluations, as part of its node assessment capabilities. When a failure is reported on one node, the administrator must systematically isolate the fault.
Running a single-node NCCL test is the standard methodology for distinguishing between intra-node and inter-node communication issues. As documented in GPU benchmarking guides, NCCL tests can be configured to run on a single node to specifically benchmark NVLink performance between GPUs within that node. If the single-node test succeeds with expected bandwidth, the problem is isolated to the external fabric (InfiniBand East-West network). If the single-node test also fails, the issue lies within the node itself, whether in the NVLink connections, GPU-to-GPU communication paths, or the local PCIe topology.
This diagnostic approach aligns with NCCL‘s architecture: NVLink handles intra-node GPU communication, while InfiniBand (or RoCE) handles inter-node communication. By running the test with -N1 -n8 --gpus-per-task=1 parameters (single node, all GPUs), the administrator specifically exercises NVLink without involving the external fabric. Comparing these results against the multi-node test that failed helps pinpoint whether the problem is physical-layer (cables/optics/transceivers) or software-stack (drivers/firmware) related.
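A sketch of that targeted run, assuming an MPI-enabled build of nccl-tests is available at ./build, Slurm GRES is configured for the GPUs, and the hostname node07 stands in for the failing node:
# Constrain the all_reduce test to the suspect node only: NVLink is exercised, the IB fabric is not
srun -N1 -n8 --gpus-per-task=1 -w node07 ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
# For comparison, the failing cluster-wide run crossed the East-West fabric, e.g.:
srun -N2 -n16 --gpus-per-task=1 ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1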
Incorrect:
B. Immediately replace the motherboard of the failing node and the two adjacent nodes to ensure no electromagnetic interference is occurring.
This is incorrect because it represents premature and excessive hardware replacement without proper diagnosis. Replacing hardware without first isolating the fault through systematic testing violates fundamental troubleshooting principles. There is no indication that electromagnetic interference (EMI) is the cause, and replacing adjacent nodes unnecessarily disrupts the cluster and risks introducing new issues. The correct approach is to use diagnostic tools like NCCL tests to gather data before considering hardware replacement.
C. Update the NGC CLI on all nodes and re-pull the NeMo burn-in container image, as the failure is likely due to a corrupted container layer.
This is incorrect because a single-node NCCL test failure reported by ClusterKit is not indicative of container corruption. ClusterKit performs bare-metal assessments of node hardware and communication fabrics. While container issues can cause application-level failures, the NCCL all_reduce test operates at a lower level to validate GPU-to-GPU communication paths. Updating NGC CLI and re-pulling containers across all nodes is a broad, unfocused action that does not help isolate the fault between physical and software layers. The proper next step is targeted testing on the failing node.
D. Configure a new Slurm partition that excludes the failing node and rerun the cluster-wide NCCL test to see if the bandwidth increases by a factor of ten.
This is incorrect because excluding the node bypasses the problem rather than diagnosing it. While excluding a failed node may be a temporary workaround to allow other jobs to run, it does not isolate the fault or help understand whether the issue is physical (cables/optics) or software-related. The question specifically asks for the next logical step in the verification process to isolate the fault, and excluding the node provides no diagnostic information. Furthermore, expecting bandwidth to increase by “a factor of ten“ has no technical basis; removing one node from a 32-node cluster would not produce such a dramatic improvement. The correct approach is to run targeted tests on the failing node.
Question 49 of 60
49. Question
When configuring a BlueField-3 Data Processing Unit (DPU) to act as a secure network platform for an AI cluster, the administrator needs to isolate the management plane from the data plane. Which architectural feature of the BlueField platform allows for the offloading of security policies and telemetry without consuming the host CPU cycles of the NVIDIA HGX server?
Correct
A. The direct connection to the system‘s local SATA storage drives
Incorrect: While some BlueField-3 models feature M.2 or U.2 connectors for direct-attached NVMe storage to enable storage offloading (BlueField SNAP), they do not use legacy SATA connections as a primary architectural feature for isolating the management plane. Furthermore, storage connectivity alone does not provide the compute power needed to offload security policies.
B. The integrated ARM cores and programmable hardware accelerators
Correct: This is the foundational architecture of the BlueField-3 DPU.
Independent Management Plane: The DPU features up to 16 ARM Cortex-A78 cores that run their own separate Linux operating system (the “DPU OS“). This allows the DPU to act as an independent compute node within the server, effectively isolating the management and control plane from the host‘s x86 environment.
Security Offloading: Programmable hardware accelerators (such as the Regular Expression/RegEx engine, Public Key Accelerator, and IPsec/TLS inline encryption) handle security policies and telemetry at line rate.
Zero Host Impact: Because these tasks run on the DPU‘s ARM cores and dedicated silicon, they consume zero host CPU cycles, ensuring the HGX server‘s CPUs and GPUs remain fully dedicated to AI training and inference.
C. The secondary Ethernet port used for legacy BMC management
Incorrect: While BlueField-3 DPUs include a 1GbE Out-of-Band (OOB) management port, this port is used for accessing the DPU‘s own management interface (BMC). It is a physical interface, not the architectural feature that enables the complex offloading of security policies and data-plane telemetry.
D. The standard PCIe Gen5 bus connection to the system motherboard
Incorrect: The PCIe Gen5 bus is the physical interconnect that allows the DPU to communicate with the host. While the high bandwidth is necessary, the PCIe bus itself is a transport mechanism, not a “programmable feature“ that performs offloading. In fact, a standard NIC also uses PCIe but lacks the ARM cores and accelerators required to run independent security services.
Question 50 of 60
50. Question
To enable seamless GPU-accelerated container execution, the administrator must install the NVIDIA Container Toolkit. Which of the following is a primary function of the NVIDIA Container Toolkit when integrated with Docker on an AI compute node in the cluster?
Correct
A. It automatically recompiles the PyTorch source code…
Incorrect: The NVIDIA Container Toolkit does not perform code compilation or optimization of deep learning frameworks like PyTorch or TensorFlow. Optimization is typically handled by the frameworks themselves (using libraries like cuDNN or TensorRT) or by using pre-optimized containers from the NVIDIA NGC (Private Registry).
B. It provides a container runtime library that allows Docker to interface with GPU drivers…
Correct: This is the core architectural purpose of the toolkit as defined in the NCP-AII curriculum.
The Problem: Standard containers are isolated from the host hardware and do not “see“ the GPUs.
The Solution: The toolkit includes the nvidia-container-runtime, which acts as a wrapper around the standard runc. When a container is started with the --gpus flag, the toolkit injects the necessary NVIDIA user-mode drivers, libraries, and device files (/dev/nvidia*) into the container‘s namespace. This allows the application inside the container to communicate directly with the host‘s NVIDIA kernel driver.
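A quick, hedged way to confirm that wiring on a compute node (the CUDA image tag is illustrative; any recent CUDA base image works):
# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# The --gpus flag triggers injection of the user-mode driver libraries and /dev/nvidia* device nodes
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
If nvidia-smi inside the container lists the host GPUs, the toolkit, driver, and Docker runtime are correctly integrated.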
C. It manages the power distribution to the GPUs by throttling clock speeds…
Incorrect: Power management, clock throttling, and thermal regulation are handled by the NVIDIA Driver and the GPU firmware, often monitored by the NVIDIA System Manager (NVSM) or DCGM. The Container Toolkit operates at the software orchestration layer and does not have the authority or functionality to manage physical power distribution.
D. It acts as a virtual hypervisor that allows BlueField-3 to run multiple instances of Ubuntu…
Incorrect: This is a misunderstanding of both the toolkit and the DPU. The BlueField-3 DPU runs its own internal OS (typically Ubuntu) on its ARM cores, but it is not a “Docker container“ in this context. The Container Toolkit is for enabling GPU access, whereas the DPU is managed through the DOCA framework.
Question 51 of 60
51. Question
When configuring the cluster interfaces in Base Command Manager, the administrator must define the category and network settings for the compute nodes. Why is it important to correctly configure the category in BCM, and how does it affect the installation of software like Slurm, Enroot, and Pyxis?
Correct
A. The category determines the maximum CPU frequency…
Incorrect: While BCM can manage some hardware settings via the BIOS/BMC, the “Category“ is not primarily an overclocking tool. CPU frequency is typically managed by the OS kernel‘s scaling governor or BIOS power profiles, not by the BCM category template itself.
B. The category defines the physical color of the server chassis…
Incorrect: This is irrelevant to cluster management. Physical identification of servers is handled through LED UID (Unit Identification) buttons or rack location data stored in the BMC/CMDB, not by a logical software “Category“ in the management interface.
C. The category is a billing tag…
Incorrect: While categories can be used for organizational grouping, they are functional, not just descriptive. In a professional NVIDIA-certified environment, the Category is the “brain“ of the node‘s software personality. Dismissing it as a “billing tag“ ignores its role in automated deployment.
D. The category acts as a template that defines software packages, kernel parameters, and configuration files.
Correct: This is the core architectural principle of BCM taught in the NCP-AII curriculum.
Software Profile Integration: A Category links a group of nodes to a specific Software Image. For an AI cluster, this image contains the NVIDIA drivers, CUDA, and the Slurm/Enroot/Pyxis stack.
Automation: When an administrator adds a node to the “Compute“ category, BCM automatically pushes the required packages (like slurm-node, enroot, and the pyxis plugin) and configures the slurm.conf file based on the category‘s template.
Consistency: It ensures that every node in the group has identical kernel parameters (e.g., IOMMU settings for InfiniBand) and identical software versions, which is critical for preventing “jitter“ during distributed training.
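A hedged cmsh sketch of that workflow (the category, image, and node names are illustrative and depend on how the cluster was deployed; treat this as a pattern rather than exact commands):
# Assign a node to the GPU compute category and commit the change
cmsh -c "device; use node001; set category dgx-compute; commit"
# Inspect what the category will push to its members, including the software image
cmsh -c "category; use dgx-compute; show"
cmsh -c "category; use dgx-compute; get softwareimage"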
Question 52 of 60
52. Question
An administrator is configuring a third-party storage solution to be used as the primary data lake for an AI cluster. Which initial parameter configuration is most important to ensure that the storage system can keep up with the high-throughput requirements of multiple GPU nodes during a large-scale training job using the NVIDIA collective communications library?
Correct
A. Restrict the number of concurrent connections…
Incorrect: Restricting connections to one per node would severely bottleneck performance. AI nodes (especially HGX systems with 8 GPUs) often require multiple parallel I/O streams to saturate the available network bandwidth. Limiting concurrency prevents the cluster from reaching its aggregate throughput potential.
B. Configure the storage export with appropriate MTU sizes and ensure network isolation.
Correct: This follows NVIDIA's Best Practices for storage in AI clusters.
MTU Sizes: For high-throughput storage networks (like RoCE or InfiniBand), using Jumbo Frames (MTU 9000) or consistent fabric-wide MTUs (like 4096 in InfiniBand) reduces header overhead and CPU interrupts, allowing for more efficient data transfer.
Isolation: Isolating the Data Plane (storage and GPU-to-GPU traffic) from the Management/OOB Plane ensures that administrative tasks or telemetry data do not cause congestion or packet loss on the high-speed fabric, which is critical for maintaining NCCL stability.
C. Enable heavy data compression and deduplication…
Incorrect: While these features save space, they are extremely CPU-intensive on the storage controller. For AI workloads, the overhead of decompressing data in real-time can introduce significant latency and jitter, leading to "GPU starvation" where the GPUs sit idle waiting for the next batch of data.
D. Set the storage system to use a single parity RAID 5 configuration.
Incorrect: RAID 5 is generally discouraged for primary AI data lakes because of the "write penalty" associated with calculating parity. During large-scale checkpointing, RAID 5 can become a bottleneck. Furthermore, the rebuild times for large drives in RAID 5 are slow, increasing the risk of data loss compared to RAID 6 or Erasure Coding.
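As a hedged illustration of the MTU point (the interface and host names are hypothetical, and InfiniBand fabrics set MTU through the subnet manager rather than this way), enabling and verifying jumbo frames on an Ethernet/RoCE storage interface could look like:
ip link set dev ens1f0 mtu 9000
ping -M do -s 8972 storage-nfs01   # 8972-byte payload + 20-byte IP + 8-byte ICMP headers = 9000
A successful reply with fragmentation prohibited ("-M do") confirms that every hop on the storage path honors the 9000-byte MTU; a "message too long" error points to a switch port or peer still running a smaller MTU.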
Question 53 of 60
53. Question
When designing the network topology for a large-scale AI factory, an architect must determine the correct cabling for the InfiniBand East-West compute fabric. The distance between the Leaf switches and the Spine switches is approximately 45 meters. Which combination of transceiver types and cabling should be selected to ensure reliable 400Gb/s connectivity while maintaining signal integrity according to NVIDIA best practices?
Correct
Correct: D. Active Optical Cables (AOC) or Multimode Fiber with 400G-SR4 transceivers, which are specifically designed for high-speed data center reaches between 30 and 100 meters.
This is the correct choice. The distance in question is 45 meters, which falls squarely within the optimal ranges defined by NVIDIA for both Active Optical Cables (AOCs) and Multi-mode optics with SR4 transceivers.
NVIDIA Best Practices: NVIDIA's cable management guidelines explicitly state that AOCs are used from 3 meters to about 100 meters. Additionally, multi-mode optics like SR4 (Short Range 4 Channels) are designed to be used up to 100 meters.
Technical Alignment: For 400G InfiniBand (NDR) fabrics, NVIDIA's structured cabling requirements confirm that Multi-Mode (MM) cabling is approved for distances up to 50 meters, which comfortably covers the 45-meter requirement in the scenario. This combination ensures signal integrity without the cost and complexity of long-range single-mode optics.
Incorrect: A. Standard Category 6 Ethernet cables using RJ45 connectors to leverage existing enterprise-grade patch panels for the high-bandwidth compute fabric.
This option is incorrect and does not align with NVIDIA's design for high-performance AI factories.
Incompatible Technology: Category 6 Ethernet cables and RJ45 connectors are not designed for the high-bandwidth, low-latency requirements of an InfiniBand East-West compute fabric, especially at 400Gb/s. InfiniBand networks for AI use different physical layers (like optical or direct-attach copper) and transceiver form factors (like QSFP or OSFP), not standard copper Ethernet cabling.
NVIDIA Best Practices: NVIDIA documentation focuses on DACs, AOCs, and optical transceivers for these high-speed interconnects. Using enterprise-grade Ethernet cabling would not maintain signal integrity or support the InfiniBand protocol at the required speed.
B. Passive Copper Direct Attach Cables (DAC) due to their low power consumption and cost-effectiveness for distances up to 50 meters.
While the premise about cost and power is correct, the distance limitation makes this option invalid for this specific scenario.
Distance Limitations: Although the option states DACs work up to 50 meters, NVIDIA's guidance clarifies that at very high speeds (like 400Gb/s), the effective distance for passive DACs is significantly shorter due to signal attenuation. The documentation notes that after 2-5 meters (rate dependent), the signal attenuation becomes significant. For 25G-NRZ and 50G PAM4-based cables, distances are limited to 5 and 3 meters respectively.
Application Mismatch: At 45 meters, a passive DAC would fail to maintain signal integrity, making it an unreliable choice for the compute fabric, despite its theoretical maximum.
C. Single-mode Fiber with 400G-LR4 transceivers, as they are the only optical technology capable of supporting InfiniBand HDR or NDR protocols over 10 meters.
This option is incorrect due to both technological inaccuracy and unnecessary over-engineering.
Factually Incorrect: The statement that single-mode fiber is the only technology capable of supporting HDR or NDR over 10 meters is false. As confirmed by the correct answer, multi-mode fiber and AOCs are the standard and recommended solutions for this distance range.
Not Cost-Effective: 400G-LR4 transceivers are designed for long-reach applications (up to 10 km). Using them for a 45-meter link within a data center would be a significant and unnecessary cost, as they are far more expensive than the SR4 or AOC alternatives recommended for short-reach connectivity.
Question 54 of 60
54. Question
During the setup of a large-scale AI factory, an administrator needs to validate the network topology for the compute fabric. The design calls for a rail-optimized InfiniBand topology to maximize collective communication performance. Which hardware validation step is most important to confirm that the physical cabling matches the intended logical topology and provides the necessary bandwidth for GPUDirect RDMA operations?
Correct
Correct: C. Verify that all transceivers are correctly seated and use the ibnetdiscover tool to map the physical connections against the fabric design manifest.
This is the correct choice for validating a rail-optimized InfiniBand topology for GPUDirect RDMA operations.
Tool Functionality: The ibnetdiscover utility is specifically designed to scan and discover InfiniBand topology, outputting a complete map of physical connections, node types (switches, hosts), and port information. It allows the administrator to verify that the actual physical cabling matches the intended logical design documented in the fabric manifest.
NVIDIA Best Practices: The NCP-AII certification exam blueprint explicitly includes "Confirm cabling is correct" and "Validate cables by verifying signal quality" under the "Cluster Test and Verification" domain, which comprises 33% of the exam. This validation step is critical for rail-optimized topologies where each GPU communicates with a dedicated network adapter, requiring precise physical connectivity to maximize GPUDirect RDMA performance.
GPUDirect RDMA Relevance: For GPUDirect RDMA operations, which allow direct data transfer between GPUs across nodes without CPU involvement, correct physical cabling is essential. Mismatched cables would disrupt the dedicated communication paths (rails) between GPUs, severely impacting collective communication performance.
Incorrect: A. Configure the TPM modules on the compute nodes to encrypt the InfiniBand headers before they reach the leaf switches in the rack.
This option is incorrect for hardware topology validation.
TPM Purpose Mismatch: TPM (Trusted Platform Module) configuration falls under "System and Server Bring-up" in the certification blueprint, specifically for security and encryption purposes, not for physical topology validation.
Wrong Validation Focus: Encrypting InfiniBand headers does not help confirm physical cabling correctness or verify that the rail-optimized topology is properly implemented. This action addresses security, not connectivity verification.
Irrelevant to GPUDirect RDMA: TPM configuration has no direct impact on verifying the physical connections required for GPUDirect RDMA operations.
B. Execute the nvidia-smi command with the -mig 1 flag on the head node to check if the network cards are visible to the Slurm scheduler.
This option is incorrect due to command misuse and purpose mismatch.
Incorrect Command Usage: The -mig flag in nvidia-smi is used for managing Multi-Instance GPU (MIG) partitioning, not for checking network card visibility to schedulers. This flag would not provide information about InfiniBand adapters.
Wrong Validation Layer: While checking network card visibility to Slurm is a valid validation step, it belongs to the "Control Plane Installation and Configuration" domain. This verifies software integration, not physical cabling correctness against a design manifest.
GPUDirect RDMA Context: Even if network cards are visible to Slurm, this does not confirm that the physical cabling matches the rail-optimized topology required for optimal GPUDirect RDMA performance.
D. Install the NVIDIA DOCA drivers on the BMC to allow for the automated mapping of the NVLink Switch cable signal quality across the fabric.
This option is incorrect due to component and interface mismatch.
Wrong Management Interface: BMC (Baseboard Management Controller) is used for out-of-band management and server health monitoring, not for running DOCA drivers or mapping fabric topology.
NVLink vs. InfiniBand Confusion: NVLink Switch refers to high-speed GPU-to-GPU connections within a server, not the InfiniBand fabric between servers. The question specifically addresses the InfiniBand East-West compute fabric for inter-node communication.
Incorrect Tool for the Task: While DOCA is an NVIDIA SDK for BlueField DPUs, installing it on BMC would not help map physical InfiniBand connections against a design manifest. The correct tool for topology discovery is ibnetdiscover, not DOCA on BMC.
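As a simple sketch of this workflow (the file names are illustrative, and comparing files this way assumes the design manifest is kept in the same topology format), the as-built fabric can be captured and checked against the design:
ibnetdiscover > fabric-asbuilt.topo   # dumps every switch, HCA, and port-to-port link discovered on the fabric
diff fabric-design.topo fabric-asbuilt.topo
Any miscabled rail shows up as a differing link line; tools such as iblinkinfo or ibdiagnet can then be used to check the width, speed, and signal quality of the suspect links.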
Question 55 of 60
55. Question
To enable seamless containerized AI workloads, an engineer is installing the NVIDIA Container Toolkit on a fresh Ubuntu installation. After adding the package repositories and installing the nvidia-container-toolkit package, what is the mandatory next step to ensure that the Docker runtime can actually utilize the NVIDIA GPUs?
Correct
Correct: A. The engineer must run the command sudo nvidia-ctk runtime configure --runtime=docker and then restart the Docker service using sudo systemctl restart docker.
This is the correct mandatory next step after installing the nvidia-container-toolkit package.
Tool Functionality: The nvidia-ctk runtime configure command is the NVIDIA Container Toolkit's utility for automatically configuring container engines like Docker. When run with the --runtime=docker flag, it properly registers the NVIDIA runtime with Docker by updating the daemon configuration file (/etc/docker/daemon.json). This registration is what enables Docker to recognize and utilize NVIDIA GPUs inside containers.
NVIDIA Documentation Confirmation: Multiple official NVIDIA sources confirm this exact sequence. After installing the NVIDIA Container Toolkit, the next step is to configure the runtime using nvidia-ctk and then restart the Docker service to apply the configuration. The restart command sudo systemctl restart docker ensures the new runtime configuration is loaded by the Docker daemon.
NCP-AII Exam Alignment: The certification exam blueprint explicitly lists "Install the NVIDIA container toolkit" and "Demonstrate how to use NVIDIA GPUs with Docker" under the Control Plane Installation and Configuration domain, which comprises 19% of the exam. This question directly addresses that knowledge area, and the correct procedure matches NVIDIA's official documentation.
Without This Step: Simply installing the toolkit does not automatically configure Docker to use it. The runtime configuration step is required to establish the integration between the container toolkit and the Docker engine.
Incorrect: B. The engineer must manually edit the /etc/fstab file to mount the GPU device nodes into the /var/lib/docker directory before starting any containers.
This option is incorrect and does not reflect NVIDIA's documented procedure.
Wrong Configuration File: The /etc/fstab file is used for filesystem mount points at system boot (e.g., disk partitions, network storage), not for configuring container runtime GPU access. GPU device nodes are managed by the NVIDIA driver and container runtime, not through fstab entries.
Incorrect Mount Location: The /var/lib/docker directory is Docker's storage area for images, containers, and volumes. Manually mounting GPU devices there would not enable GPU access inside containers. The NVIDIA container runtime handles GPU device injection into containers dynamically.
No NVIDIA Documentation Support: There is no reference in NVIDIA documentation to editing fstab as part of the container toolkit installation or configuration process. This approach is not recognized in any official NVIDIA guides.
C. The engineer must install the nvidia-docker3 meta-package, which automatically replaces the standard Docker binary with a GPU-aware version developed by NVIDIA.
This option is incorrect and relies on outdated or inaccurate information.
Outdated Approach: The older nvidia-docker packages (nvidia-docker, nvidia-docker2) have been superseded by the NVIDIA Container Toolkit. The current architecture uses the nvidia-container-toolkit package, which integrates with Docker through runtime configuration rather than replacing the Docker binary.
No Binary Replacement: The NVIDIA Container Toolkit does not replace the standard Docker binary. It adds NVIDIA's runtime as an additional runtime option that Docker can use, leaving the original Docker installation intact. The toolkit registers a runtime, not a replacement binary.
Documentation Confirmation: Current NVIDIA documentation focuses on the nvidia-container-toolkit package and the nvidia-ctk runtime configure command, not on nvidia-docker3 meta-packages.
D. The engineer must compile the NVIDIA driver from source code to ensure that the kernel-level hooks for the container toolkit are correctly registered with the CPU MMU.
This option is incorrect and represents a significant misunderstanding of NVIDIA driver installation.
Unnecessary Compilation: NVIDIA drivers are distributed as pre-compiled binaries through package repositories. Compiling from source is neither required nor recommended for standard NVIDIA driver installation on Ubuntu systems.
Kernel Hooks and MMU Confusion: The reference to "kernel-level hooks registered with the CPU MMU" demonstrates a misunderstanding of how GPU access works. The NVIDIA driver loads kernel modules that provide device access, but this is separate from and prerequisite to the container toolkit installation, not a post-installation step for container configuration.
Wrong Sequence: Driver installation must occur before installing the NVIDIA Container Toolkit, not after. The question specifies the engineer has already added repositories and installed the nvidia-container-toolkit package, implying drivers should already be present. Compiling drivers at this stage would be both incorrect sequencing and unnecessary complexity.
No Documentation Support: NVIDIA's official installation guides never require source compilation for standard Ubuntu installations. The recommended approach uses package managers to install pre-built drivers.
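As a hedged sanity check after the configure-and-restart step (the CUDA base image tag below is illustrative), the engineer can confirm the integration by launching a GPU-enabled container:
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
If the runtime is registered correctly, nvidia-smi inside the container lists the host's GPUs; an error about a missing NVIDIA runtime or no visible devices indicates the nvidia-ctk configuration or the Docker restart was skipped.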
Question 56 of 60
56. Question
A network administrator is configuring an NVIDIA BlueField-3 Data Processing Unit (DPU) to act as a secure network platform for an AI cluster. The goal is to offload networking tasks from the host CPU. Which specialized software environment must be utilized on the DPU to manage the physical network resources and implement customized acceleration logic through the DOCA framework?
Correct
Correct: D. The DPU Operating System (typically Ubuntu or CentOS) The BlueField-3 DPU contains its own dedicated ARM CPU cores, memory, and network controllers. To function as a "secure network platform," it requires a full-fledged execution environment independent of the host x86 server. This DPU OS is where the DOCA (Data Center Infrastructure-on-a-Chip Architecture) runtime and SDK are installed. It is the only environment listed that allows a developer or admin to load drivers, run DOCA services (like Firefly for timing or Telemetry for monitoring), and manage the DPU's hardware accelerators directly.
Incorrect: A. The NVIDIA SMI Management Interface The NVIDIA System Management Interface (nvidia-smi) is a command-line utility used primarily for monitoring and managing NVIDIA GPUs. While it is a critical tool in an AI cluster for checking GPU temperature and memory usage, it does not have the capability to manage the DPU's ARM cores or implement networking logic through the DOCA framework.
B. The CUDA Toolkit Development Environment The CUDA Toolkit is the parallel computing platform for GPU-accelerated applications. While the NCP-AII exam covers how DPUs and GPUs interact (such as through GPUDirect RDMA), the DPU's specialized networking tasks are handled by DOCA, not CUDA. Programming a DPU requires the DOCA SDK, which is distinct from the CUDA environment used for AI model training or inference.
C. The BIOS/UEFI setup menu of the host server The host's BIOS or UEFI is used for the initial hardware handshake. In a BlueField deployment, you might use the BIOS to enable SR-IOV or set the PCIe slot to the correct mode, but the BIOS is static firmware. It cannot "implement customized acceleration logic" or run the high-level DOCA services required to offload networking tasks once the system is up and running.
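As an illustrative sketch (the 192.168.100.2 address is the common default for the BlueField virtual RShim network interface, but addressing, credentials, and service names vary by deployment), an administrator typically reaches the DPU OS out-of-band from the host and works inside it directly:
sudo systemctl status rshim   # on the host, assuming the rshim package is installed; it provides the virtual console/network path to the DPU
ssh ubuntu@192.168.100.2      # log in to the DPU's own Ubuntu instance
Once logged in to the DPU OS, the DOCA runtime, drivers, and services are installed and managed there, not through the host's nvidia-smi, CUDA Toolkit, or BIOS.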
Question 57 of 60
57. Question
To ensure the reliability of the AI infrastructure under sustained load, a 'burn-in' test is performed. What is the primary purpose of executing a NeMo burn-in test specifically in the context of an NVIDIA AI factory deployment?
Correct
Correct: B. To simulate a real-world Large Language Model (LLM) training workload that stresses the GPUs, NVLink, and the network fabric to identify intermittent hardware failures. The NCP-AII curriculum defines the NeMo burn-in as a validation tool that utilizes the NVIDIA NeMo framework to run actual LLM training tasks. Unlike synthetic benchmarks that only stress individual components, this test exercises the entire "full-stack" infrastructure. It pushes the GPUs to peak power, utilizes NVLink for intra-node communication, and saturates the InfiniBand or RoCE network fabric for inter-node communication. The primary goal is to surface "infant mortality" in hardware (such as faulty cables, unstable GPUs, or overheating components) before the system is handed over for production.
Incorrect: A. To verify that the BlueField-3 DPUs can successfully intercept and inspect all encrypted traffic coming from the NVIDIA GPU Cloud (NGC). While BlueField-3 DPUs are part of the AI infrastructure, a NeMo burn-in is focused on compute and fabric stability, not Deep Packet Inspection (DPI) of traffic from NGC. NGC is a registry for containers and models; traffic from it is typically restricted to the initial setup phase, not a sustained "burn-in" load.
C. To erase all existing data on the third-party storage system to prepare it for the installation of the Base Command Manager software suite. A "burn-in" test is a stress test, not a data sanitization or "wipe" utility. While storage performance is vital in an AI Factory, the NeMo burn-in specifically targets the compute and fabric layers. Data erasure is handled by storage-specific management tools or OS-level commands during the provisioning phase.
D. To calibrate the TPM modules on the compute nodes by generating a series of cryptographic keys at maximum GPU clock speeds. Trusted Platform Modules (TPM) are used for security and attestation, not performance calibration. High GPU clock speeds have no functional relationship with the generation of TPM cryptographic keys. A burn-in test aims to identify hardware instability, not to perform security module calibration.
Question 58 of 60
58. Question
A research team needs to run multiple small inference jobs on a single NVIDIA H100 GPU to maximize resource utilization. The administrator decides to use Multi-Instance GPU (MIG) technology. After enabling MIG mode via nvidia-smi, the administrator must create GPU instances and compute instances. What is a critical limitation of MIG that the administrator must consider during the physical layer management of these resources?
Correct
Correct: A. MIG instances cannot share the same hardware encoders/decoders, and once a GPU is partitioned, the aggregate memory bandwidth is divided among the slices. The NCP-AII framework defines MIG as a mechanism that provides strict hardware-level partitioning. When an H100 is sliced into MIG instances, the physical resources, including the memory controllers and the available memory bandwidth, are statically partitioned. This means a single 1g.10gb instance only has access to a fraction of the total H100 memory bandwidth. Furthermore, specialized hardware engines such as NVDEC (video decoders), NVJPG (JPEG decoders), and NVENC (video encoders) are assigned to specific slices. If a slice does not have an assigned engine, it cannot "borrow" those hardware features from another slice, even if they are idle (a command-level sketch of the partitioning workflow follows this explanation).
Incorrect: B. Enabling MIG mode requires the physical removal of the NVLink bridge because the bridges are not compatible with partitioned memory addresses. This is factually incorrect. In an AI Factory deployment, NVLink remains physically connected and functional. While MIG does have specific interactions with NVLink (it typically disables Peer-to-Peer communication between instances within the same physical GPU to ensure isolation), it never requires the physical removal of hardware bridges.
C. A single H100 GPU can only be partitioned into two instances if the host is running a Windows-based operating system instead of a Linux-based one. The NCP-AII certification focuses heavily on Linux-based environments (Ubuntu, CentOS/RHEL), which are the standard for AI Infrastructure. An H100 can be partitioned into up to seven (7) MIG instances. Windows is generally more restrictive for data center GPU features and is not the preferred environment for high-scale MIG deployments in an AI Factory.
D. MIG partitions are only logical and do not provide hardware-level isolation for cache or memory, meaning one tenant can impact the performance of another. This describes temporal partitioning (standard GPU sharing or time-slicing), not MIG. A key selling point of MIG emphasized in the NCP-AII exam is hardware isolation. MIG provides dedicated paths to the memory controllers and a partitioned L2 cache. This ensures Quality of Service (QoS): one tenant's workload cannot create "noisy neighbor" interference that degrades the deterministic performance of another tenant's instance.
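A minimal command-level sketch of the MIG workflow described above, assuming a MIG-capable driver on the target node; GPU index 0 and the 1g.10gb profile are illustrative choices:

sudo nvidia-smi -i 0 -mig 1                         # enable MIG mode on GPU 0 (a GPU reset may be required)
sudo nvidia-smi mig -i 0 -lgip                      # list the GPU instance profiles this GPU supports
sudo nvidia-smi mig -i 0 -cgi 1g.10gb,1g.10gb -C    # create two 1g.10gb GPU instances with default compute instances
nvidia-smi -L                                       # verify the MIG devices are enumerated with their UUIDs

Each resulting 1g.10gb device receives its own slice of memory, memory bandwidth, and whatever engines are mapped to that slice; idle engines belonging to other slices remain unreachable from it.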
Question 59 of 60
59. Question
During the verification phase, an administrator runs the NVIDIA Collective Communications Library (NCCL) tests and observes that the East-West fabric bandwidth is significantly lower than the theoretical maximum. Which of the following is a likely cause that should be investigated in the physical or switch layer?
Correct
Correct: D. Improperly seated transceivers or contaminated fiber optic cables causing high bit-error rates, or a mismatch in the firmware versions between the InfiniBand switches. The NCP-AII curriculum emphasizes that physical layer integrity is the most frequent cause of bandwidth degradation in massive AI clusters. High bit-error rates (BER) caused by "dirty" fiber or poorly seated transceivers trigger retransmissions, which severely penalize the high-throughput, low-latency communication patterns NCCL depends on. Furthermore, consistent performance across the fabric requires synchronized firmware across the NVIDIA Quantum-2 (InfiniBand) or Spectrum-4 (Ethernet) switches so that features such as Adaptive Routing and Congestion Control function correctly (a short list of first-pass link checks follows this explanation).
Incorrect: A. The Slurm scheduler is not properly configured to use the Enroot runtime, causing the NCCL containers to default to CPU-only communication paths. While Slurm and Enroot are standard in the NVIDIA AI stack for container orchestration, a configuration error here would typically lead to a job failure or a "GPU not found" error rather than "significantly lower bandwidth." If NCCL defaulted to CPU-only communication, the performance drop would be so extreme (orders of magnitude) that it would be characterized as a functional failure rather than a bandwidth optimization issue.
B. The MIG profiles on the H100 GPUs are set to a high-priority mode, which automatically reserves 50% of the network bandwidth for BMC management traffic. This is factually incorrect. Multi-Instance GPU (MIG) profiles partition internal GPU resources (compute and memory), not the external network fabric bandwidth. Furthermore, BMC (Baseboard Management Controller) traffic is north-south management traffic and runs on a separate, much slower out-of-band (OOB) network; it never consumes 50% of the high-speed data fabric.
C. The NVIDIA Container Toolkit has not been registered with the NGC CLI, resulting in the bandwidth being throttled by the NVIDIA licensing server. The NVIDIA Container Toolkit is an open-source utility that allows containers to access GPU hardware; it does not require "registration" with the NGC CLI to function. Most importantly, NVIDIA does not "throttle" hardware network bandwidth via a licensing server. In the NCP-AII framework, hardware performance is determined by physical links and driver configurations, not cloud-based license checks.
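A minimal sketch of first-pass physical-layer checks on an InfiniBand fabric, assuming the standard infiniband-diags utilities are installed on a fabric-attached host; exact counter names and thresholds vary by firmware release:

ibstat                      # confirm every port is Active and linked at the expected rate (e.g., NDR/HDR)
ibqueryerrors               # flag ports whose error counters (symbol errors, link downed, etc.) exceed thresholds
ibdiagnet                   # full fabric sweep, including per-port counters and switch firmware versions
nvidia-smi nvlink -e        # NVLink error counters on the local GPUs, to rule out intra-node link issues

Ports with rising symbol or link error counts point to the transceiver or cable on that link, while mismatched switch firmware shows up in the ibdiagnet report.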
Question 60 of 60
60. Question
During the server bring-up phase, the technician notices that one of the GPUs in an 8-GPU HGX baseboard is not being recognized by the operating system, although the other seven are functional. What is the most logical first step in the troubleshooting and fault detection process for an NVIDIA-Certified System in this scenario?
Correct
Correct: D. Check the BMC logs for any hardware alerts, verify the GPU power cables are seated correctly, and use nvidia-smi to check for any excluded or partially initialized devices. The NCP-AII framework prioritizes non-invasive diagnostics as the first step. The Baseboard Management Controller (BMC) is the "source of truth" for hardware health in an NVIDIA-Certified System; it can report power delivery failures, thermal issues, or PCIe training errors specific to a single GPU. Verifying physical power connections is a fundamental check in the "Physical Layer" module of the certification. Finally, nvidia-smi is the primary tool to determine whether a GPU is truly "gone" or simply in a pending or error state (such as a drain state or a thermal slowdown), which provides critical clues before escalating to hardware replacement (a short set of non-invasive first checks follows this explanation).
Incorrect: A. Immediately replace the entire HGX baseboard, as a single GPU failure indicates a total system board defect that cannot be repaired on-site. This contradicts the Field Replaceable Unit (FRU) and troubleshooting logic taught in the NCP-AII curriculum. While an HGX baseboard is complex, replacing the entire assembly is the final step after exhausting all diagnostic possibilities. Jumping to replacement without checking logs is inefficient and does not align with the professional standard of identifying the root cause (e.g., a simple seating issue or a loose power cable).
B. Update the Linux kernel to the latest experimental version to see if new driver support fixes the visibility issue for the single missing GPU. The NCP-AII certification emphasizes stability and the use of NVIDIA-Certified software stacks. Using "experimental" kernels is explicitly discouraged in production AI Factory deployments. Furthermore, if seven out of eight identical GPUs are recognized, the issue is hardware or low-level firmware related, not a lack of kernel-level driver support.
C. Swap the InfiniBand cables between the missing GPU port and a working GPU port to see if the identity of the missing GPU follows the cable. This is a red herring in the context of GPU visibility. InfiniBand cables are part of the network fabric used for inter-node communication; they do not control whether the local operating system "sees" the GPU on the PCIe bus. Swapping these cables would troubleshoot fabric connectivity, but it would have zero impact on a GPU that is missing from the local lspci or nvidia-smi output.
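A minimal sketch of the non-invasive first checks described above, assuming standard tooling (nvidia-smi, lspci, ipmitool) on the affected node; the grep patterns are illustrative:

nvidia-smi -L                               # list the GPUs the driver currently sees
lspci -d 10de: | wc -l                      # count NVIDIA functions on the PCIe bus (vendor ID 10de; includes NVSwitch and bridge devices)
sudo dmesg | grep -iE "nvrm|xid"            # look for driver initialization errors or Xid events
sudo ipmitool sel elist | tail -n 50        # recent BMC System Event Log entries (power, thermal, PCIe alerts)

If the missing GPU is absent from lspci entirely, the fault is below the driver (power delivery, seating, or PCIe training), so the BMC log and physical inspection come next; if the device is on the bus but absent from nvidia-smi, the driver or firmware layer is the place to dig.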
SkillCertPro wishes you all the best for your exam.