Your results for "NVIDIA NCP-AII Practice Test 2"
0 of 60 questions answered correctly
Your Final Score is: 0
You have attempted: 0
Number of Correct Questions: 0 and scored 0
Number of Incorrect Questions: 0 and Negative marks 0
You can review your answers by clicking the "View Answers" option. Important note: open reference documentation links in a new tab (right-click and select "Open in New Tab").
Question 1 of 60
1. Question
To facilitate seamless workload orchestration, an administrator is configuring a Slurm cluster with Enroot and Pyxis. What is the specific purpose of the Pyxis plugin in this NVIDIA-based AI infrastructure stack?
Correct: A. Pyxis is a Slurm plugin that allows users to run unprivileged containers using the Enroot runtime via standard Slurm job scripts.
The Technical Reason: While Enroot handles the heavy lifting of turning container images into sandboxed filesystems, it lacks native integration with a cluster scheduler. Pyxis is a Slurm SPANK (Slurm Plug-in Architecture for Node and job Kontrol) plugin that bridges this gap.
• Native CLI Integration: Pyxis adds container-specific flags to the srun and sbatch commands, such as --container-image, --container-mounts, and --container-workdir.
• Automation: When a user submits a job with a container flag, Pyxis automatically calls Enroot on the allocated compute nodes to pull (if necessary), create, and start the container environment before the job script executes.
• Unprivileged Security: It maintains the security model of the NVIDIA AI stack by ensuring containers run under the user's own credentials, without requiring a root daemon.
The NCP-AII Context: The certification validates that you can deploy this "golden trio" (Slurm + Pyxis + Enroot). This setup is the standard for NVIDIA DGX BasePOD and SuperPOD architectures because it allows researchers to use Docker-like workflows with the performance and scale of HPC scheduling.
Incorrect: B. Kernel module for InfiniBand encryption: Pyxis is a user-space plugin for Slurm, not a kernel module. High-speed network encryption (if used) is typically handled by BlueField-3 DPU hardware or by drivers in the NVIDIA DOCA or MLNX_OFED stack, not by a scheduler plugin.
C. Power distribution management for GPUs: Power management is handled by the Baseboard Management Controller (BMC) and the NVIDIA Management Library (NVML). While Slurm can track GPU power usage for accounting purposes (via the acct_gather_energy plugin), Pyxis has no role in the electrical control or power-cycling of GPU hardware.
D. Storage driver for automatic model downloading: Pyxis is not a storage driver. Although it facilitates mounting filesystems into containers, it does not manage storage protocols or provide automatic connectivity to the NGC registry. The NGC CLI is the primary tool for downloading models and datasets, typically run as a task within a job or manually by the researcher.
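The flags described above can be sketched in a job submission. A minimal example, assuming a Pyxis/Enroot-enabled cluster; the NGC image tag and mount paths are illustrative, not from the source:

```shell
# Interactive sanity check: run nvidia-smi inside an NGC container
# (Pyxis uses '#' to separate the registry host from the image path).
srun --container-image=nvcr.io#nvidia/pytorch:24.01-py3 nvidia-smi

# Batch equivalent: the same Pyxis flags work as #SBATCH directives.
sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gpus=8
#SBATCH --container-image=nvcr.io#nvidia/pytorch:24.01-py3
#SBATCH --container-mounts=/data:/data
#SBATCH --container-workdir=/data
python train.py
EOF
```

Note that no `docker` daemon or root privilege is involved: Pyxis invokes Enroot on the compute nodes under the submitting user's credentials.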
Question 2 of 60
2. Question
A system administrator is deploying a cluster of NVIDIA HGX H100 systems and needs to perform the initial firmware baseline. During the Baseboard Management Controller (BMC) configuration, the administrator notices that the Out-of-Band (OOB) management network is reachable but the Trusted Platform Module (TPM) 2.0 is reporting a physical presence failure. Which sequence of events is most critical to validate the hardware operation before proceeding to the operating system installation and workload validation?
Correct: B. Update the BMC and BIOS firmware to the latest versions using the Redfish API, then reset the TPM state in the BIOS security menu and verify cooling parameters.
The Technical Reason:
• Firmware Hierarchy: NVIDIA HGX systems rely on a strictly validated recipe of firmware. The BMC and BIOS must be aligned to ensure that the complex communication between the H100 GPUs and the host CPU is stable. Using the Redfish API is the professional standard for automating these updates across a server farm.
• TPM Physical Presence: A "physical presence failure" in TPM 2.0 often occurs when the security state is out of sync with the current BIOS version or after a hardware change. Resetting (or clearing) the TPM state in a secure BIOS environment is the required step to re-establish the hardware Root of Trust.
• Thermal Readiness: Since HGX H100 systems have massive power draws (up to 700 W+ per GPU), validating cooling parameters before OS installation is critical to prevent thermal throttling or emergency shutdowns during the first boot.
The NCP-AII Context: The exam blueprint requires candidates to "perform initial configuration of BMC, OOB, and TPM" and "perform firmware upgrades (including on HGX)." Option B follows the logical sequence: standardize the management plane → resolve the hardware security state → validate environmental safety.
Incorrect: A. Disconnect power for ten minutes and re-seat the baseboard: While clearing CMOS is a general troubleshooting step, it is not a "baseline" procedure for a professional deployment. Re-seating an HGX baseboard is a high-risk physical operation that should only be performed for a confirmed mechanical failure; it does not address firmware versioning or TPM logic errors.
C. Ignore the TPM error and install the Container Toolkit: This is a critical failure in the context of the NCP-AII exam. The certification emphasizes validation before operation. Proceeding with a TPM failure compromises the security integrity of the cluster (Measured Boot / Confidential Computing) and ignores a hardware health flag that could indicate deeper motherboard issues.
D. Factory reset the BlueField-3 and re-cable the fabric: This targets the wrong layer. A TPM error on the host motherboard is an internal server-management issue. The BlueField-3 DPU and the InfiniBand fabric belong to the networking plane; re-cabling the fabric will not resolve a local BIOS/TPM presence error on the server's management controller.
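The Redfish-based firmware workflow can be sketched with plain curl. This is a hedged outline only: the BMC hostname, credentials, firmware image URI, and the exact resource paths are assumptions, and real paths vary by vendor (check your BMC's Redfish schema):

```shell
# Assumed BMC address and credentials; replace with real values.
BMC=https://bmc.example.com

# 1. Inspect the current firmware inventory on the BMC.
curl -sk -u admin:password \
  "$BMC/redfish/v1/UpdateService/FirmwareInventory"

# 2. Trigger a SimpleUpdate pointing at a staged firmware image
#    (the ImageURI and package name are hypothetical).
curl -sk -u admin:password -X POST \
  -H "Content-Type: application/json" \
  "$BMC/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate" \
  -d '{"ImageURI": "http://fileserver/firmware/bmc_latest.fwpkg"}'

# 3. After the update task completes, restart the BMC so the new
#    firmware takes effect (manager ID "1" is an assumption).
curl -sk -u admin:password -X POST \
  "$BMC/redfish/v1/Managers/1/Actions/Manager.Reset" \
  -d '{"ResetType": "GracefulRestart"}'
```

Scripting this against the Redfish API, rather than clicking through a web UI, is what makes the baseline repeatable across a whole server farm.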
Question 3 of 60
3. Question
An administrator is troubleshooting a Multi-Instance GPU (MIG) configuration on an NVIDIA A100. They notice that they cannot create a new instance even though there is available GPU memory. What is the most likely architectural reason for this limitation according to the physical layer management of MIG?
Correct: D. The available memory is fragmented, and the GPU requires a contiguous block of memory and compute 'slices' to form a valid instance profile.
The Technical Reason: MIG technology on the Ampere (A100) and Hopper (H100) architectures partitions the GPU into fixed physical "slices" of memory and compute.
• Contiguous Allocation: To create a GPU Instance (GI), the hardware must be able to allocate a contiguous set of memory slices and a specific set of compute slices that align with the predefined profiles (e.g., 3g.20gb).
• Fragmentation: If you have created and deleted various instances, the "free" slices may be scattered (e.g., slice 1 is free, slice 2 is busy, slice 3 is free). Even if the total free memory equals 20 GB, a new 20 GB profile cannot be formed unless those slices are physically adjacent in the hardware's partitioning map.
The NCP-AII Context: The exam validates your ability to troubleshoot why a configuration that "looks right" on paper fails in practice. Fragmentation is a common real-world issue in multi-tenant environments where workloads are dynamic.
Incorrect Options: A. GPU is in 'Persistence Mode': Persistence Mode is a software setting that keeps the NVIDIA driver loaded even when no applications are using the GPU. It does not block the creation of MIG instances; in fact, Persistence Mode is often recommended in production to reduce the initialization latency of new CUDA contexts.
B. Exceeding the maximum CUDA cores allowed by Linux: The Linux operating system does not impose an arbitrary "cap" on CUDA cores. The number of CUDA cores available to an instance is determined by the hardware profile selected (e.g., a 1g instance gets 1/7th of the available SMs). If a profile is valid, Linux will recognize the cores provided by the driver.
C. TPM 2.0 module has locked the GPU memory: While TPM 2.0 and NVIDIA Confidential Computing (on Hopper H100) are used for security and attestation, they do not dynamically "lock" memory to prevent the creation of new MIG partitions. MIG's internal hardware isolation (memory protection units) handles instance security, not the system's TPM module.
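Diagnosing this on a live A100 can be sketched with nvidia-smi's MIG subcommands. A minimal sketch, assuming MIG mode is already enabled on GPU 0; profile ID 9 (3g.20gb) is the standard value on an A100-40GB, but verify it on your hardware:

```shell
# List instance profiles and how many of each can still be created.
nvidia-smi mig -lgip

# List the possible placements (start-slice:size) for each profile.
# If a 3g.20gb profile shows no valid placement despite enough total
# free memory, the free slices are non-contiguous: fragmentation.
nvidia-smi mig -lgipp

# Remedy: tear down Compute Instances, then GPU Instances, to compact
# the slice map, and re-create the desired layout in one pass.
nvidia-smi mig -dci && nvidia-smi mig -dgi
nvidia-smi mig -cgi 9,9 -C    # two 3g.20gb GIs with default CIs
```

Because placements are fixed in hardware, planning the full instance layout up front (rather than creating and deleting ad hoc) is the practical way to avoid this in multi-tenant clusters.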
Question 4 of 60
4. Question
During the configuration of a BlueField-3 network platform, an administrator encounters an issue where the DPU is not visible to the host system. After checking the physical seating of the card, what is the next logical step in the physical layer management to ensure the DPU is initialized correctly?
Correct: D. Check the PCIe bifurcation settings in the system BIOS to ensure that the slot is configured to supply the lanes required by the DPU hardware.
The Technical Reason: The NVIDIA BlueField-3 DPU is a high-performance system-on-a-chip that often exposes multiple PCIe endpoints (e.g., for the network controller, Arm cores, and storage offloads).
• Bifurcation: Many servers default to a single x16 lane configuration for a physical slot. However, high-density devices like the BlueField-3 or multi-GPU cards often require the motherboard to "bifurcate," or split, those lanes (e.g., into x8/x8 or x4/x4/x4/x4).
• Detection Failure: If the BIOS is not set to the correct bifurcation mode, the hardware handshake fails at the POST (Power-On Self-Test) level, and the device will not appear in lspci or the OS, regardless of drivers or seating.
The NCP-AII Context: The exam expects you to follow a logical troubleshooting hierarchy: physical seating → BIOS/PCIe configuration → firmware → driver. Bifurcation is a common "day zero" misconfiguration in advanced AI infrastructure.
Incorrect: A. Decrease the server power supply wattage: Decreasing power supply wattage is counterproductive and dangerous. High-performance components like the BlueField-3 (which can draw 150 W or more) require more stable power, not less. Reducing wattage would likely cause system instability or prevent the DPU from powering on at all, rather than resolving "electrical noise."
B. Replace InfiniBand cables with standard Ethernet: While the BlueField-3 supports both InfiniBand and Ethernet (VPI), the type of network cable plugged into the external ports has no impact on whether the host can see the DPU over the internal PCIe bus. Link-level cables are for external networking; PCIe detection is an internal motherboard function.
C. Reinstall the host operating system: Reinstalling the OS is an extreme last resort and would not help if the hardware is invisible at the BIOS level. If lspci (a low-level bus-scanning tool) cannot see the device, the issue is hardware- or firmware-related; reinstalling software cannot resolve a failure in the initial PCIe link training.
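After adjusting the BIOS bifurcation setting, the host-side check can be sketched as follows. A minimal sketch: 15b3 is the Mellanox/NVIDIA networking PCI vendor ID, and the expected link width depends on your slot wiring:

```shell
# Look for the DPU on the PCIe bus; a BlueField-3 typically enumerates
# as several functions (network controller plus management devices).
lspci -d 15b3: -nn

# If nothing appears, force a bus rescan in case of a late link-up.
echo 1 | sudo tee /sys/bus/pci/rescan

# If the device shows up, confirm the negotiated link width and speed
# match the slot's capability (e.g. x16) rather than a degraded x8/x4.
sudo lspci -d 15b3: -vv | grep -E 'LnkCap|LnkSta'
```

A device that is absent here but present after a bifurcation change confirms the failure was at PCIe link training, not in any driver.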
Question 5 of 60
5. Question
When configuring a Multi-Instance GPU environment, why is it necessary to understand the difference between GPU Instances and Compute Instances? Which statement correctly describes the hierarchy and relationship between these two MIG components?
Correct: D. A GPU Instance defines the memory and cache allocation, while a Compute Instance is created within it to define the actual compute resources available.
The Technical Reason: The MIG architecture follows a strict parent-child hierarchy:
• GPU Instance (GI): This is the primary partition. When you create a GI, you carve out a dedicated portion of the hardware's physical resources, including memory capacity (VRAM), memory bandwidth, and L2 cache banks. This provides the highest level of isolation, ensuring "noisy neighbors" cannot impact memory performance.
• Compute Instance (CI): This is a subdivision inside a GPU Instance. It defines the number of Streaming Multiprocessors (SMs), the units that actually execute the math, allocated to a workload. While multiple CIs can exist within one GI, they all share the memory and cache allocated to the parent GI.
The NCP-AII Context: The exam tests your knowledge of this spatial partitioning. You must understand that a user cannot have a Compute Instance without first having a parent GPU Instance, as the GI provides the memory foundation the compute units need to function.
Incorrect: A. The terms are used interchangeably: In the nvidia-smi tool, these are distinct objects with their own IDs. Using them interchangeably would lead to configuration errors. A GPU Instance ID refers to the memory-backed partition, while a Compute Instance ID refers to the execution context within that partition.
B. Compute Instances are created first: This is the reverse of the actual hardware workflow. You cannot allocate SMs (compute) without first defining which block of high-bandwidth memory (the GPU Instance) they will be attached to. The GI provides the "container" of memory and cache that the CIs use.
C. GPU Instances for rendering / Compute Instances for training: Both components are used together for both training and inference workloads. MIG is primarily a feature of data center GPUs (A100, H100), which are typically used for compute-heavy tasks. The distinction is based on resource type (memory vs. compute), not on the AI application type.
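The GI-then-CI workflow described above can be sketched with the nvidia-smi MIG subcommands. This is a hedged sketch: the 3g.40gb profile name is an A100-80GB example, the GI ID placeholder is hypothetical, and available profiles vary by GPU model.

```shell
# Enable MIG mode on GPU 0 (requires a GPU reset; assumes A100/H100)
sudo nvidia-smi -i 0 -mig 1

# List the GPU Instance profiles this GPU supports
sudo nvidia-smi mig -lgip

# Step 1: create the parent GPU Instance (memory/cache partition).
# "3g.40gb" is an A100-80GB example profile; names differ per model.
sudo nvidia-smi mig -i 0 -cgi 3g.40gb

# Step 2: create a Compute Instance (SM allocation) inside that GI.
# <GI_ID> is a placeholder for the ID reported by the previous step.
sudo nvidia-smi mig -i 0 -gi <GI_ID> -cci

# Verify the hierarchy: GIs first, then the CIs nested inside them
sudo nvidia-smi mig -lgi
sudo nvidia-smi mig -lci
```

Note that deletion runs in the opposite order: a CI must be destroyed (-dci) before its parent GI (-dgi), reflecting the same parent-child dependency.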
Question 6 of 60
6. Question
A DevOps engineer is configuring a cluster with Slurm as the job scheduler. To support containerized workloads, the engineer must install the Pyxis and Enroot plugins. What is the specific role of Enroot in this control plane configuration, and how does it differ from a traditional Docker-based workflow?
Correct
Correct: C Enroot is a tool that turns container images into unprivileged sandboxes, allowing Slurm to run them as regular processes without needing a root-level daemon.
The Technical Reason: Traditional container runtimes like Docker rely on a central, root-owned daemon, which poses security risks in multi-tenant HPC environments and adds overhead.
Unprivileged Execution: Enroot (an NVIDIA-developed tool) bypasses the daemon-based architecture. It imports traditional Docker/NGC images and converts them into a simple SquashFS filesystem.
Process-Based: When a job is submitted, Enroot runs the container as a standard Linux process under the user's own UID. This allows Slurm to manage the container's lifecycle exactly like any other binary task, using standard Linux namespaces for isolation.
Pyxis Integration: Pyxis is a Slurm plugin (SPANK) that acts as the "glue." It lets users pass the srun --container-image flag, which internally triggers Enroot to handle the container setup on the compute nodes.
The NCP-AII Context: The certification validates your ability to “Deploy Slurm with Enroot/Pyxis and submit a test job.“ This configuration is the gold standard for NVIDIA DGX systems and SuperPODs because it ensures security, performance, and seamless GPU/InfiniBand pass-through.
Incorrect Options: A. Network virtualization layer for InfiniBand. Enroot is a container runtime, not a networking layer. While it includes "hooks" to ensure the container can access the host's InfiniBand and GPU resources (via the NVIDIA Container Toolkit), it does not virtualize or bypass the network stack. Networking is still handled by the underlying host drivers (e.g., MOFED/DOCA).
B. Primary scheduler that replaces Slurm. Enroot has no scheduling capabilities. It is a local utility installed on compute nodes to launch container filesystems. Slurm remains the primary authority for resource allocation, job queuing, and determining which GPUs are assigned to which task.
D. Graphical user interface for the NGC CLI. Enroot is a command-line tool and a set of system libraries; it provides no web-based or graphical user interface. The NGC CLI is used to pull images from the registry, while Enroot is used to execute those images. Managing the Slurm queue via a web browser would typically involve a portal like NVIDIA Base Command Manager or Open OnDemand, not Enroot.
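From the user's perspective, the Enroot/Pyxis workflow described above can be sketched as follows. The container tag, mount paths, and script name are illustrative assumptions, not taken from the question.

```shell
# One-off: import an NGC image into a local SquashFS file with Enroot.
# (nvcr.io#nvidia/pytorch:24.05-py3 is an illustrative tag.)
enroot import docker://nvcr.io#nvidia/pytorch:24.05-py3

# With Pyxis installed, srun gains container flags; Enroot is invoked
# on the allocated nodes to unpack and start the container under the
# submitting user's own UID -- no root-owned daemon involved.
srun --nodes=1 --gpus=8 \
     --container-image=nvcr.io#nvidia/pytorch:24.05-py3 \
     --container-mounts=/data:/data \
     --container-workdir=/data \
     python train.py
```

Because the container runs as an ordinary Slurm task, accounting, cgroup limits, and GPU assignment all work exactly as they would for a bare-metal job.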
Question 7 of 60
7. Question
To enable seamless GPU-accelerated container execution, the administrator must install the NVIDIA Container Toolkit. Which of the following is a primary function of the NVIDIA Container Toolkit when integrated with Docker on an AI compute node?
Correct
Correct: C. The NVIDIA Container Toolkit's core purpose is to enable containers to access and utilize NVIDIA GPUs. It achieves this by providing a container runtime library that integrates with the Docker engine. This integration allows Docker to interface with the host's NVIDIA drivers. When a container is started with the --gpus flag, the toolkit's runtime hook (based on the Open Container Initiative specification) is triggered, which automatically injects the necessary GPU devices and CUDA libraries into the container. This makes GPU resources available to the applications running inside the container without any manual configuration.
Incorrect: A. It automatically recompiles the PyTorch source code within the container to optimize it for the specific CUDA core count of the host's H100 GPU. This is incorrect. The NVIDIA Container Toolkit does not modify or recompile application source code like PyTorch. Its role is at the system level, making the GPU hardware accessible to the container. The toolkit "provides a container runtime library" and "automatically configures containers to leverage NVIDIA GPUs," but it does not perform application-specific optimizations or recompilation.
B. It manages the power distribution to the GPUs by throttling the clock speeds when the Docker daemon detects a high number of context switches. This is incorrect. Power management and clock speed control are functions of the NVIDIA kernel-mode driver and the NVIDIA System Management Interface (nvidia-smi), not the Container Toolkit. The toolkit handles GPU device access configuration and CUDA library management for containers, not low-level hardware power states.
D. It acts as a virtual hypervisor that allows the BlueField-3 DPU to run multiple instances of the Ubuntu operating system within a single Docker container. This is incorrect. The NVIDIA Container Toolkit is not a hypervisor and does not run full operating system instances. It is specifically designed to run GPU-accelerated applications within containers. While the toolkit can be used alongside a DPU (Data Processing Unit) in a broader infrastructure context, the described function of running OS instances is fundamentally wrong. A single Docker container runs a single application in an isolated user space, not multiple OS instances.
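A minimal sketch of this integration on a compute node, assuming the toolkit package is already installed (the CUDA image tag is illustrative):

```shell
# Register the NVIDIA runtime with Docker
# (nvidia-ctk edits /etc/docker/daemon.json)
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# The --gpus flag triggers the toolkit's OCI runtime hook, which
# injects the GPU device nodes and host driver libraries into the
# container; nvidia-smi inside the container should then list the GPUs.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If the final command prints the same GPU table inside the container as on the host, the driver/toolkit/Docker chain is wired correctly.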
Question 8 of 60
8. Question
After the physical installation and software configuration of a new 32-node H100 cluster, the team must perform a validation of the InfiniBand fabric East-West bandwidth. Which tool is specifically designed to run the NVIDIA Collective Communications Library (NCCL) tests across all nodes to ensure the fabric is providing the expected multi-rail throughput, and what is the key metric to look for?
Correct
Correct: B. The nccl-tests suite (specifically p2p_single_dev or all_reduce_perf) should be run via srun to measure the BusBw (Bus Bandwidth), which should reflect the aggregate performance of all active links.
• The Technical Reason: While individual ib_write_bw tests check physical link speeds, they don't reflect how AI frameworks actually use the network.
◦ nccl-tests: This is the gold-standard suite for AI infrastructure. It measures how the Collective Communications Library (NCCL) performs over the fabric.
◦ BusBw (Bus Bandwidth): This is the key metric. Unlike AlgBw (Algorithm Bandwidth), which varies with the mathematical efficiency of the operation (such as All-Reduce or Broadcast), BusBw is a normalized metric that reflects the physical throughput being pushed across the PCIe and InfiniBand buses.
◦ Multi-Rail Performance: For a 400Gb/s (NDR) node with multiple NICs, the BusBw should reflect the combined bandwidth of all rails (roughly 50GB/s per 400Gb/s rail).
• The NCP-AII Context: The certification expects you to know that you must use a scheduler like Slurm with srun or mpirun to execute these tests across multiple nodes simultaneously, catching fabric-level bottlenecks that single-node tests would miss.
Incorrect: A. ClusterKit for fan speeds and airflow. ClusterKit is indeed an NVIDIA-provided multipurpose node assessment tool mentioned in the NCP-AII blueprint. However, its primary role in bandwidth validation is to automate performance tests (such as latency and bandwidth), not to monitor fan speeds. While thermal health is important, airflow is an environmental metric, not a networking metric, and cannot indicate InfiniBand signal integrity.
C. iperf3 for 10Gb/s TCP throughput. iperf3 is a standard networking tool, but it measures TCP/IP performance. Modern AI clusters use RDMA (Remote Direct Memory Access) over InfiniBand or RoCE to bypass the kernel. Testing via TCP/IP is irrelevant for AI training performance because it doesn't exercise the hardware's acceleration capabilities. Furthermore, 10Gb/s is far below the 400Gb/s expected in an H100 "AI factory."
D. stress-ng for CPU cache latency. NCCL is specifically designed to bypass the host CPU as much as possible using GPUDirect RDMA. While CPU latency matters for some application logic, the massive data exchanges in AI training are limited by InfiniBand/NVLink bandwidth, not the L3 cache latency of the x86 processor.
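A hedged sketch of such a multi-node validation run (the nccl-tests build path and Slurm flags are illustrative; exact invocation depends on the site's Slurm/MPI setup):

```shell
# Multi-node NCCL All-Reduce across the full 32-node cluster via Slurm:
# one task per GPU (-g 1) is the usual pattern for MPI-launched runs.
srun --nodes=32 --ntasks-per-node=8 --gpus-per-node=8 \
     ./nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1

# In the output table, read the "busbw" column (not "algbw") at the
# largest message sizes and compare it against the expected aggregate:
# roughly 50 GB/s per 400 Gb/s NDR rail, times the number of rails.
```

A busbw plateau well below the rail-count expectation at large message sizes is the signature of a fabric-level bottleneck (a degraded link, bad routing, or a misconfigured rail) rather than a software problem.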
Question 9 of 60
9. Question
An administrator is using the NVIDIA ClusterKit to perform a multifaceted node assessment on a newly deployed AI factory. The tool reports a failure in the NCCL All-Reduce test on one specific node within the 32-node cluster. What should be the next logical step in the verification process to isolate the fault between the physical layer and the software stack?
Correct
Correct: C Run a single-node NCCL test on the failing node to determine if the issue is with the internal NVLink fabric or the external InfiniBand E/W fabric.
The Technical Reason: NCCL (NVIDIA Collective Communications Library) uses two distinct physical fabrics: NVLink for intra-node (GPU-to-GPU within the same server) and InfiniBand/RoCE for inter-node (server-to-server) communication.
Isolating the Fabric: By running a single-node test (e.g., all_reduce_perf -b 8 -e 128M -f 2 -g 8), you force NCCL to use only the internal NVLink fabric.
The Diagnostic Path:
◦ If the single-node test fails or shows low bandwidth, the fault lies with the NVLink bridges, NVSwitch, or GPU placement inside that specific chassis.
◦ If the single-node test passes, the issue is likely with the external InfiniBand links (cabling, transceivers, or leaf switches) connecting that node to the rest of the cluster.
The NCP-AII Context: The certification emphasizes using ClusterKit to automate these specific granular tests. Identifying whether the bottleneck is "local" (NVLink) or "global" (InfiniBand) is a mandatory step before escalating to hardware replacement.
Incorrect Options: A. Update NGC CLI and re-pull the NeMo container. While container corruption is possible, it is highly unlikely to cause a specific NCCL All-Reduce failure on just one node when the other 31 nodes run the same image successfully. Furthermore, NeMo burn-in is a heavy application-level test; at this stage of troubleshooting, the administrator needs a low-level primitive test (such as nccl-tests) to isolate the hardware fabric failure, not a high-level training workload.
B. Exclude the node and check for a 10x bandwidth increase. Excluding the failing node is a temporary workaround to keep the cluster running, but it neither isolates the fault nor fixes the node. Additionally, excluding one node in a 32-node cluster will not produce a "factor of ten" bandwidth increase; bandwidth is limited by the slowest link in a collective operation, and the theoretical gain from removing one node is marginal compared to the aggregate bandwidth of the remaining 31 nodes.
D. Immediately replace motherboards and adjacent nodes. Replacing motherboards is a drastic, expensive, and time-consuming measure that should only follow a definitive hardware diagnosis. There is no technical basis for replacing "adjacent nodes" to solve an isolated NCCL error on one node; modern AI servers are shielded against the kind of electromagnetic interference (EMI) that would affect neighbors this way.
Question 10 of 60
10. Question
An administrator is preparing to deploy a cluster of NVIDIA HGX H100 systems within a high-density AI data center environment. During the initial bring-up and validation phase, which specific sequence of actions must be performed to ensure the hardware is operating within thermal and electrical safety margins before initiating full-scale workload testing?
Correct
Correct: D Validate power and cooling parameters via the BMC, perform firmware upgrades on the HGX baseboard, and use nvidia-smi to verify hardware health.
The Technical Reason: This sequence follows the NVIDIA Validated Recipe for data center deployment:
Step 1: BMC Validation: Before the system is even fully powered into an OS, the Baseboard Management Controller (BMC) is used to monitor “pre-flight” vitals. This includes checking that the high-wattage power supplies (PSUs) are redundant and that the cooling fans are responding to PWM (Pulse Width Modulation) commands to prevent immediate thermal shutdown of 700W GPUs.
Step 2: Firmware Baseline: HGX systems require a tightly synchronized set of firmware across the GPUs, NVSwitches, and the baseboard (CPLD/FPGA). Mismatched firmware can lead to PCIe training failures or erratic power behavior.
Step 3: nvidia-smi Health Check: Once the OS is up, nvidia-smi is the primary tool to confirm the hardware’s logical state—verifying that all 8 GPUs are visible, power limits are set correctly, and no XID errors are present.
The NCP-AII Context: The exam blueprint specifically highlights the sequence of “Perform initial configuration of BMC… Perform firmware upgrades (including on HGX™)… Validate power and cooling parameters.” Option D is the only answer that aligns with this professional workflow.
Incorrect Options: A. Run NeMo burn-in test to identify cooling leaks NeMo burn-in is a high-level application test for training performance. While it places a heavy load on the system, it is performed in the Cluster Test and Verification phase (33%), long after the initial bring-up. Using an AI workload to find “cooling leaks” is unsafe; hardware vitals must be confirmed at the sensor level (BMC) before such a heavy load is applied.
B. Flash UEFI to default and verify transceivers via BCM While BCM (Base Command Manager) can monitor transceivers, “flashing UEFI to default” is not a standard bring-up task and may actually disable critical performance settings (like Above 4G Decoding or SR-IOV). The primary task during bring-up is upgrading to a validated firmware version, not simply resetting to factory defaults.
C. Install OS, enable MIG, and then check BMC logs This sequence is flawed because it places software configuration (MIG) before hardware validation. If the power delivery system has a fault, the system might crash during OS installation or driver loading. MIG (Multi-Instance GPU) configuration is a specialized optimization step, not a safety validation step.
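The three bring-up steps above can be sketched as shell commands. This is a hedged illustration: the BMC address and credentials are placeholders, IPMI sensor names vary by vendor, and the firmware step uses OEM-specific tooling that is not shown.

```shell
# Step 1: pre-flight vitals via the BMC, before any OS-level load.
# (BMC IP, user, and password below are placeholders.)
ipmitool -I lanplus -H 10.0.0.10 -U admin -P <password> sdr type Fan
ipmitool -I lanplus -H 10.0.0.10 -U admin -P <password> sdr type "Power Supply"

# Step 2: apply the validated firmware bundle (BMC/BIOS/HGX) with the
# OEM's update utility — tooling differs per vendor, so not shown here.

# Step 3: once the OS is up, confirm the logical state of all 8 GPUs.
nvidia-smi --query-gpu=index,name,power.limit,temperature.gpu --format=csv
dmesg | grep -i xid   # any XID entries flag GPU errors to investigate
```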
Question 11 of 60
11. Question
A cluster administrator is performing a single-node stress test as part of the verification process. The goal is to ensure the node can handle peak loads without thermal throttling or power failure. Which tool is commonly used to execute a High-Performance Linpack (HPL) test, and what does a successful HPL result indicate about the node’s health?
Correct
Correct: B An HPL-optimized container is used to stress the GPUs and CPUs, and a stable result indicates that the system’s power and cooling are sufficient for sustained compute.
The Technical Reason: HPL solves a dense system of linear equations, which is extremely computationally intensive.
Maximum Stress: It is designed to push the GPUs and CPUs to their theoretical peak floating-point performance (R_peak). This generates maximum heat and draws the highest possible power from the system’s power supply units (PSUs).
Stability Validation: A “successful” HPL run (achieving a high R_max without crashing or throttling) confirms that the thermal solution (fans, heatsinks, airflow) can dissipate the heat and that the power delivery can sustain the load. If a node has a faulty fan or an underpowered PSU, HPL will trigger a thermal shutdown or a power trip.
The NCP-AII Context: The certification emphasizes using NVIDIA NGC™ containers for validation. You would typically pull the HPL container from NGC to ensure you are using a version optimized for the specific architecture (e.g., Hopper or Ampere).
Incorrect: A. NVIDIA Container Toolkit to configure a web server While the NVIDIA Container Toolkit is used to run the HPL container, HPL itself has nothing to do with web servers. Its purpose is raw mathematical computation. Configuring an inference web server (like Triton) is a separate deployment task that happens much later in the lifecycle.
C. Linux ‘stress’ command to check filesystem sectors The standard Linux stress or stress-ng commands are generic CPU/Memory stressors. They do not utilize the GPU’s Tensor Cores and therefore cannot validate the power and cooling requirements of an NVIDIA-certified AI server. Furthermore, checking for bad filesystem sectors is a storage integrity task handled by tools like fsck or vendor-specific SSD diagnostics, not a compute stress test.
D. ‘ping’ utility for network latency The ping utility measures simple ICMP reachability and network latency. It does not stress the node’s internal compute components (CPU/GPU) and provides no information about a node’s thermal or power stability. Network validation for AI clusters typically uses NCCL-tests or ib_write_bw, not standard pings.
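A containerized HPL run of the kind described above can be sketched as follows. The image tag, mount path, and launch script are illustrative placeholders — check the NGC catalog entry for the current image and invocation before use.

```shell
# Launch HPL from an NGC benchmark container across all 8 local GPUs.
# (Image tag and HPL.dat location are placeholders.)
docker run --gpus all --rm \
  -v /path/to/hpl-config:/config \
  nvcr.io/nvidia/hpc-benchmarks:<tag> \
  mpirun -np 8 hpl.sh --dat /config/HPL.dat

# In a second shell, watch for clock throttling or power capping while
# the run is in flight (sampled every 5 seconds):
nvidia-smi --query-gpu=clocks.sm,power.draw,temperature.gpu --format=csv -l 5
```

A stable run to completion at the expected R_max, with no thermal or power throttling visible in the nvidia-smi samples, is the "pass" condition the explanation describes.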
Question 12 of 60
12. Question
A technician is installing an NVIDIA HGX baseboard into a new server chassis. After the physical installation, the technician must perform a firmware upgrade on the HGX components and the integrated BlueField-3 DPUs. What is the recommended sequence of events to ensure that the firmware updates are applied correctly and that the hardware is validated for initial workload testing?
Correct
Correct: B Update the BMC and BIOS first, then the HGX baseboard firmware, followed by the BlueField-3 DPU firmware, and finally validate with a cold reboot.
The Technical Reason: This follows the “Outside-In” management principle for NVIDIA-Certified Systems:
BMC & BIOS: The Baseboard Management Controller (BMC) and BIOS form the foundation of the server. They must be updated first to ensure the system can properly communicate with and power the high-performance accelerators via the PCIe and I2C buses.
HGX Baseboard: This includes the GPU VBIOS, NVSwitch firmware, and complex CPLDs. These updates require the host BIOS to be at a compatible version to manage the complex PCIe training.
BlueField-3 DPU: DPUs are treated as independent subsystems (servers-on-a-card). Updating them last ensures the host environment is stable enough to handle the DPU’s initialization and the potential re-enumeration of the PCIe bus.
Cold Reboot: A cold reboot (AC power cycle) is often mandatory for the BlueField-3 and HGX components. It ensures that the persistent logic in the hardware’s ERoT (Electronic Root of Trust) and FPGA components is fully reset and reloaded with the new image.
The NCP-AII Context: The certification validates your ability to follow the NVIDIA Validated Recipe. The exam specifically tests for the understanding that software (drivers/OS) should not be the primary focus until the hardware‘s firmware baseline is established and verified through a clean power cycle.
Incorrect Options: A. Use NGC CLI to download firmware into GPU memory The NGC CLI is used for pulling container images and datasets, not for system firmware updates. Furthermore, firmware is written to non-volatile flash memory (EEPROMs), not to “GPU memory” (VRAM). Executing an HPL stress test before validating the firmware baseline is a violation of safety protocols, as unvalidated firmware can lead to thermal or electrical instability under load.
C. Flash BlueField firmware using NVIDIA SMI in a MIG-enabled state NVIDIA SMI (nvidia-smi) is a tool for managing GPUs; it is not the primary tool for flashing BlueField DPU firmware (which typically uses mlxfwmanager, mstflint, or the bf.cfg bundle). Additionally, MIG (Multi-Instance GPU) is a software partitioning feature that has no relevance to the physical firmware flashing process of a network adapter.
D. Bypass firmware validation and proceed to cable testing Bypassing firmware validation is the most common cause of deployment failure in AI infrastructure. Default factory settings are rarely aligned with the latest security patches or performance optimizations required for an H100 “AI Factory.” Cable signal testing (ibdiagnet) is a later-stage validation step that requires the firmware to be correctly initialized for the transceivers to function at peak speed.
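The ordered update flow can be sketched as below. Steps 1 and 2 use OEM-specific utilities, so those commands are placeholders only; the DPU step uses mlxfwmanager from NVIDIA's MFT tools, with an illustrative image filename.

```shell
# 1. BMC and BIOS first (vendor utility or Redfish — placeholder):
#    <oem-fw-tool> update --component bmc,bios --image <bundle>

# 2. HGX baseboard next (GPU VBIOS, NVSwitch, CPLD) via the OEM's HGX
#    firmware bundle — tooling is vendor-specific, not shown here.

# 3. BlueField-3 DPU last: query current firmware, then flash the image.
mlxfwmanager --query
mlxfwmanager -u -i <fw-bluefield3.bin>

# 4. After a full AC power cycle, confirm the new versions took effect:
nvidia-smi -q | grep -i vbios
mlxfwmanager --query
```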
Question 13 of 60
13. Question
An administrator needs to install the NGC CLI on all host nodes to allow researchers to pull optimized AI containers directly from the NVIDIA GPU Cloud. After downloading the binary, what is the mandatory next step to enable the CLI to interact with the private registries and resources on the NGC portal?
Correct
Correct: C Running the ‘ngc config set’ command with a valid API key
The Technical Reason: Simply downloading and placing the NGC CLI binary in your execution path (e.g., /usr/local/bin) is insufficient for access.
Interactive Setup: The ngc config set command initiates an interactive session. It prompts the administrator to provide an API Key, which is generated from the NVIDIA NGC portal under the “Setup” or “API Key” section.
Scope and Context: Beyond authentication, this command allows the administrator to define the default Organization, Team, and ACE (Accelerated Computing Environment). This configuration is stored in a local config file (typically ~/.ngc/config), which the CLI references for every subsequent pull or push operation.
The NCP-AII Context: The exam expects you to know that the API key is the primary credential for NGC. Without this step, the CLI cannot resolve private container repositories or access organization-specific models and datasets.
Incorrect Options: A. Restarting the physical server The NGC CLI is a standalone user-space application. It does not install kernel modules or drivers. While the NVIDIA GPU Drivers and NVIDIA Container Toolkit might require service restarts or reboots during their own installation, the NGC CLI is strictly for catalog interaction and requires no system-level initialization.
B. Compiling the NGC source code NVIDIA provides the NGC CLI as a pre-compiled, platform-specific binary (for Linux, Windows, and macOS). There is no requirement (or public repository) for administrators to compile it from source as part of the standard deployment workflow.
D. Configuring a static IP for the management interface The NGC CLI communicates with NVIDIA’s cloud services over standard HTTPS (Port 443). It does not require a specialized “management interface” or a static IP address. As long as the host has outbound internet connectivity and valid DNS resolution, the CLI will function.
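The first-time setup described above looks like the following in practice. The key, org, and team values are placeholders; the API key is generated from the NGC portal beforehand.

```shell
# Interactive configuration — prompts for the API key, output format,
# org, team, and ACE, then writes ~/.ngc/config.
ngc config set
#   Enter API key: <paste key from the NGC portal>
#   Enter CLI output format type [ascii]:
#   Enter org: <your-org>
#   Enter team: <your-team>

# Verify the stored configuration and confirm registry access:
ngc config current
ngc registry image list --org <your-org>
```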
Question 14 of 60
14. Question
After installing the NVIDIA Container Toolkit on a cluster node, an engineer wants to verify that Docker can correctly access the GPUs. Which command should the engineer run to verify the end-to-end integration of the Docker runtime, the NVIDIA driver, and the physical hardware?
Correct
Correct: B The command docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi is designed to test the entire GPU-accelerated container stack end-to-end. Here is the step-by-step breakdown of why this command is correct:
docker run: This invokes the Docker engine, which is the first component being tested.
--gpus all: This flag instructs Docker to request access to all available GPUs on the host system. For this to work, the Docker engine must be properly configured to use the NVIDIA Container Toolkit runtime.
nvidia/cuda:12.0-base: This specifies a publicly available container image from NVIDIA that includes the CUDA libraries and the nvidia-smi tool. Using this image verifies that the container can pull from a registry and that the base software is present.
nvidia-smi: This is the command executed inside the container. It is the NVIDIA System Management Interface tool. For this command to run successfully and display the GPU information, several things must happen correctly:
The NVIDIA Container Toolkit must be installed and its runtime must be integrated with Docker.
The Docker engine must be able to call this runtime to mount the necessary GPU devices and libraries into the container.
The NVIDIA driver on the host must be functioning correctly, as nvidia-smi communicates directly with the driver.
If the command succeeds and returns the GPU details (like driver version and GPU utilization), it confirms that all components—Docker, NVIDIA Container Toolkit, driver, and hardware—are working together seamlessly.
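For the --gpus flag to work, Docker must know about the NVIDIA runtime, which `nvidia-ctk runtime configure --runtime=docker` registers in /etc/docker/daemon.json. A sketch of the resulting entry follows, written to a demo path so nothing on the host is modified; treat the exact file contents as an approximation of what current toolkit versions generate.

```shell
# Approximate contents of /etc/docker/daemon.json after running
# `sudo nvidia-ctk runtime configure --runtime=docker` (written to a demo
# file here so the real daemon config is left alone).
cat > ./daemon.json.demo <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

# Sanity checks: the file is valid JSON and the nvidia runtime is registered.
python3 -m json.tool ./daemon.json.demo > /dev/null && echo "valid JSON"
grep -q '"nvidia"' ./daemon.json.demo && echo "nvidia runtime present"
```

On a real node, Docker must be restarted after this change before the verification command will succeed.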
Incorrect:
A. apt-get install nvidia-docker-runtime-check: This is incorrect. There is no standard Linux package or command named nvidia-docker-runtime-check. The NVIDIA Container Toolkit is installed using packages like nvidia-container-toolkit. This option represents a non-existent verification tool.
C. systemctl status nvidia-container-toolkit --full: This is incorrect. While systemctl status is a valid command to check the state of a system service, the nvidia-container-toolkit itself does not run as a persistent background service (daemon). The toolkit consists of binaries, configuration files, and a runtime library that Docker calls upon. Therefore, this command would not return a meaningful status about the toolkit‘s integration or its ability to provide GPUs to a container. It does not test the end-to-end functionality.
D. nvlink --verify-docker-integration --node local: This is incorrect. nvlink is a real technology and a subcommand of nvidia-smi used to query the high-speed interconnects between GPUs. However, the options --verify-docker-integration and --node are not valid arguments for nvlink or any standard NVIDIA tool. This command is fabricated and does not perform any verification of Docker integration.
Question 15 of 60
15. Question
An IT architect is deploying NVIDIA Base Command Manager (BCM) to manage a new AI cluster. To ensure high availability (HA) of the management plane for uninterrupted cluster operations, which configuration must be implemented according to the BCM best practices during the installation phase?
Correct
Correct: B Set up a Primary and a Secondary head node with data synchronization and configure a virtual IP (VIP) for cluster management.
The Technical Reason: BCM achieves High Availability through a specialized failover mechanism:
Redundant Head Nodes: A cluster is configured with two dedicated servers: a Primary and a Secondary head node.
Data Synchronization: The cmdaemon service on the primary node continuously synchronizes the cluster database (MySQL), software images (/cm/shared), and configuration files to the secondary node.
Virtual IP (VIP): A floating management IP (VIP) is assigned to the active head node. If the primary node fails, the secondary node detects the failure via a heartbeat, takes over the VIP, and promotes itself to “Active“ status. This ensures that compute nodes and administrators always connect to the same IP address regardless of which physical server is currently in control.
The NCP-AII Context: The exam blueprint explicitly requires candidates to “Install Base Command™ Manager (BCM), configure and verify HA.“ This includes using the cmha-setup tool to initialize the failover pair and cloning the primary node to the secondary.
Incorrect Options: A. Single head node and daily tape backups While backups are part of a general disaster recovery strategy, they do not provide High Availability. In a single head node scenario, the management plane remains offline until a technician manually restores the system from backup, which violates the requirement for “uninterrupted cluster operations.“
C. BlueField DPU as independent head nodes While the BlueField DPU is a powerful “server-on-a-card“ used for offloading networking and security (part of the DOCA stack), it is not designed to run the full BCM head node software suite in a decentralized manner. BCM relies on a centralized (or HA-paired) head node to maintain the global state of the entire fabric and workload manager.
D. Containerized BCM across all compute nodes NVIDIA BCM is a specialized system orchestration layer that manages the lifecycle of compute nodes (including their OS images). Deploying the management layer on the nodes it is supposed to manage creates a circular dependency. BCM best practices dictate that the control plane remains on dedicated head nodes to ensure it can provision compute nodes even when they are in a “bare-metal“ or “down“ state.
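The failover behavior described above can be sketched as a toy shell model. This is purely illustrative: real BCM failover is driven by cmdaemon heartbeats and initialized with cmha-setup, and the VIP address here is a hypothetical value.

```shell
# Toy model of head-node failover: whichever node answers the heartbeat
# owns the VIP; if the primary goes silent, the secondary promotes itself.
VIP="10.0.0.100"          # hypothetical management VIP
primary_alive=false       # simulate a primary head node failure

if [ "$primary_alive" = true ]; then
  active="primary"
else
  active="secondary"      # heartbeat missed -> secondary takes over the VIP
fi

echo "VIP $VIP now served by the $active head node"
```

The key property the model shows is that compute nodes never need to know which physical server is active; they only ever talk to the VIP.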
Question 16 of 60
16. Question
An administrator is setting up the security and management parameters for an NVIDIA-Certified server. During the initialization of the Trusted Platform Module (TPM) 2.0 and the Out-of-Band (OOB) interface, which specific configuration ensures that the system can perform a measured boot to validate the integrity of the NVIDIA GPU drivers and the OS kernel before allowing AI workloads to execute?
Correct
Correct: C Enabling the TPM 2.0 in the UEFI settings and configuring the BIOS to perform a Measured Boot that records the software state in the Platform Configuration Registers (PCRs).
The Technical Reason: Measured Boot: Unlike Secure Boot (which simply blocks unsigned code), Measured Boot uses the TPM to “measure“ (hash) each component of the boot process (firmware, bootloader, kernel, and drivers).
PCRs (Platform Configuration Registers): These measurements are stored in the TPM‘s PCRs. Because these registers can only be “extended“ and not overwritten, they provide an immutable audit trail of the system‘s state.
Integrity Validation: Before an AI workload executes, an attestation service can check these PCR values to ensure that the NVIDIA GPU drivers or the OS kernel haven‘t been tampered with or replaced by malicious versions.
The NCP-AII Context: The certification validates that you can prepare a server for a “Zero-Trust“ environment. Configuring TPM 2.0 within the UEFI/BIOS is the first step in ensuring that the hardware-software handshake is cryptographically verified.
Incorrect Options: A. Disabling Secure Boot to allow custom encryption keys Disabling Secure Boot actually reduces the security posture of the node. Secure Boot and Measured Boot are complementary; Secure Boot ensures only signed code runs, while Measured Boot records what ran. Disabling it would make the system vulnerable to rootkits before the TPM can even record a measurement. Furthermore, TPMs do not require Secure Boot to be disabled to store encryption keys.
B. Setting the BMC to Shared Mode for key exchange Shared Mode (or Sideband) allows the BMC to use one of the standard host NIC ports for management. While this is a valid networking choice, it is not a security configuration for Measured Boot. Key exchange for boot integrity happens locally between the CPU, UEFI, and the TPM chip over the motherboard‘s bus (LPC or SPI), not over the network data fabric.
D. Configuring NVIDIA SMI to run in ‘Exclusive Mode‘ ‘Exclusive Mode‘ (specifically Compute Mode: Exclusive Process) is a setting in nvidia-smi that restricts a GPU so only one CUDA context can be active at a time. It is a resource management feature, not a security or encryption feature. nvidia-smi does not have the capability to “activate TPM encryption“ for GPU memory transactions; GPU memory encryption (where available, such as NVIDIA Confidential Computing) is handled at the hardware/firmware level, not by the SMI tool.
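The append-only nature of PCRs comes from the extend operation itself: new_PCR = SHA256(old_PCR || measurement). The toy shell version below hashes hex strings for readability; a real TPM extends raw binary digests, so treat this as a model of the semantics, not the wire format.

```shell
# Toy PCR extend: the register can only be folded forward, never overwritten.
pcr=$(printf '0%.0s' $(seq 1 64))   # start from an all-zero 64-hex-char PCR

extend() {   # new_pcr = SHA256(old_pcr || measurement)
  printf '%s%s' "$1" "$2" | sha256sum | awk '{print $1}'
}

m_boot=$(printf 'bootloader-image' | sha256sum | awk '{print $1}')
m_kern=$(printf 'kernel-image'     | sha256sum | awk '{print $1}')

pcr=$(extend "$pcr" "$m_boot")      # measure the bootloader first
pcr=$(extend "$pcr" "$m_kern")      # then the kernel; order changes the result

echo "final PCR value: $pcr"
```

Because each extend folds the previous value into the hash, an attestation service comparing the final PCR against a known-good value detects any change in what booted or in what order.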
Question 17 of 60
17. Question
To facilitate the use of various AI models and tools, an administrator needs to install the NGC CLI on the cluster‘s hosts. What is the main benefit of using the NGC CLI in a professional AI infrastructure, and how does it integrate with the control plane‘s workflow?
Correct
Correct: C It allows users to download and manage optimized AI containers, pre-trained models, and scripts directly from the NVIDIA GPU Cloud repository.
The Technical Reason: The NGC CLI (Command Line Interface) is designed to provide programmatic access to the NGC Catalog.
Optimized Software: It provides access to GPU-accelerated Docker containers that are pre-configured with the necessary libraries (CUDA, cuDNN, NCCL) and tuned for maximum performance on NVIDIA hardware like the H100 or A100.
Workflow Integration: In a professional AI “factory,“ the NGC CLI is often integrated into automated scripts or the Control Plane (like Slurm or Kubernetes). For instance, an administrator might use it to pull the latest PyTorch or TensorFlow images to a shared filesystem so that compute nodes can execute containerized jobs without each node needing to reach out to the internet.
The NCP-AII Context: The certification exam validates your ability to perform the “day-zero“ tasks of an infrastructure rollout. Installing the NGC CLI and configuring it with an API Key is the standard procedure for enabling researchers to access NVIDIA‘s private and public registries.
Incorrect Options: A. Graphical user interface for monitoring temperature The NGC CLI is a command-line tool, not a GUI. Real-time monitoring of GPU metrics like temperature, power, and utilization across a cluster is typically handled by NVIDIA Base Command Manager (BCM) or specialized telemetry tools like NVIDIA DCGM (Data Center GPU Manager).
B. Physically format hard drives Formatting hard drives and provisioning the operating system are low-level server deployment tasks. In the NVIDIA ecosystem, this is the role of NVIDIA Base Command Manager (BCM) or traditional PXE-boot solutions. The NGC CLI operates at the application and container layer, well above the physical disk management layer.
D. Replace the standard Linux shell The NGC CLI is an application that runs within a standard Linux shell (like Bash or Zsh). It does not replace the operating system‘s interface. While it supports interacting with Python-heavy workloads (by pulling Python-based containers), the tool itself is a binary used for resource management, not a programming environment.
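The “pull once to a shared filesystem“ pattern mentioned above can be sketched as follows. The ngc and enroot commands are shown only as comments (they need network access, NGC credentials, and a real image tag, which is left as a placeholder); the runnable part merely mocks the staging layout.

```shell
# Real workflow (commented out; needs NGC credentials and enroot installed):
#   ngc registry image pull nvcr.io/nvidia/pytorch:<tag>
#   enroot import docker://nvcr.io#nvidia/pytorch:<tag>
# Mocked staging step so compute nodes can find the image without internet:
SHARED=./demo-shared/containers
mkdir -p "$SHARED"
touch "$SHARED/nvidia+pytorch+demo.sqsh"   # stand-in for the imported squashfs
ls "$SHARED"
```

On a real cluster the shared path would be a parallel filesystem mount (e.g., under /cm/shared), and Slurm jobs would reference the staged .sqsh file instead of pulling from the registry.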
Unattempted
Correct: C It allows users to download and manage optimized AI containers, pre-trained models, and scripts directly from the NVIDIA GPU Cloud repository.
The Technical Reason: The NGC CLI (Command Line Interface) is designed to provide programmatic access to the NGC Catalog.
Optimized Software: It provides access to GPU-accelerated Docker containers that are pre-configured with the necessary libraries (CUDA, cuDNN, NCCL) and tuned for maximum performance on NVIDIA hardware like the H100 or A100.
Workflow Integration: In a professional AI “factory,“ the NGC CLI is often integrated into automated scripts or the Control Plane (like Slurm or Kubernetes). For instance, an administrator might use it to pull the latest PyTorch or TensorFlow images to a shared filesystem so that compute nodes can execute containerized jobs without each node needing to reach out to the internet.
The NCP-AII Context: The certification exam validates your ability to perform the “day-zero“ tasks of an infrastructure rollout. Installing the NGC CLI and configuring it with an API Key is the standard procedure for enabling researchers to access NVIDIA‘s private and public registries.
Incorrect Options: A. Graphical user interface for monitoring temperature The NGC CLI is a command-line tool, not a GUI. Real-time monitoring of GPU metrics like temperature, power, and utilization across a cluster is typically handled by NVIDIA Base Command Manager (BCM) or specialized telemetry tools like NVIDIA DCGM (Data Center GPU Manager).
B. Physically format hard drives Formatting hard drives and provisioning the operating system are low-level server deployment tasks. In the NVIDIA ecosystem, this is the role of NVIDIA Base Command Manager (BCM) or traditional PXE-boot solutions. The NGC CLI operates at the application and container layer, well above the physical disk management layer.
D. Replace the standard Linux shell The NGC CLI is an application that runs within a standard Linux shell (like Bash or Zsh). It does not replace the operating system's interface. While it supports interacting with Python-heavy workloads (by pulling Python-based containers), the tool itself is a binary used for resource management, not a programming environment.
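As a minimal sketch of the staging workflow described above, assuming the NGC CLI and Enroot are installed: the image tag and shared path are illustrative placeholders, and the credentialed commands are shown as comments because they require network access and an NGC API key.

```shell
#!/usr/bin/env sh
# Sketch: stage an NGC container on a shared filesystem so compute nodes
# never need direct internet access. Paths and tags are placeholders.

# One-time interactive setup (prompts for the NGC API key, org, team):
#   ngc config set
# Pull a GPU-optimized framework container from the NGC catalog:
#   ngc registry image pull nvidia/pytorch:24.05-py3
# Convert it to a squashfs image with Enroot for Slurm/Pyxis jobs:
#   enroot import -o /shared/images/nvidia+pytorch+24.05-py3.sqsh \
#       docker://nvcr.io#nvidia/pytorch:24.05-py3

# The commands above need credentials; this runnable part only derives the
# '+'-separated filename convention the comments assume:
IMAGE_TAG="nvidia/pytorch:24.05-py3"
SQSH="/shared/images/$(printf '%s' "$IMAGE_TAG" | tr '/:' '++').sqsh"
echo "$SQSH"
```

Once the `.sqsh` file exists on shared storage, job scripts can reference it directly instead of pulling from the registry at run time.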
Question 18 of 60
18. Question
To ensure the reliability of the AI infrastructure under sustained load, a burn-in test is performed. What is the primary purpose of executing a NeMo burn-in test specifically in the context of an NVIDIA AI factory deployment after the initial system bring-up?
Correct: B To simulate a real-world Large Language Model training workload that stresses the GPUs and the network fabric to identify intermittent hardware failures.
The Technical Reason: While synthetic benchmarks like HPL (High-Performance Linpack) stress raw compute and power, and NCCL-tests measure point-to-point bandwidth, the NeMo burn-in represents an "Application-Level" stress test.
Full-Stack Stress: It utilizes the NVIDIA NeMo framework to run a representative LLM training task (e.g., GPT-3 175B architecture). This forces the system to perform complex all-to-all communication across the NVLink and InfiniBand fabrics simultaneously while keeping the GPU Tensor Cores at high utilization.
Finding "Ghost" Failures: Some hardware issues, such as a marginally failing InfiniBand cable, an unstable NVSwitch, or a GPU with transient memory errors, may not appear during shorter synthetic tests but will cause a training job to "hang" or "crash" during the sustained, varied load of a NeMo run.
The NCP-AII Context: The exam blueprint requires candidates to understand the hierarchy of validation: SMI checks → HPL (Compute) → NCCL (Fabric) → NeMo (Workload). A successful NeMo burn-in is the final "green light" before handing the AI Factory over to data scientists.
Incorrect Options: A. Intercept and inspect encrypted traffic via BlueField-3 While the BlueField-3 DPU can handle security and encryption (via DOCA), the NeMo burn-in is not a security test. Its goal is performance and stability validation of the compute/fabric plane. Intercepting traffic from the NVIDIA GPU Cloud (NGC) is a function of the management or security stack, not a training burn-in.
C. Erase data on third-party storage for BCM installation "Burn-in" refers to a stress test of hardware, not a data-clearing (sanitization) procedure. Preparing storage for Base Command Manager (BCM) is a configuration task that occurs much earlier in the bring-up phase. NeMo requires storage to be ready so it can read datasets and write checkpoints.
D. Calibrate TPM modules via maximum GPU clock speeds TPM (Trusted Platform Module) calibration is not a standard procedure, and TPMs do not require GPU clock speeds for their cryptographic operations. The TPM functions as a hardware Root of Trust for Measured Boot and is independent of the training workload's thermal or computational intensity.
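A NeMo-style burn-in is typically submitted through Slurm with Pyxis container flags. The batch script below is a sketch of that pattern, assuming Pyxis/Enroot are installed on the cluster; the node counts, image path, mounts, and the training script name are illustrative placeholders, not a definitive NeMo invocation.

```shell
#!/bin/bash
# Sketch: Slurm batch script for a sustained, multi-node training burn-in.
# All values below are placeholders for illustration.
#SBATCH --job-name=nemo-burnin
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8        # one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00            # sustained multi-hour load

# Pyxis adds the --container-* flags to srun; Enroot then runs the image
# unprivileged under the submitting user's own credentials.
srun --container-image=/shared/images/nemo-framework.sqsh \
     --container-mounts=/shared/data:/data,/shared/results:/results \
     --container-workdir=/workspace \
     python pretrain_gpt.py --config /data/gpt3_175b.yaml
```

The long `--time` limit is deliberate: the point of the burn-in is sustained, varied load, so a hang or crash partway through is itself a diagnostic signal.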
Question 19 of 60
19. Question
After the physical installation and software configuration, an engineer must perform a multifaceted assessment of the cluster using ClusterKit. Which combination of tests within a standard validation workflow would best verify the end-to-end performance and stability of the GPU communication?
Correct: C Running NCCL tests to measure inter-node and intra-node bandwidth, followed by an HPL burn-in to verify thermal stability and compute consistency.
The Technical Reason: To confirm a cluster is "ready for science," an administrator must validate two distinct but overlapping hardware planes:
The Communication Plane (NCCL): NCCL-tests verify that the high-speed fabrics, both the internal NVLink (intra-node) and the external InfiniBand/RoCE (inter-node), are reaching their theoretical bandwidth limits. This ensures that the collective operations (like All-Reduce) used in distributed training will not be bottlenecked by cable issues or switch misconfigurations.
The Compute and Power Plane (HPL): High-Performance Linpack (HPL) is the standard for stressing the GPUs and CPUs to their thermal limits. A "burn-in" (running HPL for an extended duration, such as 1–2 hours) ensures that the power delivery and cooling systems can sustain peak performance without thermal throttling or hardware failure.
The NCP-AII Context: The certification exam blueprint explicitly lists "Run ClusterKit to perform a multifaceted node assessment," followed by specific requirements to "Perform NCCL burn-in" and "Perform HPL burn-in." Option C describes the standard professional sequence for cluster validation.
Incorrect Options: A. Storage throughput via 'dd' in MIG-enabled sleep state The dd command is a basic utility that does not reflect the complex I/O patterns of AI training. Furthermore, putting GPUs in a "MIG-enabled sleep state" would prevent you from testing the actual data path to the GPUs. End-to-end validation requires the GPUs to be active and processing data to simulate a real workload.
B. NeMo burn-in on DPUs while upgrading HGX firmware Performing a firmware upgrade during a stress test (burn-in) is a critical safety violation that could lead to system instability or "bricked" hardware. Additionally, while the BlueField-3 DPU is essential for networking, a NeMo burn-in is designed to stress the GPUs, not the DPU's ARM cores.
D. Single-node stress test via NGC CLI and BMC cable testing While the NGC CLI is used to pull containers for testing, it is not the stress test tool itself. Furthermore, verifying the BMC management network (OOB) with a cable tester is a low-level physical layer task that happens during the initial "Bring-up" phase. It does not provide any data on the "end-to-end performance" of the high-speed GPU communication fabric.
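A sketch of that two-phase sequence follows. The cluster commands are shown as comments because they need real hardware (nccl-tests binaries and an HPL build); the flags, node counts, and the sample output line are illustrative.

```shell
#!/usr/bin/env sh
# Sketch: NCCL bandwidth validation followed by an HPL burn-in.
# The runnable part parses a fabricated nccl-tests output line.

# Phase 1 - communication plane (all-reduce bandwidth across 8 GPUs);
# -b/-e sweep the message size, -f is the growth factor, -g the GPU count:
#   srun -N 2 --ntasks-per-node=1 ./all_reduce_perf -b 8 -e 8G -f 2 -g 8
# Phase 2 - compute/power plane (run HPL for 1-2 hours at peak load):
#   srun -N 2 ./xhpl

# nccl-tests prints columns roughly like: size count type redop root time
# algbw busbw ... The busbw (bus bandwidth) column is what gets compared to
# the fabric's theoretical limit. Fabricated sample line:
SAMPLE="8589934592 2147483648 float sum -1 45012 190.84 357.83 0"
BUSBW=$(printf '%s\n' "$SAMPLE" | awk '{print $8}')
echo "busbw GB/s: $BUSBW"
```

If the reported busbw falls well short of the fabric's rated bandwidth, cabling and switch configuration are the first suspects before any compute burn-in begins.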
Question 20 of 60
20. Question
A cloud architect is designing a multi-tenant AI environment where different users require varying levels of GPU performance. To optimize the physical layer management, the architect decides to implement MIG. Which command is used to verify the current MIG mode status and list the available GPU instances on a system equipped with the NVIDIA driver?
Correct: D nvidia-smi -L
The Technical Reason: The nvidia-smi (System Management Interface) utility is the primary tool for managing NVIDIA GPUs.
MIG Status: When the -L (or --list-gpus) flag is used, it lists all detected GPUs in the system. If MIG mode is enabled, the output changes from showing a single physical GPU to listing each active GPU Instance and Compute Instance with its own unique UUID.
Verification: For a physical GPU like the H100 or A100, the output will explicitly show the partition profiles (e.g., MIG 1g.10gb, MIG 3g.40gb) currently configured.
The NCP-AII Context: The certification requires you to know how to "Configure MIG (AI and HPC)." This includes enabling MIG mode (nvidia-smi -mig 1), creating instances, and using the -L flag to verify that the OS recognizes the partitioned hardware before passing those resources to a container or orchestrator.
Incorrect Options: A. lsusb -v The lsusb command is used to list USB devices and their properties. NVIDIA GPUs in AI infrastructure are connected via the high-speed PCIe or SXM interface, not USB. This command provides no information regarding GPU partitioning or MIG status.
B. ifconfig -a The ifconfig command (or the modern ip addr) is used to manage and display network interface configurations. While it might show the management or InfiniBand interfaces, it has no visibility into the internal hardware partitioning of a GPU.
C. df -h The df (disk free) command is used to report filesystem disk space usage. It shows information about mounted drives and partitions on the storage system, not the internal compute or memory partitioning of a GPU.
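The enable-and-verify sequence above can be sketched as follows. The nvidia-smi calls are comments (they need a MIG-capable GPU and root); the runnable part counts instances in a fabricated `-L` output, so the UUIDs and profile names are placeholders.

```shell
#!/usr/bin/env sh
# Sketch: enable MIG and verify the partitions the driver exposes.

# Enable MIG mode on GPU 0 (a GPU reset may be required afterwards):
#   nvidia-smi -i 0 -mig 1
# Create GPU instances plus their compute instances from a profile:
#   nvidia-smi mig -cgi 1g.10gb -C
# Verify what the OS now sees:
#   nvidia-smi -L

# Fabricated sample of nvidia-smi -L output once MIG instances exist:
SAMPLE='GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-aaaaaaaa)
  MIG 1g.10gb     Device  0: (UUID: MIG-bbbbbbbb)
  MIG 1g.10gb     Device  1: (UUID: MIG-cccccccc)'

# Each "MIG <profile> Device N" line is one schedulable instance:
COUNT=$(printf '%s\n' "$SAMPLE" | grep -c 'MIG ')
echo "MIG instances: $COUNT"
```

The per-instance MIG UUIDs in the real output are what get handed to containers or orchestrators (e.g., via device environment variables) to pin a tenant to one slice.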
Question 21 of 60
21. Question
When performing the initial physical bring-up of an NVIDIA HGX H100 system, an administrator notices that the Baseboard Management Controller (BMC) reports a power capping event despite the rack PDUs operating within their limits. Considering the critical power requirements for AI factories, which action should be the priority to ensure the server meets the high-performance demands of AI workloads without hardware-induced throttling?
Correct: D. Verify the Power Supply Unit (PSU) redundancy policy in the BMC and ensure all power cables are seated and connected to independent circuits.
This is the priority action because a BMC-reported power capping event despite rack PDUs operating within limits strongly indicates a power delivery configuration issue at the server level, not a cluster-wide power shortage. The NVIDIA documentation for HGX systems details that power capping features must be correctly enabled, particularly for N+N redundant configurations, to ensure safe and high-performance operation. For example, to enable PSU redundancy support, the power budget limit must be set appropriately (e.g., 12 kW) using specific ipmitool commands. If the power cables are not fully seated or are connected to the same circuit rather than independent circuits, the system's power sensing logic will detect a fault and invoke a power cap to protect the hardware, throttling performance. The NCP-AII certification exam blueprint explicitly includes "Validate power and cooling parameters" and "Identify faulty…power supplies" as key tasks during system bring-up and troubleshooting. Addressing the physical power delivery and redundancy configuration directly resolves the root cause, ensuring the server can draw the full power required (up to 10.2 kW for an HGX H100 system) to meet high-performance AI workload demands without hardware-induced throttling.
Incorrect Options:
A. Update the TPM firmware to version 2.0 to allow for higher power draw authorization from the motherboard components during peak loads. This is incorrect. The Trusted Platform Module (TPM) is a security chip used for cryptographic operations and platform integrity (secure boot, encryption), not for managing or authorizing power draw. While initial TPM configuration is part of system bring-up, it has no role in power delivery or capping mechanisms.
B. Decrease the GPU clock frequency via nvidia-smi to manually stay under the current power threshold until more power is available. This is incorrect. Manually throttling GPU clocks via nvidia-smi would reduce performance, which is contrary to the goal of "ensuring the server meets the high-performance demands." This is a temporary workaround that accepts the power cap, rather than resolving the underlying configuration or hardware issue causing it. The priority is to fix the power delivery so the system can run at full capacity.
C. Reinstall the NVIDIA Container Toolkit to recalibrate the power sensing logic of the underlying operating system and driver stack. This is incorrect. The NVIDIA Container Toolkit is a software component that enables GPU access within containers; it has no functionality related to power sensing, power capping, or hardware power delivery logic. Reinstalling it would not affect BMC-reported power events.
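The BMC-side checks implied above can be sketched with ipmitool, assuming in-band access to the BMC. Sensor names vary by vendor, so the runnable part parses a fabricated sensor line rather than live BMC output.

```shell
#!/usr/bin/env sh
# Sketch: verify power delivery and redundancy state via the BMC.

# Review the System Event Log for power and redundancy events:
#   ipmitool sel elist | grep -iE 'power|redund'
# Read the current chassis power draw:
#   ipmitool dcmi power reading
# List sensors and filter for the PSUs:
#   ipmitool sensor | grep -i psu

# Fabricated 'ipmitool sensor'-style line (fields separated by '|'):
SAMPLE='PSU2_Status | 0x1 | discrete | ok'
STATE=$(printf '%s\n' "$SAMPLE" | awk -F'|' '{gsub(/ /,"",$4); print $4}')
echo "PSU2 state: $STATE"
```

A PSU sensor that is not "ok", or SEL entries reporting redundancy loss, points at the cabling or redundancy-policy misconfiguration described above rather than a rack-level power shortfall.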
Question 22 of 60
22. Question
A cluster node is reporting a GPU Fallen Off Bus error in the system logs. After verifying the physical seating and power connections, what is the next logical step an administrator should take to troubleshoot this hardware fault on an NVIDIA HGX system?
Correct: A Check the dmesg output for PCIe AER messages and use the BMC to check for any critical hardware events or power faults related to that GPU slot.
The Technical Reason: When a GPU "falls off the bus," it is usually due to a fatal PCIe link error or a sudden power loss to the module.
dmesg & AER: The Linux kernel‘s dmesg log will contain Advanced Error Reporting (AER) messages. These messages provide the specific PCIe error status (e.g., Uncorrectable Error, Receiver Error), which helps determine if the issue is signal integrity or a hardware failure.
BMC Integration: On an HGX system, the GPUs are powered and monitored by the Baseboard Management Controller (BMC). The BMC logs (SEL – System Event Logs) will capture hardware-level alerts that the OS cannot see, such as a voltage regulator failure or a thermal trip on the HGX baseboard.
The NCP-AII Context: The certification validates your ability to use OOB (Out-of-Band) management tools. Checking the BMC and low-level kernel logs is the “NVIDIA-recommended“ first step before escalating to physical component replacement.
Incorrect Options: B. Re-install NGC CLI and use ‘ngc fix-gpu‘ The NGC CLI is a tool for managing containers, models, and datasets in the cloud; it has no hardware diagnostic or “bus-reset“ capabilities. Furthermore, there is no such command as ngc fix-gpu. Hardware bus issues must be handled at the driver, BIOS, or firmware level, not through a cloud repository client.
C. Swap InfiniBand transceivers While InfiniBand is critical for cluster communication, it is part of the Networking Plane. A “GPU Fallen Off Bus“ error is a PCIe/Internal Fabric issue. Swapping network cables or transceivers will not resolve or diagnose a failure in the communication link between the CPU and the GPU on the motherboard.
D. Increase fan speed and disable MIG to re-sync with DPU While cooling is important, forcing 100% fan speed is a reactive measure that doesn‘t diagnose the root cause of a bus failure. Furthermore, the GPU does not “sync its clock“ with the BlueField-3 DPU; they operate on separate domains. Disabling MIG is also irrelevant, as a GPU that has fallen off the bus is invisible to the driver and cannot have its MIG configuration modified.
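As a minimal illustration of the log triage described above, the sketch below filters a dmesg excerpt for AER messages and the Xid 79 (“fallen off the bus”) event. The sample lines and PCI addresses are invented for the example; on a real node you would scan the actual kernel log output and pair it with the BMC’s SEL (e.g., via ipmitool or Redfish).

```python
import re

# Hypothetical dmesg excerpt for illustration only; on a live node, read the
# real kernel log (the output of `dmesg`) instead of this sample string.
dmesg_sample = """\
pcieport 0000:3a:00.0: AER: Uncorrected (Fatal) error received: id=1d00
pcieport 0000:3a:00.0: AER: device [10de:2330] error status/mask=00000020/00000000
NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
"""

# Flag AER reports and Xid 79 ("GPU has fallen off the bus") events.
patterns = [r"AER:", r"Xid \(PCI:[0-9a-f:.]+\): 79"]
suspect_lines = [
    line for line in dmesg_sample.splitlines()
    if any(re.search(p, line) for p in patterns)
]

for line in suspect_lines:
    print(line)
```

If the AER entries point at the GPU’s slot and the SEL shows a power or thermal event at the same timestamp, the fault is hardware-level and escalation is justified.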
Question 23 of 60
23. Question
During the initial configuration of a third-party storage solution for an AI cluster, the administrator must ensure that the storage fabric is properly integrated with the compute nodes. What is a critical requirement for configuring the initial parameters of the storage system to support high-throughput NVIDIA GPUDirect Storage (GDS) operations?
Correct
Correct: A. Enable RDMA support on the storage controllers and ensure the storage network is on the same subnet as the high-speed compute fabric for direct paths.
This is the critical requirement because GPUDirect Storage (GDS) is designed to enable a direct data path for direct memory access (DMA) transfers between GPU memory and storage. For external storage systems to support GDS operations, they must support RDMA network interfaces. The technical documentation specifies that for network-attached storage (file systems) to work with GDS, the file system must support RDMA remote access, typically implemented either through modified NFS protocols (such as NFS over RDMA) or through OFED interfaces. Additionally, when using Mellanox ConnectX-5 or later adapters for GDS, the host channel adapters (HCAs) must be configured in InfiniBand or RoCE v2 mode, and having the storage network on the same subnet as the high-speed compute fabric ensures proper RDMA connectivity for direct GPU-to-storage data paths.
Incorrect:
B. Set the storage LUNs to be managed by the TPM to encrypt all data-in-flight before it reaches the NVIDIA ConnectX-7 network adapters in the nodes. This is incorrect. The Trusted Platform Module (TPM) is a security chip used for cryptographic operations and platform integrity, such as secure boot and encryption key storage. It does not manage storage LUNs or encrypt data-in-flight for network adapters. Data-in-flight encryption for RDMA traffic would be handled by other mechanisms (such as IPsec or TLS), not TPM management of LUNs.
C. Configure the storage to use standard NFS version 3 with no specialized drivers to ensure maximum compatibility with the Base Command Manager. This is incorrect. Standard NFSv3 without RDMA support does not enable the direct data path required for GDS operations. The documentation specifies that for file system storage to support GDS, it must support RDMA network interfaces. PowerScale OneFS v9.2 introduced NFS over RDMA (NFSoRDMA) specifically to enable GDS compatibility. Standard NFSv3 would force data to go through CPU bounce buffers, eliminating the performance benefits of GDS.
D. Disable the PCIe peer-to-peer communication in the system BIOS to prevent the storage controllers from interfering with the internal NVLink traffic. This is incorrect. Disabling PCIe peer-to-peer communication would actually prevent GDS from functioning properly. GDS relies on PCIe peer-to-peer transfers to enable direct data movement between storage devices and GPU memory without passing through CPU memory. The GDS Best Practices Guide explicitly recommends disabling PCIe Access Control Services (ACS) to enable proper peer-to-peer functionality, as ACS forces peer-to-peer PCIe transactions to go through the PCIe Root Complex, preventing GDS from bypassing the CPU.
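The “same subnet as the high-speed compute fabric” requirement from option A can be sanity-checked programmatically. Below is a small sketch using Python’s ipaddress module; the /24 prefix and controller addresses are made-up assumptions, not values from any real deployment.

```python
import ipaddress

# Assumption: the high-speed compute fabric uses this IP segment.
# Substitute your site's real prefix.
compute_fabric = ipaddress.ip_network("192.168.100.0/24")

# Hypothetical storage-controller RDMA interface addresses.
storage_controllers = {
    "ctrl-a": ipaddress.ip_address("192.168.100.21"),
    "ctrl-b": ipaddress.ip_address("10.0.50.21"),  # misconfigured: wrong subnet
}

for name, addr in storage_controllers.items():
    on_fabric = addr in compute_fabric
    print(f"{name}: {addr} {'OK' if on_fabric else 'NOT on compute fabric'}")
```

A controller whose RDMA interface lands outside the fabric’s subnet would force routed (or CPU-mediated) paths and defeat the direct GDS data path.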
Question 24 of 60
24. Question
An engineer needs to optimize the storage performance for a cluster of NVIDIA servers. They observe that the CPU utilization is very high during data ingestion, which is slowing down the training process. Which technology should be implemented to resolve this specific bottleneck?
Correct
Correct: D Enable GPUDirect Storage (GDS) to allow the data to be transferred directly from the storage to the GPU memory without involving the CPU.
The Technical Reason: Traditionally, data must travel from storage to CPU-managed system RAM and then to GPU VRAM. This “standard path” consumes significant CPU cycles and increases latency.
Direct Path: GPUDirect Storage (GDS) creates a direct DMA (Direct Memory Access) path between local NVMe or remote storage (NVMe-oF) and the GPU memory.
CPU Offloading: By bypassing the CPU, GDS effectively reduces CPU utilization by up to 80% during heavy I/O and provides significantly higher bandwidth (up to 2x-8x improvement) and lower latency.
The NCP-AII Context: The exam expects you to recognize GDS as the “Performance Force Multiplier” for data-intensive pipelines. You should know that implementing GDS requires the nvidia-fs kernel driver and the use of the cuFile API within applications to achieve these benefits.
Incorrect Options: A. Increase the number of CPU cores While adding more cores might provide more “headroom” for the overhead, it does not solve the underlying architectural inefficiency. In a high-density AI Factory (like an HGX H100 system), even the most powerful CPUs can become a bottleneck when trying to feed multiple 400Gbps network links or high-speed NVMe arrays. Increasing core count is an expensive “brute force” approach that fails to address the actual bottleneck.
B. Replace NVMe SSDs with slower SATA drives Slowing down the storage to “match” the CPU’s processing speed is counterproductive. This would severely degrade the overall training performance and result in the GPUs idling while waiting for data (under-utilization). The goal of AI infrastructure optimization is to accelerate the slowest component, not to throttle the fastest ones.
C. Run the training job entirely on the CPU Running AI training (especially Large Language Models or complex computer vision) on a CPU is thousands of times slower than on a GPU. This defeats the entire purpose of an NVIDIA-accelerated AI infrastructure and would lead to an immediate failure to meet performance SLAs.
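A quick readiness check for the nvidia-fs kernel driver mentioned above can be sketched as follows. The /proc/modules-style sample text is fabricated for illustration; on a live system you would read /proc/modules itself (NVIDIA also ships a gdscheck utility with the GDS package for a fuller report).

```python
# Illustrative /proc/modules-style sample; on a real node, read the actual
# /proc/modules file instead of this string.
proc_modules_sample = """\
nvidia_fs 258048 0 - Live 0x0000000000000000
nvme 49152 4 nvme_core, Live 0x0000000000000000
nvidia 56717312 447 nvidia_fs,nvidia_uvm, Live 0x0000000000000000
"""

# First whitespace-separated token on each line is the module name.
loaded = {line.split()[0] for line in proc_modules_sample.splitlines() if line.strip()}
gds_driver_loaded = "nvidia_fs" in loaded

print("nvidia-fs loaded:", gds_driver_loaded)
```

If nvidia_fs is absent, applications using the cuFile API fall back to (or fail without) the CPU bounce-buffer path, so this is a sensible first check when GDS throughput looks wrong.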
Question 25 of 60
25. Question
A data center engineer is performing the initial system and server bring-up for a new HGX-based AI factory deployment. During the validation phase, the engineer must verify that the power and cooling parameters are within the specific design envelopes for high-density GPU workloads. Which sequence of actions represents the most accurate process for ensuring the infrastructure can support the peak power draw and thermal output of the installed H100 GPUs?
Correct
Correct: C Monitor BMC power telemetry to verify peak wattage during stress tests and ensure PDU thresholds are configured to handle transient spikes without tripping breakers.
The Technical Reason: NVIDIA HGX H100 systems can draw massive amounts of power (often exceeding 10kW per server under load).
BMC Telemetry: The Baseboard Management Controller (BMC) provides granular, real-time data on power consumption and thermal sensors that the OS cannot always see accurately.
Transient Spikes: AI workloads, particularly during synchronization phases of training, can cause sudden, massive increases in current. If the Power Distribution Unit (PDU) thresholds or circuit breakers are not sized for these “transients,” the entire rack could lose power.
Validation: Running a stress test (like HPL or NeMo burn-in) while monitoring this telemetry confirms the infrastructure can handle the “worst-case” load.
The NCP-AII Context: The exam expects you to use Out-of-Band (OOB) management tools (BMC/Redfish) to validate that the physical site infrastructure matches the requirements of the AI hardware.
Incorrect Options: A. Check the utility meter and adjust fan speeds manually A utility meter provides data for the entire facility, which is too broad to validate a specific server’s power envelope. Additionally, fan speeds on HGX systems are managed by complex automated thermal algorithms (PID loops) within the BMC/Firmware; manual adjustment via the OS is insufficient and dangerous for high-density 700W GPUs.
B. Inspect air filters and use a handheld thermometer While physical inspection is part of general maintenance, it is not a quantitative validation of the system’s “design envelope.” Measuring the exhaust of a single PSU does not account for the massive heat generated by the GPUs and NVSwitches, which is primarily exhausted through the high-speed fan modules at the rear of the chassis.
D. Disable OOB management and set GPU power limits to lowest Disabling the Out-of-Band (OOB) controller (the BMC) is counterproductive, as it is the primary tool used to monitor and manage the system’s safety. Setting the GPU power limit to its lowest setting would prevent the system from ever reaching its peak design envelope, meaning you haven’t actually validated whether the power and cooling can handle a real AI workload.
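The PDU-threshold reasoning above amounts to simple headroom arithmetic. The sketch below uses invented figures (server peak wattage, spike factor, and PDU rating are all assumptions; take the real peak from BMC telemetry during the stress test) to show the shape of the check.

```python
# Back-of-envelope PDU sizing sketch with illustrative numbers only.
servers_per_rack = 4
peak_watts_per_server = 10_200   # assumed BMC-reported peak under stress test
transient_spike_factor = 1.25    # assumed allowance for synchronization bursts
pdu_capacity_watts = 60_000      # assumed PDU/breaker continuous rating

peak_rack_watts = servers_per_rack * peak_watts_per_server
worst_case_watts = peak_rack_watts * transient_spike_factor

print(f"steady peak: {peak_rack_watts} W, with transients: {worst_case_watts:.0f} W")
print("within PDU capacity:", worst_case_watts <= pdu_capacity_watts)
```

The point of the spike factor is that sizing to the steady-state peak alone leaves no margin for the synchronized current bursts that trip breakers.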
Question 26 of 60
26. Question
A cluster administrator is running the NeMo burn-in test to validate the system‘s readiness for large language model (LLM) training. If the test fails on a subset of nodes with ‘NCCL Timeout‘ errors, which of the following is the most logical troubleshooting path to determine the root cause of the failure in the AI infrastructure?
Correct
Correct: A Check the physical cabling and signal quality on the InfiniBand fabric, then verify that the NCCL_IB_HCA environment variables are correctly configured for the multi-rail topology.
The Technical Reason: In a rail-optimized AI factory (like a DGX SuperPOD), each GPU is typically mapped to a specific InfiniBand HCA (Host Channel Adapter).
Physical Layer: A single bad cable or a “flapping” link on the InfiniBand fabric will prevent a collective operation (like AllReduce) from completing, leading to a global timeout across all participating nodes.
Environment Variables: NCCL requires explicit guidance to navigate complex multi-rail networks. If the NCCL_IB_HCA variable is misconfigured (e.g., excluding a necessary HCA) or if NCCL_SOCKET_IFNAME points to a management network instead of the high-speed fabric, the library will fail to establish the necessary RDMA paths between GPUs.
The NCP-AII Context: The exam validates your ability to follow a structured troubleshooting hierarchy: Physical → Firmware → Software Configuration. Verifying the link quality (ibdiagnet) and then the software environment variables is the “NVIDIA-Certified” method for resolving fabric-level hangs.
Incorrect: B. Increase the timeout value in the NeMo configuration Increasing the timeout is a “band-aid” solution that ignores the underlying failure. NCCL timeouts in a healthy cluster are rare; if they occur, it means a node is physically unreachable or a process has crashed. Allowing more time for a failed communication round will not fix a broken physical link and will only delay the inevitable job failure.
C. Reinstall drivers and reboot the head node Reinstalling drivers is an extreme measure that should only be taken after logs confirm a driver-level corruption. Rebooting the head node is irrelevant to “NCCL Timeout” errors occurring on compute nodes, as NCCL communication happens directly between the compute nodes’ GPUs and NICs, bypassing the head node entirely during the execution phase.
D. Disable InfiniBand and use the 1GbE management network The 1GbE management network lacks the bandwidth and RDMA capabilities required for LLM training. Forcing NCCL to use a slow management network for a NeMo burn-in would result in extreme performance degradation (orders of magnitude slower) and would likely trigger even more timeouts due to the massive congestion on the low-speed interface.
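A job script typically pins the NCCL variables discussed above before launch. The sketch below shows the idea with hypothetical HCA and interface names (mlx5_* and bond0 are placeholders; query the real layout on your nodes with tools such as ibstat or ibdev2netdev).

```python
import os

# Hypothetical NCCL environment for a rail-optimized multi-rail node.
nccl_env = {
    # Restrict NCCL to the rail-optimized HCAs (port 1 of each), keeping it
    # off the storage and management adapters. Device names are placeholders.
    "NCCL_IB_HCA": "mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1",
    # Bootstrap/socket traffic must use a reachable host interface, not a
    # wrong or nonexistent one; "bond0" is an assumed interface name.
    "NCCL_SOCKET_IFNAME": "bond0",
}

os.environ.update(nccl_env)
print(os.environ["NCCL_IB_HCA"])
```

Setting these explicitly (in the sbatch script or container environment) makes the mapping auditable, so a timeout can quickly be correlated with either a physical-link fault or a wrong HCA list.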
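The corrective sequence for this question (physical check first, then the NCCL environment) can be sketched as a job prologue. The HCA names mlx5_0 through mlx5_3 and the IPoIB interface name ib0 are placeholder assumptions; confirm the real device names with ibstat and ip link on the compute nodes.

```shell
# Placeholders: mlx5_0..mlx5_3 and ib0 are assumed device names; verify with `ibstat` / `ip link`.
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3   # pin NCCL to the compute-fabric rails
export NCCL_SOCKET_IFNAME=ib0                    # keep bootstrap traffic off the 1GbE management net
export NCCL_DEBUG=INFO                           # log which HCAs and paths NCCL actually selects
# Physical-layer sweep first (run from a fabric management host; flags vary by version):
#   ibdiagnet      # then inspect the report for link, symbol-error, and width/speed issues
```

With NCCL_DEBUG=INFO set, the job log shows which interface and HCAs NCCL selected, which quickly confirms whether the multi-rail configuration took effect.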
Question 27 of 60
27. Question
A deployment team needs to validate the storage subsystem performance for a large NeMo-based Large Language Model (LLM) training job. Which test should be prioritized to ensure that the storage can handle the massive checkpointing requirements of the training process?
Correct: B. A sequential write throughput test using Large Block sizes (e.g., 1MB+) to simulate the frequent saving of model checkpoints (NeMo burn-in).
The Technical Reason: Checkpointing involves saving the entire state of a model—including weights, optimizer states, and gradients—to persistent storage so that training can be resumed in case of a hardware failure.
I/O Pattern: Checkpoints are massive, monolithic files (often terabytes in size for models like GPT-3 or Llama 3). Writing these files is a sequential write operation.
Block Size: Because the data volume is so high, using Large Block sizes (1MB, 4MB, or even 16MB) is necessary to achieve the maximum throughput of the storage fabric (e.g., InfiniBand or 400GbE) and the underlying parallel filesystem (like Lustre, GPFS, or Weka).
Synchronous Stalls: In many configurations, training stops while the checkpoint is being written. Therefore, the higher the sequential write throughput, the less "dead time" the GPUs spend idling.
The NCP-AII Context: The exam validates your understanding of the NVIDIA-Certified Storage requirements. You must know that while "read" performance is critical for data ingestion, "write" throughput is the primary bottleneck for the reliability and efficiency of large-scale distributed training.
Incorrect Options: A. GPU-to-GPU P2P test A Peer-to-Peer (P2P) test verifies NVLink or PCIe communication between GPUs. While this is essential for training speed (All-Reduce operations), it does not test the storage subsystem. Storage data typically moves from the GPU/Host to an external storage array, which is a different data path than P2P.
C. Network latency test using 'traceroute' traceroute is a basic networking tool used to identify the path packets take across a network. While latency is important for some applications, storage for AI training is primarily sensitive to bandwidth (throughput) rather than the number of network hops. A storage array within 3 hops could still have poor throughput if the links are congested or misconfigured.
D. Random read IOPS test using small 4KB blocks Small-block random read IOPS (Input/Output Operations Per Second) is the primary metric for Inference workloads or metadata-heavy tasks. However, it is the opposite of what is needed for Checkpointing. Using 4KB blocks would fail to saturate the high-bandwidth links of an AI Factory and would not accurately simulate the behavior of a NeMo training job saving a 1TB model state.
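A sketch of such a test with fio, mirroring the checkpoint pattern described above; the target directory, file size, and job count are illustrative assumptions and must be sized to the filesystem and network actually under test.

```shell
# Checkpoint-style burn-in: large-block, sequential, direct-I/O writes.
# TARGET_DIR is a placeholder mount point on the parallel filesystem under test.
TARGET_DIR=${TARGET_DIR:-/mnt/pfs/ckpt-test}
FIO_ARGS="--name=ckpt-write --rw=write --bs=4M --size=16G --numjobs=8 \
 --iodepth=16 --direct=1 --ioengine=libaio --group_reporting --directory=$TARGET_DIR"
if command -v fio >/dev/null 2>&1 && [ -d "$TARGET_DIR" ]; then
  fio $FIO_ARGS          # report aggregate sequential write bandwidth
else
  echo "fio sketch (not executed): fio $FIO_ARGS"
fi
```

The key choices are --rw=write (sequential), --bs=4M (large blocks), and --direct=1 (bypass the page cache), so the result reflects the storage fabric rather than host RAM.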
Question 28 of 60
28. Question
A storage administrator is optimizing the storage layer for an AI factory using NVIDIA Magnum IO GPUDirect Storage (GDS). The users are reporting that the 'checkpointing' phase of their training jobs is taking too long. How does enabling and optimizing GPUDirect Storage help in this specific scenario?
Correct: D. GDS provides a direct data path between the GPU memory and the storage (NVMe/NVMe-oF), bypassing the CPU's bounce buffers and reducing latency and CPU overhead.
This is correct because GPUDirect Storage (GDS) is specifically designed to address I/O bottlenecks such as slow checkpointing by creating a direct memory access (DMA) path between GPU memory and storage. The traditional checkpointing process requires data to be copied through a temporary "bounce buffer" in CPU system memory, which involves two copy operations and consumes CPU cycles. GDS eliminates this inefficiency by enabling storage devices (NVMe/NVMe-oF) to transfer data directly to and from GPU memory without burdening the CPU. This direct path reduces latency, increases bandwidth, and significantly speeds up checkpoint operations, as demonstrated by technologies like FastPersist that leverage GDS to make checkpointing overhead negligible during training.
Incorrect:
A. GDS encrypts the checkpoint files so they take up 50% less space on the NVMe drives, thus reducing the time it takes to write them. This is incorrect. GDS does not perform encryption or compression of data. Its function is strictly focused on creating a direct data movement path between GPU memory and storage to improve I/O performance. Data reduction and protection techniques such as compression or encryption are separate functions handled by other software layers.
B. GDS automatically deletes old checkpoints to free up space, ensuring that the current write operation always has the fastest available blocks. This is incorrect. GDS has no capability to manage storage space, delete files, or perform any storage lifecycle management. It is a data path optimization technology that enables direct DMA transfers, not a storage management tool.
C. GDS increases the GPU's clock speed during I/O operations to process the incoming data stream more quickly. This is incorrect. GDS does not modify GPU clock speeds or any hardware performance states. Clock speed management is handled by power management subsystems and tools like nvidia-smi. GDS solely optimizes the data path between storage and GPU memory by eliminating CPU bounce buffers.
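Before trusting the direct path, it is worth sanity-checking the GDS installation on the node. This is a sketch: /etc/cufile.json is the documented default config location, but the gdscheck tool path varies by CUDA toolkit version and install layout.

```shell
# Check for the cuFile configuration that GDS reads at startup.
CUFILE_JSON=${CUFILE_JSON:-/etc/cufile.json}
if [ -f "$CUFILE_JSON" ]; then
  echo "cuFile config found: $CUFILE_JSON"
else
  echo "cuFile config not found: $CUFILE_JSON (GDS would fall back to compatibility mode)"
fi
# Full platform/driver support matrix (tool path varies by CUDA version):
#   /usr/local/cuda/gds/tools/gdscheck -p
```

If the platform check reports unsupported filesystems or drivers, cuFile silently falls back to the CPU bounce-buffer path, which would explain a GDS install that shows no checkpointing speedup.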
Question 29 of 60
29. Question
When configuring an NVIDIA BlueField network platform to act as a secure co-processor in an AI infrastructure, the administrator must decide between the Separated Host and Embedded Function modes. If the goal is to have the BlueField DPU manage the networking and security policies independently of the host x86 CPU, which mode is required and what is the primary management interface used for DPU configuration?
Correct: B. The DPU should be set to DPU Mode (Embedded Function), and the primary management is done through the oob_net0 interface or a console connection to the DPU ARM cores.
This is correct because DPU Mode, also known as embedded CPU function ownership (ECPF) mode, is the default mode for BlueField DPU SKUs where the embedded Arm system controls the NIC resources and data path independently of the host x86 CPU. In this mode, "the NIC resources and functionality are owned and controlled by the embedded Arm subsystem," and "all network communication to the host flows through a virtual switch control plane hosted on the Arm cores". This enables the DPU to act as a secure co-processor managing networking and security policies separately from the host.
Regarding management interfaces, the documentation specifies two primary methods for configuring and managing the DPU in this mode. First, the oob_net0 interface is explicitly defined as "a gigabit Ethernet interface which provides TCP/IP network connectivity to the Arm cores" and "is intended to be used for management traffic (e.g., file transfer protocols, SSH, etc)". Second, direct console access to the Arm cores provides configuration capability, as the DPU Arm subsystem runs its own operating system and services. The DPU Mode installation documentation confirms that after initial setup, "the BlueField should be configured with an IP address, all required settings, has up-to-date software component versions, and is ready to use".
Incorrect:
A. The DPU must be in Bypass Mode to ensure that the 400Gb/s traffic does not get inspected by the ARM cores, which would otherwise create a significant bottleneck for AI training. This is incorrect. There is no operational mode called "Bypass Mode" defined in the NVIDIA documentation. While traffic optimization is addressed through the fast path/slow path architecture, where "the Arm processor can define rules in the eswitch through the ECPF, allowing packets to bypass the Arm processor and be processed directly by the eswitch", this is a performance optimization within DPU Mode, not a separate mode. Furthermore, this option incorrectly suggests that Arm core inspection is inherently a bottleneck, whereas the DPU is designed specifically to offload these functions from the host CPU.
C. The DPU should be in Virtualization Mode, allowing the x86 host to manage the DPU firmware directly through the BIOS setup menu of the server without needing a separate OS on the DPU. This is incorrect. "Virtualization Mode" is not a recognized operational mode for BlueField DPUs. Additionally, in DPU Mode, "the embedded Arm system runs services that manage the NIC resources and data path", meaning a separate OS on the DPU Arm cores is required and fundamental to its operation. The documentation confirms that "the driver on the host side can only be loaded after the driver on the BlueField has loaded and completed NIC configuration", further establishing that the DPU operates independently with its own software stack.
D. The DPU should be set to NIC Mode (Separated Host), where the configuration is managed entirely through the host's standard Linux network manager without ARM intervention. This is incorrect because in NIC Mode, "the Arm cores of BlueField are inactive, and the device functions as an NVIDIA® ConnectX® network adapter". This mode explicitly does not enable the DPU to act as a secure co-processor managing networking independently; it reverts the device to traditional NIC functionality where the host controls all networking. While configuration in NIC Mode would be managed through the host, this contradicts the stated goal of having the DPU manage networking and security policies independently of the host x86 CPU.
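A minimal host-side sketch for confirming which ownership mode a BlueField is in: the mst device path shown is a placeholder (enumerate the real ones with mst status), the BlueField-2 device naming is assumed, and the out-of-band address used for day-to-day management is whatever oob_net0 was assigned during provisioning.

```shell
# Query the BlueField ownership mode from the host (BlueField-2 device name assumed).
MST_DEV=${MST_DEV:-/dev/mst/mt41686_pciconf0}    # placeholder path; see `mst status`
if command -v mlxconfig >/dev/null 2>&1 && [ -e "$MST_DEV" ]; then
  mlxconfig -d "$MST_DEV" q | grep -i INTERNAL_CPU_MODEL   # EMBEDDED_CPU = DPU mode
else
  echo "mlxconfig sketch (not executed): mlxconfig -d $MST_DEV q | grep INTERNAL_CPU_MODEL"
fi
# Day-to-day management then goes over the Arm out-of-band port:
#   ssh <admin-user>@<dpu-oob-ip>    # oob_net0 address assigned at provisioning
```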
Question 30 of 60
30. Question
An administrator needs to partition an NVIDIA A100 GPU into multiple instances to support concurrent small-scale training jobs and inference services. Which technology should be configured, and what is a key requirement for this configuration to be persistent across system reboots?
Correct: C. Configure Multi-Instance GPU (MIG) and ensure the MIG mode is enabled via nvidia-smi with the -mig 1 (or equivalent) flag, followed by a reboot.
The Technical Reason: MIG Technology: MIG allows an A100 to be split into up to seven independent GPU instances. Each instance has its own dedicated high-bandwidth memory, cache, and compute cores, providing guaranteed Quality of Service (QoS) and fault isolation.
Persistence: MIG mode is a persistent hardware state. Enabling it requires the nvidia-smi -i <GPU_ID> -mig 1 command. Because this change reconfigures the GPU's internal memory and engine partitioning, a system reboot or a GPU reset is mandatory to initialize the new hardware identity.
The NCP-AII Context: The exam expects you to know that MIG is managed at the driver level via nvidia-smi. Persistence is often verified using nvidia-smi -q -d MIG, which shows whether the mode is "Enabled" and "Pending" (if a reboot is still required).
Incorrect Options: A. NVLink Bridge settings in the BIOS NVLink is used to interconnect multiple GPUs to act as a single unit (scaling up); it is not a partitioning technology (scaling out). While the BIOS/SBIOS is used for initial hardware discovery, partitioning logic is handled by the NVIDIA driver and the GPU‘s firmware, not through “virtual lanes“ in the BIOS.
B. NVIDIA vGPU profiles within VMware ESXi While vGPU is a valid virtualization technology, it is primarily used in Virtual Desktop Infrastructure (VDI) or cloud environments where a hypervisor (like ESXi) manages the slices. In a professional “AI Factory“ or bare-metal environment (the focus of NCP-AII), MIG is the preferred method because it provides physical hardware-level isolation without the overhead of a hypervisor.
D. Slurm logical partitioning in slurm.conf Slurm is a workload manager that allocates resources; it does not create hardware partitions. Slurm can be configured to recognize existing MIG instances (via the GRES configuration), but it cannot physically partition a GPU. If the underlying MIG configuration isn‘t set via the NVIDIA driver first, Slurm will only see one large, undivided GPU.
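The enable-reboot-verify-partition flow described above can be sketched as a command sequence. This is a hedged illustration, not a full procedure: GPU index 0 and profile ID 9 (the 3g.20gb profile on an A100-40GB) are assumptions, and the commands must be run as root on MIG-capable hardware.

```
sudo nvidia-smi -i 0 -mig 1       # enable MIG mode on GPU 0 (persistent hardware state)
sudo reboot                        # or a GPU reset, to apply the pending mode change
# After reboot, verify the mode and carve instances:
nvidia-smi -q -d MIG               # should report MIG Mode : Enabled, no longer Pending
sudo nvidia-smi mig -cgi 9,9 -C    # create two GPU instances (and their compute instances)
nvidia-smi -L                      # lists the resulting MIG devices visible to workloads
```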
Question 31 of 60
31. Question
When optimizing performance for an AI cluster consisting of nodes with AMD EPYC processors and NVIDIA H100 GPUs, which BIOS/OS tuning parameter is most critical for ensuring low-latency communication between the CPUs and GPUs?
Correct
Correct: C Ensuring the IOMMU is configured correctly and setting the Determinism Slider to 'Performance' to maintain consistent CPU clock frequencies.
The Technical Reason: Determinism Slider: AMD EPYC processors feature "Determinism Control." By default, the CPU may vary its clock speed based on thermal and power conditions (Power Determinism). Setting the Determinism Slider to "Performance" (Performance Determinism) forces the CPU to maintain a consistent, predictable frequency. This is critical for AI workloads, where the CPU must feed data to the GPUs at a constant, high rate to avoid "starving" the Tensor Cores.
IOMMU: The IOMMU (Input-Output Memory Management Unit) must be enabled and correctly configured (often with the iommu=pt, or "pass-through," kernel parameter) so that GPUs and DPUs can access system memory directly without unnecessary software translation overhead, which is vital for GPUDirect RDMA.
The NCP-AII Context: The exam specifically tests your ability to "Execute performance optimization for AMD and Intel servers." On AMD platforms, moving away from "Eco" or "Auto" power settings to Performance Determinism is a documented NVIDIA-Certified best practice for HGX and DGX systems.
Incorrect Options: A. Setting the BlueField-3 DPU to 'Bridge Mode': "Bridge Mode" is a networking configuration; it does not let the AMD Infinity Fabric (the internal CPU-to-CPU interconnect) manage InfiniBand traffic. InfiniBand traffic is handled by the DPU's internal hardware and the Subnet Manager, while the Infinity Fabric is strictly for die-to-die or socket-to-socket CPU communication.
B. Disabling PCIe Gen5 and forcing Gen3: The NVIDIA H100 is a PCIe Gen5 device. Forcing the system to Gen3 would cut the available bandwidth from about 63 GB/s per x16 slot to roughly 16 GB/s, creating a massive bottleneck for "North-South" data transfers; it would never be recommended as an optimization. The NVIDIA Container Toolkit does not require "retries" that would be solved by lowering the hardware link speed.
D. Enabling 'Eco-Mode' in the BIOS: Eco-Mode is designed for power saving and limits the CPU's TDP (Thermal Design Power). In a high-performance AI cluster, this leads to significant CPU throttling. While it might leave "thermal headroom," the goal of an AI Factory is to deliver maximum power to both the CPU and GPU simultaneously so the CPU can keep up with the data-loading demands of the H100s.
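A minimal sketch of applying the iommu=pt kernel parameter mentioned above, assuming an Ubuntu-style GRUB layout (the file path and regeneration command differ by distribution; this is an illustration, not a validated procedure):

```
# Append iommu=pt to the kernel command line in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="... iommu=pt"
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 iommu=pt"/' /etc/default/grub
sudo update-grub    # RHEL-family: grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
# Verify after reboot that the parameter took effect:
grep -o 'iommu=pt' /proc/cmdline
```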
Question 32 of 60
32. Question
During a routine audit of an Intel-based AI server, an administrator finds that one of the two redundant power supplies (PSUs) has failed. What is the most immediate risk to the AI workloads currently running on that node, and how should it be addressed?
Correct
Correct: C The system may lose redundancy, and if the remaining PSU cannot handle the full peak load of the GPUs, the server could crash or throttle performance.
The Technical Reason: NVIDIA-Certified systems (such as HGX or DGX platforms) consume massive amounts of power, often between 6.5 kW and 10.2 kW per node. They typically use an N+N or N+1 redundant power configuration (e.g., six PSUs in a 3+3 or 4+2 arrangement).
Peak Load: AI training workloads are not steady; they cause rapid power spikes during operations such as backpropagation or model checkpointing.
Impact of Failure: If one PSU fails, the remaining units must absorb the load. If the system is running a high-intensity workload near the maximum capacity of the surviving PSUs, Over-Current Protection (OCP) may trip, causing an immediate system crash. In some configurations, the BMC preemptively throttles the GPU clocks to a lower power state to prevent a total shutdown, significantly slowing AI jobs.
The NCP-AII Context: The certification requires administrators to "Identify and troubleshoot hardware faults" and "Validate power parameters." Understanding that a PSU failure is not just about "backup" but about "load capacity" is a key distinction at the professional level.
Incorrect: A. The Linux OS will switch to 'Read-Only' mode: This is a distractor. A read-only file system is typically triggered by storage controller failures or disk corruption (to prevent further data loss). While a power surge could damage a disk, the operating system has no native mechanism to monitor PSU redundancy states and proactively lock the file system to "protect TPM keys."
B. The InfiniBand network will stop functioning: In NVIDIA-Certified systems, power is distributed through a Power Distribution Board (PDB) inside the chassis. This board balances the current from all active PSUs and distributes it to the motherboard, GPUs, and NICs. There is no one-to-one mapping tying specific InfiniBand cards to a specific PSU; as long as the system has enough total power to stay online, the network cards remain functional.
D. The GPUs will immediately lose all their training data: GPU VRAM is volatile, meaning it does require power to maintain state. However, losing one of two redundant PSUs does not cut power to the GPUs; the system continues to run on the remaining PSU. Data is lost only if the remaining PSU fails or trips, shutting down the entire server. The risk is the loss of the safety margin, not the immediate loss of data.
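The "load capacity, not backup" distinction above can be made concrete with a back-of-the-envelope headroom check. All wattages here are illustrative assumptions, not vendor specifications:

```shell
# Headroom check after one PSU failure in a two-PSU node.
psu_capacity_w=3000   # assumed rating of each PSU, in watts
psus_total=2          # the question's two-PSU configuration
psus_failed=1
peak_load_w=4500      # assumed peak draw of the GPU node under training load

remaining_w=$(( (psus_total - psus_failed) * psu_capacity_w ))
if [ "$remaining_w" -ge "$peak_load_w" ]; then
  echo "redundancy lost, but the surviving PSU can still carry peak load"
else
  echo "surviving capacity ${remaining_w} W < peak ${peak_load_w} W: expect OCP trip or throttling"
fi
```

With these example numbers the surviving 3000 W cannot cover a 4500 W peak, which is exactly the crash-or-throttle risk the correct answer describes.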
Question 33 of 60
33. Question
A data center engineer is performing the initial bring-up of an NVIDIA HGX H100 system. After configuring the Baseboard Management Controller (BMC) and the Out-of-Band (OOB) network, the engineer needs to ensure all firmware is aligned with the latest NVIDIA validated stack. When updating the firmware on an HGX baseboard, which specific component must be carefully synchronized with the GPU firmware to ensure proper NVLink Fabric performance and thermal management?
Correct
Correct: B The NVSwitch firmware.
The Technical Reason: An HGX H100 system consists of a "GPU Tray" where the eight GPUs and the NVSwitch ASICs reside.
NVLink Fabric Performance: The NVSwitch is the high-speed engine that enables all-to-all GPU communication. Its firmware contains the routing logic and link-training protocols. If the NVSwitch firmware is not synchronized with the GPU VBIOS, the NVLink lanes may fail to train at their maximum speed (900 GB/s for H100), or specific peer-to-peer (P2P) paths may be disabled.
Thermal Management: The HGX baseboard uses a complex thermal profile in which the power and heat of the NVSwitches are balanced against the GPUs. NVIDIA releases "Firmware Recipes" in which the NVSwitch and GPU versions are validated together, ensuring the cooling fans respond correctly to the combined thermal load of the entire tray.
The NCP-AII Context: The exam expects candidates to understand that the NVIDIA Fabric Manager (FM) and the nvfwupd tool treat the GPU tray as a single functional unit. Updating one without the other breaks the "Validated Stack" and can result in NVLink error counters incrementing or high-latency "traps" in the fabric.
Incorrect: A. The TPM 2.0 security module: While the Trusted Platform Module (TPM) is critical for the server's "Secure Boot" and "Measured Boot" processes, it is an independent security component. Its firmware does not affect the operational performance of the NVLink fabric or the thermal curves of the GPUs.
C. The storage controller BIOS: The storage controller (managing NVMe or SATA drives) is part of the "North-South" data path. While it must be functional for the OS to boot, it has no interaction with the internal GPU-to-GPU highway (NVLink). Updating it does not require synchronization with the GPU tray firmware.
D. The PCIe Retimer firmware: PCIe Retimers extend the signal integrity of the PCIe Gen5 lanes between the CPU and the GPUs. While important for the initial detection of the GPUs on the PCIe bus, they do not manage the NVLink fabric. Once the GPUs are "seen" by the OS, the NVSwitch takes over for high-speed inter-GPU communication.
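After a firmware update, the synchronization described above can be spot-checked from the host. A hedged sketch of the verification commands (requires the NVIDIA driver and Fabric Manager installed; exact output fields vary by driver version):

```
nvidia-smi -q | grep -iA1 'VBIOS'        # report each GPU's VBIOS version
systemctl status nvidia-fabricmanager    # Fabric Manager must be active on NVSwitch systems
nvidia-smi nvlink --status               # per-link state; all links should train at full rate
```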
Question 34 of 60
34. Question
When designing the network topology for a multi-rack AI factory, an architect must select appropriate transceivers for the East-West compute fabric. The design requires 400Gb/s InfiniBand connectivity between Leaf and Spine switches with a maximum distance of 50 meters. Which transceiver type and cabling combination provides the best balance of signal quality, power efficiency, and cost for this specific distance?
Correct
Correct: D Active Optical Cables (AOC) or Multimode Fiber (MMF) with 400G-SR4 transceivers, as they are optimized for high-bandwidth communication within the 30- to 100-meter range.
The Technical Reason: At 400Gb/s (NDR), signal attenuation over copper is extreme.
Transceiver Type: The 400G-SR4 (Short Reach, 4-channel) transceiver uses 850 nm VCSEL lasers to transmit data over multimode fiber. It is specifically designed for distances up to 50 meters (on OM4 fiber), or 100 meters in some optimized configurations.
Cabling: Active Optical Cables (AOCs) are essentially transceivers permanently attached to a fiber cable, providing a cost-effective, low-power solution for fixed distances such as 50 meters.
The NCP-AII Context: The exam validates your ability to "describe and validate cable types and transceivers." For a 50-meter span, which typically covers a "Row" or a "Zone" in an AI factory, multimode optics (SR4) provide the best balance of cost and power efficiency compared to expensive single-mode long-haul optics.
Incorrect: A. Standard Category 6 Ethernet cables using RJ45 connectors: Cat6/RJ45 cables are limited to 10Gb/s (or 1G/2.5G/5G) and are used for management or legacy office networks. They cannot support the high-frequency PAM4 signaling required for 400Gb/s InfiniBand; using them for a compute fabric would result in zero connectivity.
B. Passive Copper Direct Attach Cables (DAC): While passive DACs offer the lowest latency and zero power consumption, they are physically limited by signal degradation. For 400Gb/s NDR InfiniBand, passive copper is generally restricted to a maximum of 1.5 to 3 meters. A 50-meter passive copper run would suffer total signal loss.
C. Single-mode Fiber with 400G-DR4 transceivers: 400G-DR4 (Datacenter Reach, 4-channel) transceivers use single-mode fiber (SMF) and are designed for distances up to 500 meters. While they would technically work at 50 meters, they use more expensive silicon photonics and EML lasers. In a massive AI factory, choosing DR4 for a 50-meter run would significantly raise deployment cost and per-port power consumption without any performance benefit.
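The reach figures discussed above reduce to a simple distance-based decision rule. A minimal sketch, with thresholds taken from the rough reach limits in the explanation (the helper name and exact cutoffs are illustrative, not a standard):

```shell
# Pick a 400G NDR cabling option from the planned link distance in meters.
select_cable() {
  local meters=$1
  if   [ "$meters" -le 3 ];   then echo "passive copper DAC"          # ~1.5-3 m limit
  elif [ "$meters" -le 100 ]; then echo "AOC or SR4 over multimode"   # row/zone scale
  elif [ "$meters" -le 500 ]; then echo "DR4 over single-mode"        # datacenter reach
  else                             echo "long-reach single-mode optics"
  fi
}

select_cable 50   # the scenario's 50 m leaf-spine span
```

For the question's 50-meter span, the rule lands on the AOC/SR4 multimode tier, matching answer D.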
Incorrect
Correct: D Active Optical Cables (AOC) or Multimode Fiber (MMF) with 400G-SR4 transceivers as they are optimized for high-bandwidth communication within the 30 to 100-meter range.
Correct: D Active Optical Cables (AOC) or Multimode Fiber (MMF) with 400G-SR4 transceivers as they are optimized for high-bandwidth communication within the 30 to 100-meter range.
The Technical Reason: At 400Gb/s (NDR), the signal attenuation over copper is extreme.
Transceiver Type: The 400G-SR4 (Short Reach, 4-channel) transceiver uses 850nm VCSEL lasers to transmit data over Multimode Fiber. It is specifically designed for distances up to 50 meters (on OM4 fiber) or 100 meters in some optimized configurations.
Cabling: Active Optical Cables (AOCs) are essentially transceivers permanently attached to a fiber cable, providing a cost-effective, low-power solution for fixed distances like 50 meters.
The NCP-AII Context: The exam validates your ability to "describe and validate cable types and transceivers." For a 50-meter span—which typically covers a "Row" or a "Zone" in an AI factory—multimode optics (SR4) provide the best balance of cost and power efficiency compared to expensive single-mode long-haul optics.
Incorrect: A. Standard Category 6 Ethernet cables using RJ45 connectors Cat6/RJ45 cables are limited to 10Gb/s (or 1G/2.5G/5G) and are used for management or legacy office networks. They cannot support the high-frequency PAM4 signaling required for 400Gb/s InfiniBand. Using them for a compute fabric would result in zero connectivity.
B. Passive Copper Direct Attach Cables (DAC) While Passive DACs offer the lowest latency and zero power consumption, they are physically limited by signal degradation. For 400Gb/s NDR InfiniBand, passive copper is generally restricted to a maximum of 1.5 to 3 meters. Attempting to run a passive copper cable 50 meters would result in total signal loss.
C. Single-mode Fiber with 400G-DR4 transceivers 400G-DR4 (Datacenter Reach, 4-channel) transceivers use Single-mode Fiber (SMF) and are designed for distances up to 500 meters. While they would technically work at 50 meters, they use more expensive silicon photonics and EML lasers. In a massive AI factory, choosing DR4 for a 50-meter run would lead to significantly higher deployment costs and higher power consumption per port without any performance benefit.
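As a rough rule of thumb (not an official selector), the distance tiers described above can be sketched in Python; the thresholds are the approximate figures quoted in this explanation (passive DAC up to ~3 m, SR4/AOC multimode up to ~100 m, DR4 single-mode up to ~500 m):

```python
def pick_400g_medium(distance_m: float) -> str:
    """Suggest a 400G NDR cabling option for a given run length,
    using the approximate reach figures cited in the explanation."""
    if distance_m <= 3:
        return "Passive copper DAC"        # lowest latency, zero power
    if distance_m <= 100:
        return "AOC or MMF with 400G-SR4"  # best cost/power for in-row runs
    if distance_m <= 500:
        return "SMF with 400G-DR4"         # longer reach, higher cost per port
    return "Long-haul single-mode optics"  # beyond DR4 reach

# A 50-meter row/zone span lands in SR4/AOC territory:
print(pick_400g_medium(50))  # AOC or MMF with 400G-SR4
```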
Question 35 of 60
35. Question
When configuring Multi-Instance GPU (MIG) for a diverse set of workloads including Jupyter notebooks and small-scale inference, how does the administrator verify that the GPU instances are correctly partitioned and available for the NVIDIA Container Runtime?
Correct: C By running the command 'nvidia-smi -L' to list the active GPU instances and checking their UUIDs against the docker run environment variables.
The Technical Reason: When Multi-Instance GPU (MIG) is enabled and profiles are created, the physical GPU is logically divided into multiple independent instances.
Verification: The nvidia-smi -L command is the primary method to list all available GPUs and their respective MIG instances. Each MIG instance is assigned a unique UUID (e.g., MIG-GPU-52f017e8-…).
Container Integration: To use a specific MIG instance with the NVIDIA Container Runtime (Docker/Podman), you must pass the instance's UUID using the NVIDIA_VISIBLE_DEVICES environment variable (e.g., docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-GPU-…). Matching the UUID from nvidia-smi -L to your deployment script ensures the container is pinned to the correct hardware partition.
The NCP-AII Context: The exam validates your ability to "Enable and manage MIG" and "Demonstrate how to use NVIDIA GPUs with Docker." Using nvidia-smi -L is the standardized way to confirm that the partitioning has successfully taken place at the driver level.
Incorrect: A. By using the 'nvc-check' utility There is no standard NVIDIA utility called nvc-check for verifying hardware slicing. While the silicon is indeed physically isolated (memory and cache), the verification is handled through the standard NVIDIA System Management Interface (nvidia-smi) or the NVML library, not a fictional check tool.
B. By rebooting and observing the BIOS splash screen MIG is a dynamic configuration managed by the NVIDIA Driver and the GPU Manager within the Operating System. It is not a BIOS-level feature. While some servers require a reboot to change the "GPU Mode" (from Compute to MIG-enabled), the individual instances themselves are created and identified after the OS has loaded and the driver is initialized.
D. By checking the /proc/cpuinfo file The /proc/cpuinfo file is a virtual file in Linux that provides information about the Host CPU (e.g., Intel Xeon or AMD EPYC cores). It does not contain any information regarding GPU partitioning, VRAM allocation, or MIG instances. GPU-specific information is typically found under /proc/driver/nvidia/ or via the nvidia-smi tool.
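A minimal sketch of the verification step: parsing sample `nvidia-smi -L` output for MIG UUIDs. The GPU UUID below is hypothetical, and the exact listing format can vary between driver versions:

```python
import re

# Example output of `nvidia-smi -L` on a MIG-enabled GPU (hypothetical
# UUID; the layout mirrors the style described above).
SAMPLE = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-52f017e8-ab12-cd34-ef56-0123456789ab)
  MIG 3g.20gb     Device  0: (UUID: MIG-GPU-52f017e8-ab12-cd34-ef56-0123456789ab/1/0)
  MIG 2g.10gb     Device  1: (UUID: MIG-GPU-52f017e8-ab12-cd34-ef56-0123456789ab/5/0)
"""

def mig_uuids(listing: str) -> list[str]:
    """Pull the MIG instance UUIDs out of `nvidia-smi -L` style output."""
    return re.findall(r"UUID:\s+(MIG-\S+)\)", listing)

uuids = mig_uuids(SAMPLE)
# Each UUID can then be handed to the container runtime, e.g.:
#   docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=<uuid> ...
```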
Question 36 of 60
36. Question
During a performance audit of an AI factory, it is discovered that the InfiniBand fabric is experiencing high levels of congestion discard packets. Which optimization strategy should the network administrator apply at the switch level to resolve this and improve cluster-wide performance?
Correct: C Enable Adaptive Routing and configure Congestion Control (CC) parameters on the InfiniBand switches and DPUs across the fabric.
The Technical Reason: InfiniBand uses a credit-based flow control mechanism. When a specific link becomes saturated (a "hot spot"), it can cause "backpressure" that leads to congestion throughout the fabric, resulting in congestion discard packets.
Adaptive Routing (AR): This switch-level feature dynamically routes packets around congested links. Instead of following a static path, the switch hardware identifies the least loaded port to move traffic, balancing the load across all available spine links.
Congestion Control (CC): This is a hardware-based mechanism where switches detect congestion and send a Congestion Notification Packet (CNP) back to the source HCA (Host Channel Adapter) or DPU. The source then throttles its injection rate, preventing the fabric from becoming overwhelmed and eliminating packet discards.
The NCP-AII Context: The exam expects you to know that the NVIDIA Unified Fabric Manager (UFM) is typically used to enable and monitor these features. Adaptive Routing and CC are the "Gold Standard" for optimizing the East-West fabric in an AI Factory.
Incorrect: A. Reduce the MTU size to 1500 bytes In an AI infrastructure, reducing the MTU (Maximum Transmission Unit) is counterproductive. AI workloads involve massive data transfers; a larger MTU (typically 4096 bytes for InfiniBand) is preferred to reduce packet header overhead and CPU interruptions. Reducing it to 1500 (the standard Ethernet MTU) would significantly increase congestion by requiring more packets to move the same amount of data.
B. Physically disconnect half of the compute nodes While this would technically reduce traffic, it is not an optimization strategy; it destroys the cluster's compute capacity. The goal of an AI Factory is to maximize utilization, not to limit it by removing hardware. Proper fabric design (like a Fat Tree or Rail-Optimized topology) should handle full-bandwidth traffic from all nodes if optimized correctly.
D. Disable the Subnet Manager on all switches The Subnet Manager (SM) is the "brain" of the InfiniBand network. It is responsible for discovering the topology and assigning LIDs (Local Identifiers). Disabling the SM on all switches would cause the entire fabric to stop functioning immediately, as no packets can be routed without a valid forwarding table. While the SM can be a bottleneck in extremely large fabrics, the solution is to optimize its sweep interval or use High Availability (HA) SMs, not to disable it.
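To illustrate the throttling idea, here is a deliberately simplified toy model of CNP-driven rate control; real InfiniBand congestion control uses hardware-tuned parameters, not this arithmetic:

```python
def throttle(rate_gbps: float, cnp_received: bool,
             cut: float = 0.5, recover_gbps: float = 10.0,
             line_rate: float = 400.0) -> float:
    """Toy model of HCA injection-rate control: cut the rate when a
    Congestion Notification Packet (CNP) arrives, otherwise recover
    additively back toward line rate. Purely illustrative."""
    if cnp_received:
        return rate_gbps * cut
    return min(line_rate, rate_gbps + recover_gbps)

rate = 400.0
rate = throttle(rate, cnp_received=True)   # hot spot reported -> 200.0
rate = throttle(rate, cnp_received=False)  # congestion clearing -> 210.0
```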
Question 37 of 60
37. Question
A storage optimization task requires reducing the I/O wait times for a large-scale training job. Which of the following strategies would provide the most significant performance improvement for an NVIDIA-certified AI infrastructure with shared storage?
Correct: C Implementing NVIDIA GPUDirect Storage (GDS) to enable a direct DMA path between the network interface card and the GPU memory, bypassing the CPU.
The Technical Reason: In traditional storage I/O, data must travel from the storage/network card into a CPU system memory buffer (bounce buffer) before being copied again into GPU memory. This creates a "CPU Bottleneck," increasing latency and consuming CPU cycles that should be used for data preprocessing.
The GDS Advantage: GPUDirect Storage (GDS) creates a direct Remote Direct Memory Access (RDMA) path. Data moves directly from the NVMe storage (or the network card in the case of distributed storage) to the GPU VRAM.
Performance Impact: This significantly reduces I/O wait times, lowers CPU utilization, and enables the multi-terabyte datasets used in LLM training to saturate the high-speed 400Gb/s (NDR) fabric.
The NCP-AII Context: The exam validates your understanding of the Magnum IO stack. GDS is the primary optimization for "North-South" traffic in an NVIDIA-certified architecture.
Incorrect: A. Moving datasets to the /tmp directory of the head node This is a catastrophic performance choice for two reasons:
Network Bottleneck: The OOB (Out-of-Band) management network is typically 1Gb/s or 10Gb/s Ethernet. Attempting to stream large-scale training data over this network would be thousands of times slower than the compute fabric.
Single Point of Failure: Sharing a single /tmp directory from one node to many creates a massive "hot spot" and contention that would stall all compute nodes.
B. Reducing MIG instances to allow more PCIe bandwidth to SATA drives This option is technically flawed:
SATA Limitations: SATA-based boot drives are limited to ~600MB/s, which is negligible compared to the PCIe Gen5 bandwidth available on an H100 system.
Resource Waste: Reducing MIG instances (GPU partitioning) actually lowers your compute capacity. PCIe bandwidth is managed by the hardware switches/retimers; reducing the number of GPU "slices" does not "give" more speed to a slow SATA drive.
D. Upgrading BMC firmware for faster storage health polling The BMC (Baseboard Management Controller) is part of the management plane. While it monitors the health of drives (temperature, failure status), it is not involved in the data path of a training job. Faster polling of health metrics provides better monitoring, but it has zero impact on the actual I/O wait times or the speed of data transfer to the GPUs.
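A back-of-the-envelope model (with made-up bandwidth figures) of why removing the bounce-buffer copy shortens the data path:

```python
def transfer_time_s(bytes_moved: int, *hop_bw_gb_per_s: float) -> float:
    """Total time for data to traverse a chain of copy hops, each with
    an effective bandwidth in GB/s. Every extra copy adds its own
    transit time; this is a deliberately simple, illustrative model."""
    return sum(bytes_moved / (bw * 1e9) for bw in hop_bw_gb_per_s)

data = 100 * 10**9  # a hypothetical 100 GB dataset shard

# Traditional path: NIC -> CPU bounce buffer -> GPU memory (two copies)
bounce = transfer_time_s(data, 50, 50)  # 4.0 s
# GPUDirect Storage: NIC -> GPU memory via direct DMA (one copy)
direct = transfer_time_s(data, 50)      # 2.0 s
```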
Question 38 of 60
38. Question
A network administrator needs to optimize an NVIDIA BlueField network platform for a high-performance AI cluster. The goal is to offload networking and security tasks from the host CPU to the DPU. Which action is required to ensure the BlueField DPU is correctly configured to operate in DPU mode rather than as a standard Network Interface Card?
Correct: B Use the mlxconfig tool to set the INTERNAL_CPU_MODEL parameter and ensure the DPU is running its own Linux-based operating system on the internal Arm cores.
The Technical Reason: A BlueField DPU can operate as either a standard NIC (where the host controls everything) or in DPU Mode (also known as ECPF mode). In DPU Mode, the internal Arm subsystem owns the network resources.
Configuration Tool: The mlxconfig (part of NVIDIA Firmware Tools/MFT) is the authoritative tool for changing non-volatile hardware configurations.
Key Parameter: The parameter INTERNAL_CPU_MODEL=1 (specifically EMBEDDED_CPU) and ensuring the INTERNAL_CPU_OFFLOAD_ENGINE is enabled are what allow the DPU to boot its own Linux OS (e.g., Ubuntu or BFB image) on its Arm cores. This enables the “offload“ capabilities where networking and security logic are decoupled from the host CPU.
The NCP-AII Context: The certification validates your ability to "Configure and manage a BlueField network platform." Recognizing the difference between INTERNAL_CPU_MODEL settings is critical to moving a BlueField from a "dumb" adapter to a "smart" programmable processor.
Incorrect: A. Configure the host BIOS to ignore the DPU as a boot device While you may choose not to boot the host from the DPU, this does not change the internal operation mode of the DPU itself. Furthermore, a DPU does not "automatically download" firmware from the NGC CLI; firmware is flashed via the host (using mlxfwmanager or nvfwupd) or via the DPU-BMC.
C. Enable ‘MIG‘ mode on the BlueField card using nvidia-smi Multi-Instance GPU (MIG) is a technology exclusive to NVIDIA GPUs (like H100 or A100). It allows a physical GPU to be partitioned into multiple instances. BlueField DPUs are networking processors and do not support MIG. Partitioning network resources on a DPU is typically handled via SR-IOV (Single Root I/O Virtualization) or VirtIO, not nvidia-smi.
D. Disable the PCIe connection between the host and the DPU The DPU requires the PCIe connection to receive power and to provide the host with its network interfaces. Disabling the PCIe link would make the DPU invisible to the host, effectively "bricking" the node's connectivity. DPU mode relies on the PCIe bus to communicate with the host while the Arm cores manage the high-speed network traffic.
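A small sketch of how an admin script might check the mode from `mlxconfig` query output; the sample text below only approximates the real tool's layout, which varies by MFT version:

```python
import re

# Sample output in the style of `mlxconfig -d <device> q` (illustrative
# only; the exact formatting differs between MFT releases).
SAMPLE_QUERY = """\
Configurations:                              Next Boot
         INTERNAL_CPU_MODEL                  EMBEDDED_CPU(1)
"""

def in_dpu_mode(query_output: str) -> bool:
    """True if INTERNAL_CPU_MODEL reports EMBEDDED_CPU, i.e. the Arm
    subsystem owns the NIC resources (DPU mode) rather than the host."""
    m = re.search(r"INTERNAL_CPU_MODEL\s+(\S+)", query_output)
    return bool(m) and m.group(1).startswith("EMBEDDED_CPU")
```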
Question 39 of 60
39. Question
An AI cluster consisting of both Intel-based and AMD-based servers is showing inconsistent performance across identical H100 GPUs. The administrator wants to optimize the AMD-based nodes specifically. Which optimization technique is most relevant to ensuring the best performance for NVIDIA GPUs on AMD EPYC platforms?
Correct: B Enable 'NPS' (NUMA nodes per socket) in the BIOS and ensure that the GPU and its corresponding HCA are attached to the same NUMA domain.
The Technical Reason: AMD EPYC processors (like the 9004 series) use a chiplet-based architecture where multiple Core Complex Dies (CCDs) connect to a central I/O Die (IOD).
NPS (Nodes Per Socket): The NPS setting (often NPS1, NPS2, or NPS4) determines how memory and PCIe resources are partitioned into NUMA domains. Setting this correctly allows the system to expose the physical proximity of the CPU cores to specific PCIe lanes.
Affinity: For peak performance in an AI Factory, the GPU and the HCA (Host Channel Adapter/InfiniBand card) must reside in the same NUMA domain. This ensures that data traveling from the network to the GPU (or vice versa) stays within the same local I/O die, avoiding the "hop" across the Infinity Fabric, which adds latency and reduces effective bandwidth.
The NCP-AII Context: The certification emphasizes "Topology Awareness." Administrators must use tools like nvidia-smi topo -m to verify that the GPU-to-NIC relationship is reported as NODE (local) rather than SYS (remote/across a CPU bridge).
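To make the topology check concrete, the sketch below parses a simulated `nvidia-smi topo -m` matrix (the sample values are invented) and flags any GPU whose path to the NIC is reported as SYS rather than NODE:

```shell
# Simulated `nvidia-smi topo -m` output; a real run prints a similar matrix.
# GPU0 reaches NIC0 within its NUMA node (NODE); GPU1 crosses the CPU (SYS).
topo='      GPU0  GPU1  NIC0
GPU0  X     NV18  NODE
GPU1  NV18  X     SYS'

# Column 4 holds the NIC0 relationship in this sample; flag SYS entries.
echo "$topo" | awk 'NR > 1 && $4 == "SYS" { print $1 ": NIC path crosses the CPU bridge" }'
```

On a correctly tuned EPYC node, no GPU row should be flagged.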
Incorrect: A. Increase the size of the Linux swap partition: Swap space is a portion of the hard drive used when physical system RAM is exhausted. AI workloads rely on high-speed HBM3 (VRAM) and DDR5 (system RAM). If a training job spills into the swap partition (even on an NVMe drive), performance will drop by several orders of magnitude. Increasing swap does not optimize GPU performance; it merely provides a "safety net" against Out-of-Memory (OOM) crashes, albeit at the cost of unusable performance levels.
C. Configure the GPU to use 'Asynchronous Copy' mode: This is a distractor. While CUDA supports asynchronous memory copies as a programming feature (via cudaMemcpyAsync), there is no toggle in nvidia-smi called "Asynchronous Copy mode" that specifically optimizes for AMD PCIe controllers. NVIDIA drivers handle PCIe transactions identically across both Intel and AMD, provided the underlying BIOS settings (like Above 4G Decoding and Resizable BAR) are enabled.
D. Disable 'Turbo Boost' feature on the AMD CPUs: Disabling Turbo Boost (or Core Performance Boost in AMD terms) limits the CPU's ability to reach its peak frequency. AI workloads, especially data preprocessing and kernel launching, depend on high single-core clock speeds to feed the GPUs. Lowering the CPU's thermal output does not "give" that headroom to the GPU; the GPU's thermal management is independent. Disabling this feature would likely decrease overall performance by creating a CPU bottleneck.
Question 40 of 60
40. Question
A system administrator is performing the initial bring-up of an NVIDIA HGX system. After verifying the physical installation and power delivery, the engineer needs to ensure firmware consistency across the baseboard management controller (BMC) and the complex GPU baseboard. Which sequence represents the most reliable method for ensuring the hardware is validated and firmware is consistent across all components before OS deployment?
Correct: C Access the BMC via the OOB network to verify power and cooling health, update the BMC and BIOS firmware, then use HGX firmware update tools to synchronize the GPU baseboard and NVSwitch components.
The Technical Reason: An HGX system is a complex "baseboard assembly" where multiple components must speak the same language.
Phase 1 (Health Check): Before flashing sensitive firmware, the BMC (Baseboard Management Controller) must confirm that the power supplies are redundant and the fans are operational. A thermal event or power loss during a flash can "brick" the board.
Phase 2 (Management Plane): The BMC and BIOS/UEFI form the foundation. They must be updated first because they often contain the updated Redfish protocols or PCIe initialization code required to "see" and communicate with the high-speed GPU components.
Phase 3 (Data Plane): Finally, the HGX firmware update tools (like nvfwupd) are used to push a "validated recipe" to the NVSwitches and GPUs. This ensures that the inter-GPU fabric (NVLink) and the GPUs themselves are synchronized for thermal and power management.
The NCP-AII Context: The exam validates the "Validated Stack" philosophy. You do not update components in isolation; you apply a specific firmware version "recipe" provided by NVIDIA for that specific HGX generation (e.g., H100 or A100).
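In command form, the three phases might look like the sketch below. Everything here is illustrative: the BMC hostname, credentials, and bundle file name are placeholders, and exact nvfwupd flags vary by version, so check the tool's help and NVIDIA's HGX firmware update guide before running anything.

```shell
# Phase 1: health check over the OOB network before touching any firmware.
ipmitool -I lanplus -H hgx-bmc.example -U admin -P '<password>' sdr elist
curl -sk -u admin:'<password>' https://hgx-bmc.example/redfish/v1/Chassis/

# Phase 2: update the management plane first (BMC, then BIOS/UEFI),
# using the OEM's Redfish UpdateService or vendor utility.

# Phase 3: push the validated HGX bundle to the GPU baseboard/NVSwitches,
# e.g. (flags indicative only):
#   nvfwupd -t ip=hgx-bmc.example user=admin password='<password>' \
#       update_fw -p nvfw_HGX-bundle.fwpkg
```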
Incorrect Options: A. Perform a cold boot and immediately run an HPL test: Running a high-stress benchmark like HPL (High-Performance Linpack) on unvalidated or mismatched firmware is dangerous. Mismatched firmware can lead to incorrect thermal throttling behaviors, potentially causing hardware damage or inconsistent results. Stress testing is a final validation step, not an initial bring-up step.
B. Manually install the NVIDIA Container Toolkit and run a Docker container: The NVIDIA Container Toolkit is a high-level software utility. It relies on the NVIDIA driver being installed in the OS. If the underlying firmware (BMC/BIOS/HGX) is inconsistent, the driver may fail to load, or the GPUs may not be detected at all. You cannot use a "container" to validate low-level hardware firmware consistency.
D. Boot into Linux and use NVIDIA SMI to flash the VBIOS first: nvidia-smi is primarily a monitoring and management tool for the GPU within the OS. Flashing the VBIOS before verifying the BMC and BIOS status ignores the management hierarchy. If the BIOS or PCIe retimers are running old firmware, the nvidia-smi tool might not even have a stable path to communicate with the GPU for a safe flash.
Question 41 of 60
41. Question
In a cluster utilizing BlueField-3 Data Processing Units (DPUs), the network team wants to offload the OVS (Open vSwitch) data path to the DPU hardware to maximize host CPU availability. Which NVIDIA platform must be configured to manage these DPU resources, and what is the primary benefit for AI workloads?
Correct: D The DOCA (Data-Center-on-a-Chip Architecture) framework; it offloads network and storage tasks from the host CPU to the DPU.
The Technical Reason: To achieve hardware acceleration for services like Open vSwitch (OVS), NVIDIA provides the DOCA software framework. DOCA is an "SDK and runtime" that abstracts the complex programming of the DPU's internal Arm cores and hardware accelerators (like ASAP² for switching).
Primary Benefit for AI: In a standard setup, the host CPU spends significant cycles managing network traffic, packet switching (OVS), and storage protocols. By offloading these to the DPU via DOCA, those CPU cycles are "reclaimed" for the host. For AI workloads, this means the CPU can focus entirely on data preprocessing, augmentation, and feeding the GPUs, while the DPU handles the "East-West" and "North-South" data movement at wire speed.
The NCP-AII Context: The certification validates your ability to install and verify the DOCA driver and runtime environment. The exam expects you to recognize DOCA as the unified management layer that allows the DPU to function as a programmable infrastructure processor.
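As an illustration, once the DPU is running its DOCA BFB image, OVS hardware offload can be confirmed (and enabled if it was off) from the DPU's Arm OS. The service name below assumes an Ubuntu-based BFB image.

```shell
# Check whether the OVS datapath is configured to offload flows to hardware.
ovs-vsctl get Open_vSwitch . other_config:hw-offload

# Enable offload and restart the switch daemon if it was disabled.
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch-switch
```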
Incorrect Options: A. The Base Command Manager; it replaces the DPU firmware with a standard Linux kernel: Base Command Manager (BCM) is a cluster management tool (formerly Bright Cluster Manager) used to provision and monitor the entire cluster. While it can manage DPU nodes, it does not "replace" DPU firmware with a standard kernel to run CUDA. DPUs run a specialized DOCA BFB (BlueField Bundle) OS on their Arm cores, and they are designed to offload infrastructure tasks, not to execute standard CUDA kernels (which are meant for GPUs).
B. The NVIDIA Container Toolkit; it maps DPU memory directly to GPU memory: The NVIDIA Container Toolkit is used to expose GPUs to containers. While technologies like GPUDirect RDMA allow DPUs and GPUs to share data directly, this is enabled by the DOCA and CUDA drivers working together, not by the Container Toolkit itself. Furthermore, DPU-to-GPU mapping occurs over the PCIe bus, whereas NVLink is strictly for GPU-to-GPU or GPU-to-CPU (Grace) communication.
C. The NGC CLI; it provides an interface to virtualize the DPU as a GPU: The NGC (NVIDIA GPU Cloud) CLI is a tool for downloading containers, models, and scripts from the NVIDIA registry. It has no hardware virtualization capabilities. A DPU cannot be "virtualized as a GPU"; the two are fundamentally different processors with different instruction sets (Arm/networking vs. parallel SIMT).
Question 42 of 60
42. Question
When performing a NeMo burn-in test on a large-scale cluster intended for Large Language Model (LLM) training, what is the engineer specifically trying to validate regarding the overall system health?
Correct: A The ability of the cluster to maintain sustained throughput during a real-world training workload.
The Technical Reason: Unlike synthetic benchmarks (like HPL or NCCL-tests) that isolate specific hardware components, a NeMo burn-in simulates a real Large Language Model (LLM) training job. It stresses the "full stack":
Compute: Massive GPU utilization across all nodes.
Fabric: Continuous East-West traffic (all-reduce operations) via InfiniBand/RoCE.
Storage: Frequent "checkpointing" (writing model weights to the parallel file system).
Validation Goal: It identifies "soft failures" (such as a single GPU that throttles only after 4 hours of heat, or a network switch that drops packets only under 90% load) that shorter tests might miss.
The NCP-AII Context: The certification emphasizes that "validation" is not just about speed, but about stability. A cluster that is fast but crashes every 30 minutes due to thermal or fabric issues is not "NVIDIA-Certified" for production.
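"Sustained throughput" can be checked mechanically from the burn-in logs: sample tokens/sec at intervals and flag any reading that dips well below a baseline. The sketch below uses invented sample values and a simple first-reading baseline; a real harness would baseline on the median of a warm-up window.

```shell
# Invented per-interval throughput readings (tokens/sec) from a burn-in log.
samples='9800
9750
9820
7200
9790'

# Flag readings more than 10% below the first (baseline) reading.
echo "$samples" | awk 'NR == 1 { base = $1 } $1 < base * 0.9 { print "dip at interval " NR ": " $1 " tok/s" }'
```

A dip like interval 4 above is the kind of "soft failure" signature (thermal throttling, fabric congestion) a burn-in is meant to surface.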
Incorrect Options: B. Compatibility with legacy 32-bit Windows applications: Modern AI infrastructure is built almost exclusively on 64-bit Linux (Ubuntu or RHEL/SLES). NVIDIA-Certified systems for LLM training do not support legacy 32-bit Windows applications, nor is that a goal of AI infrastructure deployment.
C. The speed at which the BIOS can perform a POST: While a fast Power-On Self-Test (POST) is convenient for maintenance, it has no impact on the performance of a training job that may run for weeks. The NeMo burn-in is an application-layer test, whereas the POST is a hardware-initialization phase that happens long before the OS or NeMo even loads.
D. The resolution of the monitor connected to the head node's VGA port: AI clusters are typically "headless" (managed remotely via SSH and the BMC/OOB network). The VGA resolution of a head node is irrelevant to the training throughput of thousands of GPUs connected via high-speed InfiniBand fabrics.
Question 43 of 60
43. Question
A system administrator is configuring the software stack on a group of compute nodes. They need to ensure that researchers can run AI workloads using Docker containers while having full access to the NVIDIA GPU hardware. Which sequence of software installations and configurations is required on the host nodes to enable this capability?
Correct: D Install the NVIDIA GPU driver, then install the Docker engine, and finally install the NVIDIA Container Toolkit to allow Docker to interface with the GPU via the --gpus flag.
The Technical Reason: The installation must follow a logical dependency chain:
NVIDIA GPU Driver: This is the foundational kernel-level software. Without the driver, the OS cannot communicate with the hardware, and tools like nvidia-smi will not function.
Docker Engine: The container runtime must be present before it can be configured to use specialized plugins.
NVIDIA Container Toolkit: This toolkit (formerly nvidia-docker2) provides the nvidia-container-runtime. During its configuration phase (e.g., using nvidia-ctk runtime configure), it modifies the Docker daemon.json to register the NVIDIA runtime. This allows the --gpus flag to successfully map host GPU resources into the container.
The NCP-AII Context: The exam blueprint explicitly requires candidates to "Install/update/remove NVIDIA GPU drivers," "Install the NVIDIA container toolkit," and "Demonstrate how to use NVIDIA GPUs with Docker." Following this specific sequence ensures a stable, validated environment.
Incorrect Options: A. Install the NVIDIA Container Toolkit first The NVIDIA Container Toolkit is not an installer for the entire stack. It is a set of libraries and a runtime wrapper. It assumes that the GPU driver is already active on the host and that a container engine (like Docker or Containerd) is already installed and ready to be configured. It will not “automatically download“ the base drivers or the Docker engine.
B. Install Slurm and the Pyxis plugin to replace Docker While Slurm and the Pyxis/Enroot combination are common in HPC clusters for managing containers, they do not “replace the need for a container runtime.“ In fact, Pyxis and Enroot are essentially a specialized runtime and plugin that allow Slurm to execute container images (often pulled from Docker registries) with GPU support. They are an alternative to Docker in large-scale multi-node clusters, but the question specifically asks about enabling Docker workloads.
C. Install CUDA and manually copy libcuda.so This is a “legacy“ approach that is highly discouraged in modern AI infrastructure. Manually copying driver libraries into a container breaks portability and makes maintenance impossible, as the libraries inside the container must match the driver version on the host. The NVIDIA Container Toolkit handles this dynamically by “mounting“ the necessary driver libraries from the host into the container at runtime.
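The dependency chain above maps to a short provisioning sketch. This is illustrative only, not a validated installer: the driver package version, the Docker package choice, and the CUDA image tag are assumptions, so take the actual repository setup from the current NVIDIA Container Toolkit and Docker documentation.

```shell
# 1. GPU driver first: the kernel-level foundation everything else depends on.
#    (Package name/version below is an example for Ubuntu.)
sudo apt-get install -y nvidia-driver-550
nvidia-smi                                   # should list the GPUs if the driver loaded

# 2. Container engine second.
sudo apt-get install -y docker.io            # or docker-ce from Docker's own repo

# 3. NVIDIA Container Toolkit last; then register the NVIDIA runtime with Docker.
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker   # edits /etc/docker/daemon.json
sudo systemctl restart docker

# Validate: the --gpus flag should now expose the host GPUs inside a container.
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

Running the steps out of order fails in predictable ways: nvidia-ctk has no daemon.json to configure without Docker, and the final docker run reports no devices without the host driver.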
Question 44 of 60
44. Question
An infrastructure engineer is validating the cabling for a large-scale AI cluster using InfiniBand NDR transceivers and Twinax copper cables. During the signal quality verification phase, several links report high Bit Error Rates (BER). Which action is the most appropriate according to NVIDIA validation standards to ensure physical layer stability before proceeding to software installation?
Correct
Correct: C Replace the copper cables with Active Optical Cables (AOC) if the distance exceeds 3 meters or check for exceeded bend radius on existing cables.
The Technical Reason: NVIDIA NDR (400Gb/s) InfiniBand uses 100G-PAM4 signaling per lane. This high-frequency signaling is extremely sensitive to physical degradation.
Distance Limits: Passive Direct Attach Copper (DAC) cables for NDR generally have a maximum reliable reach of 2 to 3 meters. Beyond this, the signal attenuation is too high to maintain an acceptable Bit Error Rate (BER).
Bend Radius: Exceeding the specified bend radius (typically 10x to 15x the cable diameter) causes physical deformation of the internal copper pairs, leading to impedance mismatches and high BER.
AOC/Transceivers: For distances greater than 3 meters, NVIDIA standards dictate moving to Active Copper Cables (ACC) or Active Optical Cables (AOC)/Transceivers to ensure signal integrity through active amplification or optical conversion.
The NCP-AII Context: The exam expects you to know the physical limitations of the "Validated Stack." If a link is up but reporting high BER via ibdiagnet, the first step is always physical inspection and ensuring the media type matches the distance requirement.
Incorrect Options: A. Manually force the port speed to a lower generation While downshifting from NDR (400G) to HDR (200G) might reduce errors, it is not an "optimization" or a "fix"; it is a performance degradation. In an NVIDIA-certified infrastructure, the goal is to run at the rated speed of the hardware. If the hardware cannot maintain signal integrity at its rated speed, the physical medium (the cable) is at fault and must be replaced, not the configuration throttled.
B. Use the NGC CLI to reset GPU firmware The NGC CLI is a software tool used to manage containers and models from the NVIDIA GPU Cloud. It has no capability to recalibrate the SerDes (Serializer/Deserializer) of the network controllers or reset firmware on the HCA. Firmware updates are handled via mlxfwmanager or nvfwupd, but firmware is rarely the cause of high BER compared to physical cabling issues.
D. Ignore the BER if the link state is Up This is a critical error in AI clusters. While the Subnet Manager (SM) handles routing, it does not "correct" packet errors; the hardware's Forward Error Correction (FEC) does. However, if the BER is high, the FEC will eventually be overwhelmed, leading to dropped packets and retransmissions. In high-performance collective operations (like NCCL All-Reduce), a single link with high BER can cause the entire training job to stall or time out.
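The distance rule above can be captured in a trivial helper for a cabling audit. The 3 m passive-DAC threshold is taken from the explanation; treat it as an assumption and confirm the reach against the datasheet for the specific cable SKU.

```shell
# Pick a media type for an NDR link given its run length in metres.
# The 3 m passive-DAC cutoff is an assumption from the text above.
pick_ndr_media() {
  local metres="$1"
  if [ "$metres" -le 3 ]; then
    echo "passive DAC"
  else
    echo "ACC or AOC/transceiver"
  fi
}

pick_ndr_media 2    # passive DAC
pick_ndr_media 10   # ACC or AOC/transceiver
```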
Question 45 of 60
45. Question
When setting up a MIG (Multi-Instance GPU) configuration for a multi-tenant AI environment, an administrator needs to ensure that memory and cache isolation are strictly enforced between different users. Which MIG profile characteristic ensures that compute resources are dedicated and not shared with other instances on the same physical GPU?
Correct
Correct: B The selection of a specific 'Slice' (e.g., 1g.10gb), which provides hardware-level isolation of the memory controller and SMs.
The Technical Reason: MIG stands for "Multi-Instance GPU," and its core value proposition is hardware-level partitioning.
Architecture: When you create a MIG instance (like 1g.10gb), the GPU hardware physically assigns specific Streaming Multiprocessors (SMs), L2 cache banks, and memory controllers to that instance.
Isolation: This ensures that one user's workload cannot "evict" data from another user's L2 cache or saturate the memory bandwidth of another instance. This provides a guaranteed Quality of Service (QoS) and prevents "noisy neighbor" effects.
The NCP-AII Context: The exam validates your knowledge of the GPU Instance (GI) and Compute Instance (CI) hierarchy. A GI is the fundamental unit of isolation that includes these dedicated hardware paths. The 1g.10gb notation represents 1 compute slice and 10GB of isolated memory.
Incorrect Options: A. The use of 'Shared' profiles In the official NVIDIA MIG terminology, there is no such thing as a "Shared" profile that allows instances to "burst" into each other's L2 cache. The very purpose of MIG is to prevent this kind of sharing. If users need to share cache or burst resources, they would use traditional CUDA streams or Time-Slicing, not MIG.
C. Enabling the 'Overcommit' flag in the driver GPU memory is a finite physical resource and cannot be overcommitted in the same way CPU memory can. While technologies like "Unified Memory" allow for oversubscription by swapping to system RAM, this is not a feature of MIG. MIG instances have a "hard" memory limit; once an instance hits its 10GB or 20GB limit, it will trigger an Out of Memory (OOM) error to protect the stability of other instances.
D. Configuring the GPU in 'Time-Slice' mode Time-Slicing is the "legacy" or alternative method of GPU sharing. It uses a software scheduler to rotate tasks on the GPU.
The Flaw: In Time-Slice mode, while tasks are rotated every few milliseconds, they all share the same memory and L2 cache while they are active. There is zero memory or fault isolation. If one user's process crashes or leaks memory in Time-Slice mode, it can impact everyone else on the card. MIG was specifically designed to solve these exact weaknesses.
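The GI/CI workflow above corresponds to a short nvidia-smi sequence. This is an illustrative sketch, not a full procedure: it requires a MIG-capable GPU (A100/H100 class), root privileges, and may need workloads drained and a GPU reset before MIG mode takes effect.

```shell
# Enable MIG mode on GPU 0 (persists across reboots; may require a GPU reset).
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles the hardware supports (slice/memory combinations).
sudo nvidia-smi mig -lgip

# Create a 1g.10gb GPU instance and its default compute instance (-C).
sudo nvidia-smi mig -cgi 1g.10gb -C

# Verify: each MIG device is listed with its own UUID, which can be handed
# to a tenant's container or scheduler as an isolated device.
nvidia-smi -L
```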
Question 46 of 60
46. Question
The final stage of cluster verification involves running a NeMo burn-in test. This test is designed to stress the GPUs, the inter-node fabric, and the storage system simultaneously. If the test fails with a Connection Timed Out error specifically during the checkpoint saving phase, which component should be the primary focus for troubleshooting?
Correct
Correct: D The storage fabric and the parallel file system configuration, as checkpointing is an I/O intensive operation that tests the storage throughput and latency.
The Technical Reason: Training Large Language Models (LLMs) requires periodic "checkpoints," where the entire state of the model (often hundreds of gigabytes or terabytes) is written from GPU memory to a parallel file system (like Lustre, Weka, or IBM Storage Scale).
The Connection Timeout: If the storage fabric (the network paths between compute nodes and the storage array) is saturated or misconfigured, the write operation will hang. Since the NeMo framework expects an acknowledgment within a specific window, a slow storage response triggers a Connection Timed Out error.
Checkpointing Stress: Unlike the training loop (which is GPU-bound), the checkpointing phase is the only time the storage fabric is pushed to its absolute limit.
The NCP-AII Context: The certification teaches that a balanced cluster requires "North-South" (Storage) performance to match "East-West" (GPU Interconnect) performance. If storage cannot keep up with the frequency of checkpoints, the entire training job becomes unstable.
Incorrect Options: A. The VBIOS of the GPUs The VBIOS is low-level firmware that manages power, clocks, and thermal profiles of the individual GPU hardware. It has no role in network handshaking, file synchronization, or managing data flow across spine switches. Network protocols are handled by the OS kernel, the NIC/DPU drivers (DOCA), and the communication libraries (NCCL).
B. The cooling fans in the server rack While vibrations can theoretically affect traditional mechanical Hard Disk Drives (HDDs), modern AI infrastructure exclusively uses NVMe SSDs. SSDs have no moving parts and are entirely immune to acoustic vibrations or fan noise. A fan failure would cause a "Thermal Throttling" event (as discussed in previous questions), not a storage timeout during a specific I/O phase.
C. The IPMI configuration on the BMC The BMC (Baseboard Management Controller) manages the "Management Plane" (OOB). It is responsible for power-on/off, health monitoring, and remote console access. It does not sit in the data path of the high-speed storage fabric. Moving data packets to and from the storage array is the job of the file system client and the network switches, not the BMC.
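A quick back-of-the-envelope check shows why the storage fabric is the prime suspect for a timeout in the save phase. The checkpoint size and write bandwidth below are illustrative figures, not measurements from any particular cluster:

```shell
# Time to flush one checkpoint = checkpoint size / aggregate write bandwidth.
# Example: a 2 TB model state against 40 GB/s of parallel-file-system
# write bandwidth (illustrative numbers).
checkpoint_gb=2000
write_bw_gbps=40
echo "$(( checkpoint_gb / write_bw_gbps )) s per checkpoint flush"
# prints: 50 s per checkpoint flush
```

If the framework's I/O timeout is shorter than this flush time (plus any fabric contention), the save phase fails with a connection timeout even though every GPU is healthy.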
Question 47 of 60
47. Question
A network engineer is configuring a BlueField-3 Data Processing Unit (DPU) to act as a secure offload engine for a multi-tenant AI cluster. The requirement is to isolate the management traffic from the data traffic while ensuring the DPU can perform hardware-accelerated encryption. Which action is necessary to correctly manage the DPU physical and logical interfaces for this deployment?
Correct
Correct: D Configure the DPU in Separated mode where the ARM cores manage the OOB interface and the network ports are assigned to the host as virtual functions.
The Technical Reason: To achieve true isolation in a multi-tenant environment, the management plane and the data plane must be decoupled.
Separated Mode (Symmetric Model): In this mode, the DPU partitions its resources so that the internal Arm cores have their own dedicated network function (MAC/IP) for management tasks, while the x86 host sees the high-speed network ports.
Isolation: The Arm cores can manage the OOB (Out-of-Band) 1GbE interface and provide services like firewalling and telemetry without the host CPU seeing that traffic.
Hardware Acceleration: Even in this mode, the DPU's hardware engines (like the NVIDIA ConnectX-7 core inside the DPU) can perform encryption (IPsec/TLS) and switching (OVS) offloads.
The NCP-AII Context: The blueprint requires you to "Confirm FW/SW on BlueField-3" and "Configure initial parameters." Understanding that Separated mode allows for independent management of the DPU's SoC while providing high-speed data paths to the host is essential for building secure AI clouds.
Incorrect Options: A. Flash the DPU with standard ConnectX-7 firmware While a BlueField-3 contains a ConnectX-7 core, flashing it with standard NIC firmware would disable the Arm cores entirely. This effectively "downgrades" the DPU to a standard NIC, removing its ability to act as a secure offload engine or manage management traffic independently from the host. This defeats the purpose of deploying a DPU in the first place.
B. Manually bridge management with InfiniBand fabric Bridging the OOB management network with the high-speed data fabric is a major security risk and a performance bottleneck. Management traffic (low-speed, high-security) should always remain isolated from the data traffic (high-speed, lower-security). Furthermore, routing between VLANs is typically handled by the DPU's internal switch silicon via DOCA or OVS-kernel offload, not by a manual software bridge that would consume DPU Arm cycles.
C. Enable MIG (Multi-Instance GPU) on the BlueField-3 DPU MIG is an exclusive feature of NVIDIA GPUs (like the H100 or A100). BlueField-3 DPUs do not have Tensor Cores or the GPU architecture required for MIG. Networking multi-tenancy on a DPU is achieved via SR-IOV (Single Root I/O Virtualization) or VirtIO, which creates Virtual Functions (VFs) to partition the network bandwidth, not MIG slices.
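The DPU's operating mode can be confirmed from the host with Mellanox Firmware Tools (mlxconfig). The sketch below parses the output of a hypothetical `mlxconfig -d <dev> q INTERNAL_CPU_MODEL` query; the device path and the exact output layout are illustrative assumptions, not taken from this question.

```python
import re

# Hypothetical capture of `mlxconfig -d /dev/mst/mt41692_pciconf0 q INTERNAL_CPU_MODEL`.
# Device path and output layout are assumptions for illustration only.
SAMPLE_OUTPUT = """
Device #1:
----------
Configurations:                      Next Boot
         INTERNAL_CPU_MODEL          SEPARATED_HOST(0)
"""

def parse_cpu_model(mlxconfig_output: str) -> str:
    """Return the INTERNAL_CPU_MODEL value (e.g. SEPARATED_HOST or EMBEDDED_CPU)."""
    match = re.search(r"INTERNAL_CPU_MODEL\s+([A-Z_]+)\(\d\)", mlxconfig_output)
    if match is None:
        raise ValueError("INTERNAL_CPU_MODEL not found in mlxconfig output")
    return match.group(1)

print(parse_cpu_model(SAMPLE_OUTPUT))  # SEPARATED_HOST
```

A check like this is useful in node-provisioning scripts, since a DPU accidentally left in the wrong mode will silently break the intended management/data separation.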
Question 48 of 60
48. Question
An AI infrastructure node is failing to reach the 400Gbps line rate on its InfiniBand interface. The administrator suspects a faulty network card. Which sequence of actions should be taken to identify and replace the faulty component?
Correct
Correct: D. Confirm the fault using 'ibstat' and 'mlxlink', power down the node, replace the ConnectX or BlueField card, and verify the new card's firmware after boot.
The Technical Reason: The diagnostic and replacement workflow for high-speed networking must follow a standardized sequence:
Diagnosis: Tools like ibstat verify the port's logical state (e.g., Active vs. Down) and link speed. mlxlink provides lower-level physical details, such as the actual negotiated speed, Bit Error Rates (BER), and lane-specific status, which is crucial for identifying if a 400Gbps (NDR) link is underperforming due to a faulty ASIC or transceiver.
Physical Replacement: Once the hardware fault is isolated to the card, the server must be gracefully shut down and powered off to prevent electrical damage during the swap.
Post-Replacement Validation: New cards (ConnectX-7 or BlueField-3) often ship with factory firmware that may not match the "Validated Recipe" of the cluster. Verification and potential alignment of firmware are mandatory steps to ensure the new card participates correctly in the fabric.
The NCP-AII Context: The exam validates that you can use the Mellanox Firmware Tools (MFT) and InfiniBand Diagnostic Utilities to confirm component health before performing physical maintenance.
Incorrect Options: A. Flash switch firmware onto the network card: Switch firmware (e.g., for a Quantum-2 switch) and adapter firmware (for a ConnectX-7) are fundamentally different and incompatible. Attempting to flash one onto the other would result in a "brick" (permanently disabled hardware). Additionally, software/firmware cannot "compensate" for physical ASIC damage; physical damage always requires hardware replacement.
B. Delete the node from BCM and wait for auto-ejection: Base Command Manager (BCM) is a powerful orchestration tool, but it does not possess mechanical robotics to "physically eject" cards from a PCIe slot. While BCM can automate software re-imaging or health monitoring, a technician must still manually replace the physical hardware.
C. Use 'nvidia-smi' to re-assign MAC addresses to GPUs: nvidia-smi is used for managing NVIDIA GPUs, not network cards. Furthermore, MAC addresses belong to the Data Link Layer (Layer 2) of a network adapter and cannot be "assigned" to a GPU to make its CUDA cores act as a network interface. Network processing and general-purpose GPU computing are handled by distinct hardware architectures.
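The diagnosis step scripts well: check the negotiated rate that ibstat reports against the expected NDR line rate. The sketch below parses an abridged sample of ibstat-style output; the sample values are made up for illustration, not taken from a real node.

```python
import re

# Abridged, illustrative ibstat output for one HCA port (values are made up).
IBSTAT_SAMPLE = """
CA 'mlx5_0'
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Link layer: InfiniBand
"""

def port_health(ibstat_output: str, expected_rate_gbps: int = 400):
    """Return (state, rate, healthy) parsed from ibstat-style output."""
    state = re.search(r"State:\s+(\w+)", ibstat_output).group(1)
    rate = int(re.search(r"Rate:\s+(\d+)", ibstat_output).group(1))
    healthy = state == "Active" and rate >= expected_rate_gbps
    return state, rate, healthy

state, rate, healthy = port_health(IBSTAT_SAMPLE)
print(state, rate, healthy)  # Active 200 False
```

Here the port is logically Active but negotiated only 200Gbps, which is exactly the "Active yet underperforming" case where you would escalate to mlxlink for BER and per-lane detail before pulling the card.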
Question 49 of 60
49. Question
An AI cluster is experiencing storage bottlenecks during the 'checkpointing' phase of training. An administrator decides to optimize the storage layer. Which technology should be implemented to allow the GPUs to write data directly to the NVMe-over-Fabrics storage without involving the host CPU?
Correct
Correct: C. NVIDIA Magnum IO GPUDirect Storage (GDS), which establishes a direct DMA path between GPU memory and the storage controllers.
The Technical Reason: During "checkpointing," large amounts of GPU memory state (weights, gradients, optimizer states) must be saved to persistent storage.
The Problem: Traditional I/O requires a "bounce buffer" in CPU system memory, leading to high CPU overhead and latency.
The GDS Solution: GPUDirect Storage (GDS) allows a Direct Memory Access (DMA) engine to move data directly between GPU VRAM and NVMe storage (local or remote over fabrics like NVMe-oF). This bypasses the CPU and system memory entirely.
The Benefit: It significantly reduces I/O wait times, lowers CPU utilization, and enables the cluster to saturate the high-bandwidth network/storage fabric, which is critical for large-scale LLM training where checkpoints can be hundreds of gigabytes.
The NCP-AII Context: The exam validates your understanding of the Magnum IO stack. GDS is the primary technology used to solve "North-South" (Storage-to-GPU) performance issues in an NVIDIA-Certified architecture.
Incorrect Options: A. The Slurm 'checkpoint' command via OOB management: Slurm is a workload manager, not a data-transfer engine. While it can trigger checkpointing scripts, sending data through the OOB (Out-of-Band) management network (typically 1GbE or 10GbE) is a critical error. The OOB network is meant for management (IPMI/Redfish), not for high-throughput training data. Using it for checkpoints would stall the cluster for hours.
B. A third-party RAID controller using CUDA cores for parity: RAID controllers have their own dedicated processors (ASICs) or use the host CPU for parity calculations. There is no standard NVIDIA-certified architecture where a RAID controller offloads parity math to a GPU's CUDA cores. Furthermore, this does not address the data-path bottleneck between the GPU and the storage controller.
D. Standard NFS with the 'async' flag: While the async flag in NFS can speed up the "perception" of a write by buffering it in CPU system RAM, it does not remove the CPU from the data path. In fact, it increases CPU and system RAM pressure. Additionally, standard NFS lacks the RDMA capabilities required to match the performance of NVMe-over-Fabrics (NVMe-oF) used in high-performance AI clusters.
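To see why checkpoint I/O is so heavy, it helps to estimate the checkpoint footprint. The sketch below assumes a common mixed-precision Adam layout (bf16 model weights plus an fp32 master copy and two fp32 optimizer moments); the per-parameter byte counts are assumptions for illustration, and real checkpoint formats vary.

```python
def checkpoint_size_gb(n_params: float,
                       weight_bytes: int = 2,   # bf16 model weights
                       master_bytes: int = 4,   # fp32 master copy of the weights
                       optim_bytes: int = 8):   # two fp32 Adam moments (m and v)
    """Rough checkpoint footprint in gigabytes for mixed-precision Adam training."""
    total_bytes = n_params * (weight_bytes + master_bytes + optim_bytes)
    return total_bytes / 1e9

# A 70B-parameter model under these assumptions:
print(round(checkpoint_size_gb(70e9)))  # 980
```

Roughly a terabyte per checkpoint under these assumptions, which is why removing the CPU bounce buffer from the write path matters so much at scale.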
Question 50 of 60
50. Question
A deployment team is configuring the Out-of-Band (OOB) management network for a new NVIDIA-based server farm. They must ensure that the BMC is reachable for remote firmware updates of the HGX baseboard. What is the recommended sequence for performing a firmware upgrade on an HGX system to ensure component compatibility and system stability?
Correct
Correct: A. Update the BMC firmware first, then the BIOS/UEFI, followed by the HGX baseboard firmware, and then the individual GPU firmware.
The Technical Reason: This sequence follows the hardware dependency hierarchy of an NVIDIA-Certified system:
BMC (Baseboard Management Controller): The BMC is the "root of management." It must be updated first to ensure it can correctly monitor power, thermals, and provide the Redfish/IPMI interfaces needed to manage the rest of the update process.
BIOS/UEFI: The BIOS initializes the PCIe bus and CPU-to-GPU pathways. Updated BIOS versions are often required to support newer HGX firmware features or to correctly "map" the massive memory space of modern GPUs (e.g., Above 4G Decoding/Resizable BAR).
HGX Baseboard/NVSwitch: This firmware coordinates the internal NVLink Fabric. It must be stable before the GPUs themselves are flashed to ensure the high-speed interconnect is ready to train.
GPU Firmware (VBIOS): This is the final step. The GPU firmware depends on all underlying layers (Power, PCIe, and Fabric) being correctly initialized.
The NCP-AII Context: The certification validates the "Validated Recipe" philosophy. NVIDIA releases these updates as a bundled stack. Attempting to update a GPU VBIOS while the BMC or BIOS is on an incompatible older version can lead to "bricked" hardware or thermal throttling errors.
Incorrect Options: B. Update the OS drivers first, then the GPU firmware, and finally the BMC: This is the reverse of the required order. OS drivers (the "top" of the stack) require the underlying firmware to be compatible to load correctly. Furthermore, updating the BMC last is risky; if the new GPU firmware requires new power or thermal profiles that the old BMC doesn't recognize, the system might trigger a safety shutdown or fail to boot.
C. Flash all components simultaneously using a broadcast script: Simultaneous flashing is extremely dangerous in an HGX environment. Because these components have functional dependencies (e.g., the NVSwitch depends on BIOS settings), flashing them all at once can cause race conditions. A failure in one component during a broadcast could leave the system in an inconsistent state where the BMC can no longer reach the baseboard to recover it.
D. Use the NVIDIA Container Toolkit with the --update-firmware flag: This is a distractor. The NVIDIA Container Toolkit is a software utility used to expose GPUs to Docker/Podman containers. It operates at the application layer and has no capability or flag (--update-firmware) to perform low-level hardware firmware updates for the BMC, BIOS, or HGX baseboard. Hardware updates are performed via the BMC Web UI, Redfish APIs, or tools like nvfwupd.
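The dependency hierarchy described above is essentially a topological ordering problem, which the standard library can express directly. The component names and edges below simply mirror the sequence in this answer; the code only illustrates the ordering logic, not an actual update tool.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each component maps to the set of components that must be updated before it,
# mirroring the BMC -> BIOS -> HGX baseboard -> GPU VBIOS hierarchy in the answer.
deps = {
    "BMC": set(),
    "BIOS/UEFI": {"BMC"},
    "HGX baseboard": {"BIOS/UEFI"},
    "GPU VBIOS": {"HGX baseboard"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # ['BMC', 'BIOS/UEFI', 'HGX baseboard', 'GPU VBIOS']
```

Modeling the recipe as a dependency graph rather than a hard-coded list makes it easy to add side branches (e.g., NIC firmware) later without breaking the ordering guarantees.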
Question 51 of 60
51. Question
During the final verification of an AI factory, the team executes a High-Performance Linpack (HPL) test. The results show a significant 'Rmax' value drop compared to the 'Rpeak' theoretical performance. Which cluster-level assessment tool is best suited for identifying if the issue is a specific 'limping' node or a general network congestion issue?
Correct
Correct: C. ClusterKit; it performs multifaceted node assessments and identifies outliers in performance across the entire cluster.
The Technical Reason: ClusterKit (part of the HPC-X toolkit) is specifically designed for the "Validation" phase of a deployment. While HPL measures aggregate cluster performance, it cannot pinpoint why a drop occurred. ClusterKit runs a suite of tests (bandwidth, latency, memory bandwidth, GFLOPS) across all nodes and generates a report that highlights outliers.
"Limping Node" Detection: If one node has a faulty PCIe riser or thermal throttling, ClusterKit will flag it as an outlier compared to the cluster average.
Network vs. Compute: By running both intra-node (GPU-to-GPU) and inter-node (Network) tests, it can determine if the bottleneck is the InfiniBand fabric or the individual servers.
The NCP-AII Context: The exam blueprint explicitly lists "Run ClusterKit to perform a multifaceted node assessment" as a key task under Cluster Test and Verification. It is the standard tool for moving from "the cluster is slow" to "Node 04 has a sub-optimal HCA."
Incorrect Options: A. The 'ping' utility: ping uses ICMP to check if a node is "alive." It does not measure bandwidth, latency at load, or GPU performance. A node can respond to a ping perfectly while having a hardware fault that causes it to perform at 10% of its intended GFLOPS during an HPL run.
B. The DOCA Benchmarking tool: While DOCA tools are essential for BlueField DPU management, "DOCA Benchmarking" is focused on the DPU's networking and offload capabilities. It is not the primary tool used to diagnose general GPU compute performance drops in an HPL (High-Performance Linpack) context, which is a GPU-heavy workload.
D. The Slurm 'squeue' command: squeue is a job scheduling utility that shows the status of active and pending jobs. It provides information about resource allocation but gives zero telemetry regarding the performance or health of the underlying hardware. It cannot identify a "limping" node.
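The outlier-detection idea behind this kind of assessment can be sketched with simple statistics: compare each node's measured result against the cluster median and flag large deviations. The per-node bandwidth figures and the tolerance threshold below are purely illustrative, not ClusterKit's actual algorithm or output.

```python
from statistics import median

# Illustrative per-node bandwidth results in GB/s (values are made up).
results = {"node01": 48.9, "node02": 49.2, "node03": 48.7,
           "node04": 23.1, "node05": 49.0, "node06": 48.8}

def find_limping_nodes(bw_by_node: dict, tolerance: float = 0.8):
    """Flag nodes delivering less than `tolerance` times the cluster median bandwidth."""
    med = median(bw_by_node.values())
    return sorted(n for n, bw in bw_by_node.items() if bw < tolerance * med)

print(find_limping_nodes(results))  # ['node04']
```

Using the median rather than the mean keeps a single badly limping node from dragging the baseline down and masking itself, which is the same reason outlier reports compare against cluster-wide distributions rather than raw averages.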
Question 52 of 60
52. Question
A team is performing a NeMo burn-in test as part of the final cluster verification. This test is designed to run for an extended period. What is the primary objective of running a burn-in test like NeMo or a heavy NCCL/HPL loop before handing the cluster over to the data science users?
Correct: A To identify ‘infant mortality’ in hardware components and ensure the system’s thermal and power stability under a sustained, realistic AI workload.
The Technical Reason: AI clusters are composed of thousands of sensitive components (GPUs, transceivers, power supply units).
Infant Mortality: This term refers to hardware components that fail early in their life cycle. A high-stress test like a NeMo burn-in pushes all components to their limits simultaneously, triggering these latent failures before the system is handed over to users.
Thermal/Power Stability: Unlike a 5-minute benchmark, an extended burn-in (often 24–72 hours) allows the data center’s cooling and the servers’ internal fans to reach a thermal steady state. It validates that the power distribution units (PDUs) can handle the massive, fluctuating current draws of a real LLM training job without tripping breakers.
The NCP-AII Context: The exam blueprint requires candidates to “Perform NeMo burn-in” and “Validate hardware operation for workloads.” This objective is critical because “soft errors,” such as a GPU that only throttles after 4 hours of 100% utilization or an InfiniBand link that drops packets only when the switch hits peak temperature, can only be caught during a sustained, full-stack burn-in.
Incorrect Options: B. To train a production-ready LLM to sell immediately: While the NeMo framework is used for training LLMs, a burn-in test typically uses synthetic or public datasets (like WikiText) specifically to stress the hardware. The infrastructure engineer’s goal is validation, not model production. Training a commercial-grade LLM requires hyperparameter tuning and data curation that are outside the scope of infrastructure bring-up.
C. To clear switch cache and reset firmware to factory defaults: A burn-in test does the opposite of resetting to factory defaults; it validates the customized production configuration (MTU sizes, PFC settings, and specific firmware versions). Resetting to factory defaults would erase the “Validated Recipe” configuration required for the cluster to function at scale. Furthermore, switch caches are cleared automatically by standard network operations and do not require a multi-day LLM test to “reset.”
D. To verify NVIDIA AI Enterprise license activation: License activation is a control-plane task usually handled during OS or software-stack installation (using gridd or the NVIDIA License System). While a burn-in test might fail if the software isn’t licensed (due to performance caps), verifying the license is a simple administrative check, not the primary objective of a complex, high-energy hardware stress test.
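The “throttles only after hours at full load” failure mode is exactly what periodic telemetry during a burn-in catches. A minimal monitoring sketch, assuming GPU temperatures are sampled in the shape of nvidia-smi’s CSV query output during the run; the sample text and the 83 °C threshold are illustrative assumptions, not NVIDIA limits:

```python
THERMAL_LIMIT_C = 83  # hypothetical alert threshold for the burn-in report

def hot_gpus(csv_text, limit=THERMAL_LIMIT_C):
    """Parse 'index, temperature' CSV lines and return (gpu_index, temp_c)
    pairs at or above `limit`, i.e., candidates for thermal throttling."""
    flagged = []
    for line in csv_text.strip().splitlines():
        idx, temp = (field.strip() for field in line.split(","))
        if int(temp) >= limit:
            flagged.append((int(idx), int(temp)))
    return flagged

# Canned sample in the shape of:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits
sample = "0, 71\n1, 69\n2, 88\n3, 72"
print(hot_gpus(sample))  # -> [(2, 88)]
```

Logging a sample like this every few minutes over a 24–72 hour burn-in is what turns a vague “the node is slow sometimes” into “GPU 2 hits its thermal limit after hour four.”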
Question 53 of 60
53. Question
A system administrator receives an alert that an NVIDIA H100 GPU in a cluster node has entered a ‘fallen off the bus’ state. The nvidia-smi command shows the GPU is missing, and dmesg reports a PCIe bus error. After a warm reboot fails to resolve the issue, what is the next best troubleshooting step to identify whether the fault is with the GPU or the PCIe slot?
Correct: C Perform a cold boot of the system, and if the issue persists, move the GPU to a known-working PCIe slot to see if the error follows the card.
The Technical Reason: Cold Boot: Unlike a warm reboot, a cold boot (complete power cycle) fully discharges the capacitors and resets the PCIe training sequence and the GPU’s internal firmware state. This can often clear transient electrical glitches that a warm reboot cannot.
Slot Swapping: This is the standard “isolation” procedure in data center hardware maintenance. If the error follows the card to a new slot, the GPU’s internal PCIe controller or circuitry is likely faulty. If the error stays with the original slot even with a different card, the motherboard/riser or the PCIe lanes of that specific CPU socket are the root cause.
The NCP-AII Context: The exam validates your ability to “Identify and troubleshoot hardware faults” using a logical process of elimination. Moving components into known-good configurations is the gold standard for hardware diagnostics in the field.
Incorrect Options: A. Measure voltage on NVLink bridge pins while running: This is extremely dangerous and effectively impossible on a running H100 system without specialized laboratory equipment. NVLink bridges in modern HGX/H100 systems are high-density connectors or integrated baseboards. Probing these pins with a multimeter while the system is powered and under load would likely cause a short circuit and catastrophic hardware damage, and it poses a safety risk to the administrator.
B. Reinstall the driver with the ‘--force’ flag: While driver corruption can cause detection issues, a “fallen off the bus” error reported in dmesg as a PCIe bus error (often accompanied by Xid 79) points to a hardware/link-layer failure. Reinstalling the driver will not fix a physical link that is down. PCIe training is a hardware-level handshake that occurs during the POST/boot phase, long before the NVIDIA driver is loaded into the kernel.
D. Update Slurm to ignore PCIe errors: Slurm is a job scheduler and cannot “ignore” hardware-level PCIe errors. If the GPU has fallen off the bus, it is physically invisible to the operating system, and no scheduler configuration can make a missing GPU execute code. Furthermore, ignoring critical hardware errors in a production cluster can lead to system instability, kernel panics, and data corruption.
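The Xid 79 signature mentioned above can be pulled out of dmesg programmatically when triaging many nodes. A hedged sketch: the exact NVRM log format varies by driver version, so the regex and the single-entry Xid table here are approximations for illustration, not a complete diagnostic tool:

```python
import re

# Typical shape of an NVIDIA kernel-module event line:
#   NVRM: Xid (PCI:0000:3b:00): 79, ... GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")

FATAL_XIDS = {79: "GPU has fallen off the bus"}  # illustrative subset

def fatal_xids(dmesg_text):
    """Return (pci_address, xid_code) for known-fatal Xid events in a log."""
    return [(m.group(1), int(m.group(2)))
            for m in XID_RE.finditer(dmesg_text)
            if int(m.group(2)) in FATAL_XIDS]

log = "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus.\n"
print(fatal_xids(log))  # -> [('0000:3b:00', 79)]
```

Capturing the PCI address matters for the slot-swap isolation step: it tells you which physical slot to target before you open the chassis.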
Question 54 of 60
54. Question
During the management of the physical layer in an AI factory, an engineer notices a high rate of CRC (Cyclic Redundancy Check) errors on a BlueField-3 network link. The link is currently running at 400Gb/s. What is the most likely cause of these errors when using high-speed optical transceivers, and how should it be addressed according to professional standards?
Correct: A The optical fiber connectors are likely contaminated with dust or oil; they should be cleaned with a specialized fiber cleaner and inspected with a scope.
The Technical Reason: High-speed NDR (400Gb/s) links use PAM4 modulation, which is far more sensitive to signal-to-noise ratio than older NRZ (100Gb/s) links.
Contamination: A single speck of dust or a fingerprint on the fiber end-face (MPO/MTP or OSFP/QSFP112 connectors) acts as an attenuator and a source of back-reflection. This leads to Cyclic Redundancy Check (CRC) errors as the receiver struggles to distinguish between the four distinct PAM4 voltage levels.
Standard Procedure: NVIDIA professional standards dictate a “Clean and Inspect” protocol. Engineers must use lint-free specialized cleaners (like click-pens) and a fiber inspection scope to ensure the core is pristine before mating the connector.
The NCP-AII Context: The certification emphasizes that physical-layer health is the foundation of the AI Factory. CRC errors are almost always a symptom of physical-layer issues (bad cables, dirty transceivers, or poor seating) rather than software bugs.
Incorrect Options: B. GPU clock speed causing EMI in the fiber: Optical fibers carry data as photons (light), not electrons, so they are inherently immune to electromagnetic interference (EMI) from nearby electronics like GPUs or power supplies. While EMI can affect copper DAC cables, it cannot disrupt the signal inside a glass fiber.
C. Slurm scheduler causing the card to overheat: While high workload intensity can heat up a DPU or GPU, NVIDIA-Certified systems are designed with sophisticated thermal management that throttles performance or increases fan speeds before hardware begins “miscalculating” checksums. Overheating typically results in a thermal shutdown or performance throttling, not isolated CRC errors on a network link.
D. Old Linux kernel affecting the CRC algorithm: CRC (Cyclic Redundancy Check) is a mathematical operation performed in hardware (the ASIC) on the BlueField-3 DPU, not in Linux kernel software. The kernel merely reports the statistics the hardware provides. Updating the kernel will not change how the hardware calculates checksums or fix a link that is physically dropping bits due to signal degradation.
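To turn a raw CRC counter into a go/no-go judgment, you can normalize it by the number of bits the link carried in the observation window. A back-of-the-envelope sketch; the 1e-12 bit-error-rate target is a commonly used order of magnitude for healthy high-speed links, an assumption here rather than a quoted NVIDIA spec:

```python
def bit_error_rate(crc_errors, seconds, line_rate_bps=400e9):
    """Approximate BER: errored frames / total bits carried in the window.
    (Crude: each CRC error is counted as at least one flipped bit.)"""
    return crc_errors / (line_rate_bps * seconds)

# Hypothetical reading: 5,000 CRC errors in 10 minutes on a 400 Gb/s link.
ber = bit_error_rate(5_000, 600)
print(f"{ber:.1e}")  # -> 2.1e-11
print(ber > 1e-12)   # -> True: well above target; clean and inspect the fiber
```

The same counter that looks alarming on a 10 Gb/s link can be negligible at 400 Gb/s, which is why normalizing to a rate matters before pulling a transceiver.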
Question 55 of 60
55. Question
A storage optimization task requires reducing the I/O wait times for a large-scale training job. Which of the following strategies would provide the most significant performance improvement for an NVIDIA-certified AI infrastructure?
Correct: D Implementing NVIDIA GPUDirect Storage (GDS) to enable a direct DMA path between the network interface card and GPU memory, bypassing the CPU buffer.
The Technical Reason: In traditional storage architectures, data must be copied from the storage/network into a “bounce buffer” in the CPU’s system RAM before being moved to the GPU. This creates a CPU bottleneck where the processor becomes overwhelmed by memory-copy operations.
Direct Path: Magnum IO GPUDirect Storage (GDS) creates a direct path for Direct Memory Access (DMA) between GPU memory (VRAM) and storage (local NVMe or remote NVMe-oF).
The Benefit: It significantly reduces I/O latency, lowers CPU utilization, and increases the overall bandwidth available for training. This is particularly vital for checkpointing (saving large model states), which can otherwise stall a cluster for minutes or hours.
The NCP-AII Context: The certification blueprint emphasizes “Optimizing storage” and understanding the “Data Fabric.” GDS is the flagship NVIDIA technology for removing the CPU from the storage data path.
Incorrect Options: A. Upgrading the BMC firmware for faster Slurm polling: While keeping the BMC (Baseboard Management Controller) updated is a best practice for system stability and monitoring, it has zero impact on the high-speed data path used for training. The BMC manages out-of-band tasks (power, thermals, health); it does not touch the in-band I/O traffic between storage and GPUs.
B. Reducing MIG instances to allow more bandwidth for SATA boot drives: This option is fundamentally flawed for two reasons:
MIG (Multi-Instance GPU) manages compute and memory partitioning inside the GPU; it does not “free up” PCIe lanes for other devices like SATA controllers.
SATA drives are limited to 6Gb/s and are far too slow for AI training data. In an NVIDIA-certified architecture, training data should reside on NVMe or a high-performance parallel file system. The boot drive’s performance is irrelevant to the training job’s I/O wait times.
C. Moving datasets to the head node’s /tmp via the OOB network: The OOB (out-of-band) management network is typically a 1GbE or 10GbE network designed for administrative traffic. Using it to share massive training datasets would create a severe bottleneck. Furthermore, the head node is a single point of failure and lacks the aggregate bandwidth of the dedicated parallel storage fabric (such as Lustre, GPFS, or Weka) required to scale an AI cluster.
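The checkpointing argument can be made concrete with simple arithmetic: the time a cluster stalls on one checkpoint scales inversely with the effective bandwidth of the data path. The checkpoint size and the two throughput figures below are illustrative assumptions for the comparison, not measured GDS benchmarks:

```python
def checkpoint_seconds(checkpoint_gb, effective_gb_per_s):
    """Seconds to write a checkpoint of `checkpoint_gb` GB at a given rate."""
    return checkpoint_gb / effective_gb_per_s

ckpt_gb = 2_000  # assumed checkpoint size (weights + optimizer state)

# A CPU bounce-buffer path throttled by memcpy vs. a direct-DMA (GDS-style)
# path; both rates are assumptions chosen to illustrate the gap:
for label, rate in [("CPU bounce buffer", 5), ("Direct DMA path", 40)]:
    print(f"{label}: {checkpoint_seconds(ckpt_gb, rate):.0f} s")
```

At these assumed rates the stall drops from roughly 400 s to 50 s per checkpoint, and that difference is paid on every checkpoint interval for the life of the training job.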
Incorrect
Correct: D Implementing NVIDIA GPUDirect Storage (GDS) to enable a direct DMA path between the network interface card and the GPU memory, bypassing the CPU buffer.
The Technical Reason: In traditional storage architectures, data must be copied from the storage/network into a “bounce buffer“ in the CPU‘s system RAM before being moved to the GPU. This creates a “CPU bottleneck“ where the processor becomes overwhelmed by memory copy operations.
Direct Path: Magnum IO GPUDirect Storage (GDS) creates a direct path for Direct Memory Access (DMA) between the GPU memory (VRAM) and the storage (local NVMe or remote NVMe-oF).
The Benefit: It significantly reduces I/O latency, lowers CPU utilization, and increases the overall bandwidth available for training. This is particularly vital for checkpointing—saving large model states—which can otherwise stall a cluster for minutes or hours.
The NCP-AII Context: The certification blueprint emphasizes “Optimizing storage“ and understanding the “Data Fabric.“ GDS is the flagship NVIDIA technology for removing the CPU from the storage data path.
Incorrect Options: A. Upgrading the BMC firmware for faster Slurm polling While keeping the BMC (Baseboard Management Controller) updated is a best practice for system stability and monitoring, it has zero impact on the high-speed data path used for training. The BMC manages “Out-of-Band“ tasks (power, thermals, health); it does not touch the “In-Band“ I/O traffic between storage and GPUs.
B. Reducing MIG instances to allow more bandwidth for SATA boot drives This option is fundamentally flawed for two reasons:
MIG (Multi-Instance GPU) manages compute and memory partitioning inside the GPU; it does not “free up“ PCIe lanes for other devices like SATA controllers.
SATA drives are limited to 6Gbps and are far too slow for AI training data. In an NVIDIA-certified architecture, training data should reside on NVMe or high-performance parallel file systems. The boot drive‘s performance is irrelevant to the training job‘s I/O wait times.
C. Moving datasets to the head node's /tmp via OOB network The OOB (Out-of-Band) management network is typically a 1GbE or 10GbE network designed for administrative traffic. Using it to share massive training datasets would create a massive bottleneck. Furthermore, the head node is a single point of failure and lacks the aggregate bandwidth of a dedicated parallel storage fabric (like Lustre, GPFS, or Weka) which is required for scaling an AI cluster.
Question 56 of 60
56. Question
An administrator needs to partition an NVIDIA A100 GPU into multiple instances to support concurrent small-scale training jobs and inference services. Which technology should be configured, and what is a key requirement for this configuration to be persistent across system reboots according to professional AI infrastructure standards?
Correct: B Configure Multi-Instance GPU (MIG) and ensure the MIG mode is enabled via nvidia-smi with the reboot flag for the target GPU.
The Technical Reason: MIG Technology: Multi-Instance GPU (MIG) is the hardware-based partitioning feature for NVIDIA Ampere (A100) and Hopper (H100) architectures. It allows a single GPU to be split into up to seven independent instances, each with its own high-bandwidth memory, cache, and compute cores.
Persistence and State: To enable MIG, the GPU must be toggled into "MIG Mode." This is done with nvidia-smi -i <GPU_ID> -mig 1, where <GPU_ID> is the index of the target GPU.
The Reboot Requirement: On many server platforms (especially HGX/DGX), enabling or disabling MIG mode requires a GPU reset or a system reboot to re-enumerate the PCIe physical and virtual functions. The reset can be triggered with nvidia-smi -r (--gpu-reset) or a node reboot; once applied, the MIG mode setting is retained by the GPU across power cycles.
The NCP-AII Context: The exam validates that you can move a system from a "Single-Instance" (Default) to a "Multi-Instance" state. It specifically tests the knowledge that MIG provides hardware-level isolation, which is superior to software-level partitioning for multi-tenant security.
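The mode check described above can be scripted. The sketch below parses sample output of `nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader` to find GPUs whose MIG mode still needs enabling; the sample text is hard-coded so the snippet runs without a GPU, and the specific GPU states shown are invented for the example.

```python
# Minimal sketch: decide which GPUs still need MIG mode enabled, given CSV
# output from `nvidia-smi --query-gpu=index,mig.mode.current
# --format=csv,noheader`. The sample string stands in for a live query.

sample = """0, Enabled
1, Enabled
2, Disabled
3, Disabled"""

def gpus_needing_mig(csv_text: str) -> list[int]:
    """Return the indices of GPUs whose current MIG mode is not Enabled."""
    pending = []
    for line in csv_text.strip().splitlines():
        index, mode = (field.strip() for field in line.split(","))
        if mode != "Enabled":
            pending.append(int(index))
    return pending

print(gpus_needing_mig(sample))  # these GPUs would need `-mig 1` plus a reset
```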
Incorrect Options: A. Enable NVIDIA vGPU profiles within VMware ESXi While vGPU is a valid technology for virtualizing GPUs in a VDI or cloud environment, it is typically a software/hypervisor-layer solution. The NCP-AII certification focuses on the bare-metal AI infrastructure and the native hardware capabilities of the A100/H100. Furthermore, assigning MAC addresses to "GPU slices" is not how vGPU or MIG works; networking is handled by the DPU or NIC, not the GPU cores themselves.
C. Use Slurm to partition the GPU logically Slurm is a workload manager. While Slurm can be configured to recognize and assign MIG instances to specific jobs, it does not "partition the GPU logically" at the hardware level. If you only partition in slurm.conf without enabling MIG in the hardware/driver, there is no physical isolation between jobs, leading to "noisy neighbor" performance issues and security risks.
D. Configure NVLink Bridge settings in the BIOS NVLink is a high-speed interconnect for GPU-to-GPU communication. It is not used to partition a single GPU into smaller instances. Furthermore, the DOCA driver on a DPU manages networking and storage offloads; it does not manage the internal compute partitioning of a host-attached GPU via BIOS "virtual lanes."
Question 57 of 60
57. Question
When configuring a BlueField-3 DPU to support AI workloads, which feature must be correctly implemented to allow for efficient communication between the GPU memory and the network without involving the host CPU's system memory?
Correct: C GPUDirect RDMA, which requires the peer-to-peer (P2P) capability to be supported and enabled between the DPU and the GPU over the PCIe bus.
The Technical Reason: GPUDirect RDMA (Remote Direct Memory Access) is the foundational technology that allows a network interface (like the ConnectX-7 core inside a BlueField-3 DPU) to read from or write to GPU memory directly.
Bypassing the CPU: Without this, data would have to be copied from GPU memory to a "bounce buffer" in the host's system RAM (managed by the CPU) before being sent to the network. This adds significant latency and consumes CPU cycles.
P2P Requirement: For this to work, the PCIe hierarchy must support Peer-to-Peer (P2P) transfers. This often involves ensuring that both the GPU and DPU are connected to the same PCIe switch or root complex and that features like Access Control Services (ACS) are correctly configured (or disabled/relaxed) to allow the devices to "talk" to each other without host intervention.
The NCP-AII Context: The certification expects you to know how to validate this data path. Tools like ib_write_bw (with the --use_cuda flag) are used to confirm that RDMA is successfully pulling data directly from the GPU VRAM.
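The P2P placement requirement can be sanity-checked with `nvidia-smi topo -m`, which prints a link-type code for each device pair. The snippet below ranks those codes against a hard-coded example table; the GPU/NIC pairings and their codes are hypothetical, while the code names (NV#, PIX, PXB, PHB, NODE, SYS) are the ones nvidia-smi uses.

```python
# Sketch: flag GPU<->NIC pairs whose PCIe relationship is too distant for
# efficient peer-to-peer DMA. Link codes mirror `nvidia-smi topo -m`;
# the table of pairs below is an invented example, not real output.

# Rough ranking of topology codes, best to worst for P2P transfers:
P2P_QUALITY = {"NV#": 0, "PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}

links = {
    ("GPU0", "NIC0"): "PIX",  # same PCIe switch: ideal for GPUDirect RDMA
    ("GPU1", "NIC1"): "PXB",  # multiple PCIe switches, still P2P-capable
    ("GPU2", "NIC2"): "SYS",  # crosses the inter-socket link: P2P suffers
}

# Anything at or beyond "traverses the host bridge" is a red flag here.
bad = [pair for pair, code in links.items()
       if P2P_QUALITY[code] >= P2P_QUALITY["PHB"]]
print(bad)
```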
Incorrect Options: A. Slurm scheduler's DPU-plugin for MIG-like ARM partitioning This is a distractor that blends two different concepts. Slurm is a job scheduler, and while it can manage DPU resources, it doesn't "partition ARM cores into MIG-like instances." MIG (Multi-Instance GPU) is a hardware-specific feature of GPUs (A100/H100), not the ARM processors on a DPU. Furthermore, SSH management has no impact on the high-speed data path between GPU memory and the network.
B. NVIDIA SMI migration to DPU DDR5 memory There is no feature called "NVIDIA SMI migration" that moves GPU pages to the DPU's local memory to solve network congestion. While the BlueField-3 has its own DDR5 memory, it is used for the DPU's internal OS (DOCA), management tasks, and storage offloads (like BlueField SNAP), not as a spillover buffer for GPU training memory.
D. ERSPAN for real-time backup ERSPAN is a networking protocol used for port mirroring—sending a copy of network traffic to a remote monitoring device for analysis. It is a diagnostic tool, not a performance optimization for AI workloads. It does not facilitate direct memory access and would actually increase overhead rather than reduce it.
Question 58 of 60
58. Question
A cluster administrator is performing a single-node stress test and a High-Performance Linpack (HPL) benchmark on a new H100 node. The HPL results are significantly lower than the theoretical peak. Upon investigation, they notice that the NVLink Switch status indicates some links are down. Why is NVLink health critical for HPL performance on a multi-GPU node?
Correct: A NVLink provides the high-bandwidth, low-latency interconnect required for the massive All-Reduce and exchange operations between GPUs during the matrix factorization.
The Technical Reason: HPL solves a dense system of linear equations (Ax = b) using LU factorization. In a multi-GPU node (like an HGX H100), the matrix is distributed across all GPUs.
The Communication Bottleneck: As the GPUs compute their local portions of the matrix, they must constantly synchronize and exchange data (specifically using "All-Reduce" and broadcast operations).
NVLink vs. PCIe: Standard PCIe Gen5 provides about 64GB/s per x16 slot. In contrast, the NVLink fabric on an H100 provides up to 900GB/s of bidirectional bandwidth.
Impact of Failed Links: If NVLink switches or individual links are down, the GPUs are forced to "fall back" to the PCIe bus for these exchanges. Because PCIe bandwidth is more than 10x slower than NVLink, the GPUs spend most of their time waiting for data rather than performing floating-point math, leading to a massive drop in the observed Rmax (actual performance) compared to the Rpeak (theoretical max).
The NCP-AII Context: The exam validates that you can use nvidia-smi nvlink -s to verify link status. A common troubleshooting scenario involves identifying why a node passes individual GPU stress tests but fails to reach HPL performance targets: the answer is almost always a sub-optimal or faulty high-speed interconnect.
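A quick Amdahl-style estimate shows why the PCIe fallback is so damaging to Rmax. The bandwidth figures are the nominal ones quoted above; the 30% communication share of the run is an assumed illustration, not a measured HPL profile.

```python
# Back-of-envelope model of an NVLink -> PCIe fallback during HPL.
# Bandwidths are nominal; the communication fraction is assumed.

NVLINK_GBPS = 900.0  # H100 total bidirectional NVLink bandwidth
PCIE_GBPS = 64.0     # one PCIe Gen5 x16 slot

def relative_speed(comm_fraction: float, slowdown: float) -> float:
    """Amdahl-style estimate: overall speed relative to the healthy run,
    when only the communication share is slowed by `slowdown`x."""
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction * slowdown)

slowdown = NVLINK_GBPS / PCIE_GBPS  # per-exchange slowdown, ~14x
efficiency = relative_speed(0.3, slowdown)
print(f"exchanges become ~{slowdown:.1f}x slower; a run that is 30% "
      f"communication drops to ~{efficiency:.0%} of its NVLink-healthy speed")
```

Even with most of the runtime spent in compute, a 14x slowdown on the communication share alone collapses the whole run to a fraction of Rpeak, which matches the symptom described in the question.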
Incorrect: B. GPUs communicate via the BMC management network The BMC (Baseboard Management Controller) network is an "Out-of-Band" (OOB) network, typically 1Gb/s, used for management tasks like power control and sensor monitoring. It is electrically and logically isolated from the high-speed data plane. GPUs never use the BMC network for training or HPL data exchange.
C. NVLink is only used for video output This is a misconception often carried over from consumer SLI bridges. In the AI infrastructure world, NVLink is strictly a data interconnect for peer-to-peer (P2P) memory access and collective operations. High-end data center GPUs like the H100 do not even have video output ports (DisplayPort/HDMI).
D. HPL uses NVLink to synchronize with SSD storage NVLink is a GPU-to-GPU (and in some cases GPU-to-CPU) interconnect. It does not connect to or synchronize with SSD storage. Storage synchronization is handled by the storage controller and the system clock over the PCIe bus or the network fabric (InfiniBand/Ethernet).
Question 59 of 60
59. Question
An infrastructure engineer is validating the cabling and transceiver configuration for a high-performance compute cluster. To ensure optimal performance for GPUDirect RDMA across a rail-optimized network topology, the engineer must verify the signal quality of the InfiniBand links. Which specific metric or tool provides the most accurate assessment of physical layer stability and cable integrity for NDR 400Gb/s links during the hardware validation phase?
Correct: C The Link Error Monitor (LEM) counters and Bit Error Rate (BER) values
The Technical Reason: High-speed networks like NDR use PAM4 modulation, which is highly sensitive to signal noise and physical impairments.
BER (Bit Error Rate): This is the primary metric for signal quality. NVIDIA-certified links are expected to maintain an extremely low pre-FEC (Forward Error Correction) BER to ensure that the hardware can successfully correct any minor flips without dropping packets.
LEM (Link Error Monitor): This hardware feature on NVIDIA Quantum-2 switches and ConnectX-7 adapters continuously monitors the link quality. If the BER exceeds a specific threshold, the LEM will flag the link as unstable or even "downshift" it to protect data integrity.
Validation Tool: Professionals use ibdiagnet --get_phy_info or mlxlink to pull these specific counters. High symbol errors or a BER approaching the 10⁻¹² threshold typically indicate a dirty fiber connector or a faulty transceiver.
The NCP-AII Context: The exam validates that you can move beyond basic "link up/down" status. You must be able to verify signal quality to ensure that GPUDirect RDMA can operate at peak efficiency without retransmissions caused by physical layer noise.
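The arithmetic behind that threshold is worth seeing once: at a 400Gb/s line rate, even a BER that sounds vanishingly small produces a steady stream of raw bit errors that FEC must absorb. The figures below are nominal line-rate arithmetic, not measurements from any specific link.

```python
# Quick arithmetic: raw (pre-FEC) bit errors per hour at NDR line rate
# for two example BER values. Purely illustrative.

LINE_RATE_BPS = 400e9  # NDR 400Gb/s

def raw_errors_per_hour(ber: float) -> float:
    """Expected raw bit errors per hour at the given bit error rate."""
    return LINE_RATE_BPS * ber * 3600

healthy = raw_errors_per_hour(1e-15)   # a clean link: ~1 error/hour
marginal = raw_errors_per_hour(1e-12)  # near the threshold cited above

print(f"BER 1e-15: {healthy:.2f} errors/hour; BER 1e-12: {marginal:.0f} errors/hour")
```

The three-orders-of-magnitude jump is why `mlxlink`-style counters distinguish pre-FEC from post-FEC error rates: the link can still be "up" while FEC is quietly working overtime.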
Incorrect: A. GPU temperature delta during CUDA execution While GPU temperatures are part of the System and Server Bring-up validation, they do not provide any information about the network's physical layer. A GPU can be thermally stable while the InfiniBand link it is connected to is dropping 10% of its packets due to a crimped fiber cable.
B. Standard Ethernet ping latency A "ping" uses small ICMP packets and primarily tests the software stack's reachability. It does not stress the high-bandwidth capabilities of an NDR link and cannot detect the microscopic bit errors that occur at 400Gb/s line rates. Furthermore, ping latency is often influenced by CPU scheduling, making it a poor metric for physical cable integrity.
D. Total power consumption measured by the BMC The BMC (Baseboard Management Controller) monitors the overall power draw of the server (PSUs, fans, GPUs). While a failed transceiver might slightly alter power consumption, it is not a diagnostic metric for link stability. Power metrics cannot distinguish between a perfectly functioning 400Gb/s link and one that is failing due to high BER.
Incorrect
Correct: C The Link Error Monitor (LEM) counters and Bit Error Rate (BER) values • The Technical Reason: High-speed networks like NDR use PAM4 modulation, which is highly sensitive to signal noise and physical impairments. ? BER (Bit Error Rate): This is the primary metric for signal quality. NVIDIA-certified links are expected to maintain an extremely low pre-FEC (Forward Error Correction) BER to ensure that the hardware can successfully correct any minor flips without dropping packets. ? LEM (Link Error Monitor): This hardware feature on NVIDIA Quantum-2 switches and ConnectX-7 adapters continuously monitors the link quality. If the BER exceeds a specific threshold, the LEM will flag the link as unstable or even “downshift“ it to protect data integrity. ? Validation Tool: Professionals use ibdiagnet –get_phy_info or mlxlink to pull these specific counters. High symbol errors or a BER approaching the 10^{-12} threshold typically indicate a dirty fiber connector or a faulty transceiver. • The NCP-AII Context: The exam validates that you can move beyond basic “link up/down“ status. You must be able to verify signal quality to ensure that GPUDirect RDMA can operate at peak efficiency without retransmissions caused by physical layer noise.
Incorrect: A. GPU temperature delta during CUDA execution While GPU temperatures are part of the System and Server Bring-up validation, they do not provide any information about the network‘s physical layer. A GPU can be thermally stable while the InfiniBand link it is connected to is dropping 10% of its packets due to a crimped fiber cable.
B. Standard Ethernet ping latency A “ping“ uses small ICMP packets and primarily tests the software stack‘s reachability. It does not stress the high-bandwidth capabilities of an NDR link and cannot detect the microscopic bit errors that occur at 400Gb/s line rates. Furthermore, ping latency is often influenced by CPU scheduling, making it a poor metric for physical cable integrity.
D. Total power consumption measured by the BMC The BMC (Baseboard Management Controller) monitors the overall power draw of the server (PSUs, fans, GPUs). While a failed transceiver might slightly alter power consumption, it is not a diagnostic metric for link stability. Power metrics cannot distinguish between a perfectly functioning 400Gb/s link and one that is failing due to high BER.
Unattempted
Correct: C The Link Error Monitor (LEM) counters and Bit Error Rate (BER) values • The Technical Reason: High-speed networks like NDR use PAM4 modulation, which is highly sensitive to signal noise and physical impairments. ? BER (Bit Error Rate): This is the primary metric for signal quality. NVIDIA-certified links are expected to maintain an extremely low pre-FEC (Forward Error Correction) BER to ensure that the hardware can successfully correct any minor flips without dropping packets. ? LEM (Link Error Monitor): This hardware feature on NVIDIA Quantum-2 switches and ConnectX-7 adapters continuously monitors the link quality. If the BER exceeds a specific threshold, the LEM will flag the link as unstable or even “downshift“ it to protect data integrity. ? Validation Tool: Professionals use ibdiagnet –get_phy_info or mlxlink to pull these specific counters. High symbol errors or a BER approaching the 10^{-12} threshold typically indicate a dirty fiber connector or a faulty transceiver. • The NCP-AII Context: The exam validates that you can move beyond basic “link up/down“ status. You must be able to verify signal quality to ensure that GPUDirect RDMA can operate at peak efficiency without retransmissions caused by physical layer noise.
Incorrect: A. GPU temperature delta during CUDA execution
While GPU temperatures are part of the System and Server Bring-up validation, they do not provide any information about the network's physical layer. A GPU can be thermally stable while the InfiniBand link it is connected to is dropping 10% of its packets due to a crimped fiber cable.
B. Standard Ethernet ping latency
A "ping" uses small ICMP packets and primarily tests the software stack's reachability. It does not stress the high-bandwidth capabilities of an NDR link and cannot detect the microscopic bit errors that occur at 400Gb/s line rates. Furthermore, ping latency is often influenced by CPU scheduling, making it a poor metric for physical cable integrity.
D. Total power consumption measured by the BMC
The BMC (Baseboard Management Controller) monitors the overall power draw of the server (PSUs, fans, GPUs). While a failed transceiver might slightly alter power consumption, it is not a diagnostic metric for link stability. Power metrics cannot distinguish between a perfectly functioning 400Gb/s link and one that is failing due to high BER.
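The acceptance logic described above can be sketched in a few lines. This is a minimal, hypothetical example: the function name and the exact 10^-12 ceiling are illustrative, and in practice the raw BER and symbol-error counts would come from tools such as mlxlink or ibdiagnet rather than hard-coded values.

```python
# Hypothetical sketch: deciding whether a link's pre-FEC BER is acceptable.
# In a real workflow these values are parsed from mlxlink / ibdiagnet output.

PRE_FEC_BER_THRESHOLD = 1e-12  # illustrative acceptance ceiling for an NDR link


def link_is_healthy(raw_ber: float, symbol_errors: int) -> bool:
    """Return True if the physical layer looks clean enough for RDMA traffic."""
    return raw_ber < PRE_FEC_BER_THRESHOLD and symbol_errors == 0


# A clean link versus one with a dirty connector or failing transceiver:
print(link_is_healthy(3.2e-15, 0))    # True
print(link_is_healthy(8.7e-10, 412))  # False
```

The key point the sketch captures is that a link can report "up" while still failing this check: link state and signal quality are separate questions.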
Question 60 of 60
60. Question
During the installation of Base Command Manager (BCM) in an AI cluster, an administrator must configure High Availability (HA) for the head node. What is the primary reason for configuring HA in the BCM control plane, and what is a critical requirement for the synchronization between the primary and secondary head nodes?
Correct
Correct: B. HA ensures cluster management continuity if the primary node fails; it requires shared storage or a synchronized database for the cluster state.
The Technical Reason: The "Control Plane" of an AI cluster (managed by BCM's cmdaemon) is the brain of the operation. It handles job scheduling, health monitoring, and node provisioning.
Continuity: If a single head node fails without HA, the entire cluster becomes unmanaged: new jobs cannot start, and health monitoring ceases.
Synchronization: For High Availability (HA) to work, the secondary head node must be a "mirror" of the primary. This requires the cluster database (which stores the state of every node, job, and configuration) to be synchronized in real time. This is typically achieved using DRBD (Distributed Replicated Block Device) to sync the underlying file systems, or a shared storage backend, so that if the secondary takes over, it has an identical view of the cluster state.
The NCP-AII Context: The exam blueprint specifically lists "Install Base Command Manager (BCM), configure and verify HA" as a key objective. Professionals must understand that HA is about redundancy and state consistency, not performance scaling.
Incorrect Options: A. HA is required to enable GPU overclocking
Overclocking is a hardware-level tuning task (usually discouraged in enterprise AI clusters to maintain stability) and has no relation to head node High Availability. Furthermore, for HA to function correctly in BCM, it is a strict requirement that the nodes have identical operating system versions and BCM software versions to prevent synchronization conflicts.
C. HA allows the cluster to bypass the InfiniBand fabric
The InfiniBand fabric is the "Data Plane" used for high-speed GPU-to-GPU communication. The BCM head nodes manage the "Control Plane." Having a redundant head node does not replace the need for a high-performance compute fabric; the two serve completely different purposes in the AI Factory architecture.
D. HA is used to double compute power via serial cable
HA is a failover mechanism (Active-Passive), not a load-balancing mechanism (Active-Active) designed to double compute power. While the nodes must be synchronized, they use high-speed Ethernet (often via a dedicated heartbeat link or the management network), not legacy serial cables, to handle the large amount of data required for modern cluster state synchronization.
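The Active-Passive behavior described above can be reduced to a simple rule: the secondary promotes itself only when the primary's heartbeat has gone silent for longer than a timeout. The sketch below is purely illustrative; the function name and the timeout value are assumptions, not BCM defaults or BCM APIs.

```python
# Minimal sketch of Active-Passive failover logic (illustrative, not BCM code).
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before the secondary takes over


def should_promote_secondary(last_heartbeat: float, now: float) -> bool:
    """Secondary promotes itself only when the primary's heartbeat is stale."""
    return (now - last_heartbeat) > HEARTBEAT_TIMEOUT


now = time.time()
print(should_promote_secondary(now - 2.0, now))   # False: primary is alive
print(should_promote_secondary(now - 30.0, now))  # True: primary presumed dead
```

Note what the sketch deliberately omits: promotion is only safe because the state (the cluster database) is already replicated, which is exactly why DRBD or shared storage is a hard requirement rather than an optimization.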