NVIDIA NCP-AII Practice Test 1
Question 1 of 60
1. Question
An infrastructure architect is deploying an NVIDIA BlueField-3 DPU (Data Processing Unit) to manage the network control plane for a cluster. To achieve optimal performance and offload networking tasks effectively, the administrator must configure the DPU's internal operation mode. Which mode is specifically designed to allow the DPU to run a separate OS and manage network traffic independently from the host CPU?
Correct: C DPU Mode (also known as Embedded Function mode), where the internal Arm cores run a Linux-based OS to manage offloads and control plane functions.
The Technical Reason: In DPU Mode (the default state for BlueField DPUs), the DPU acts as an autonomous "computer-in-front-of-a-computer." The onboard Arm Cortex-A78 cores boot their own dedicated operating system (typically Ubuntu or CentOS-based). In this mode, the DPU owns the network resources; the host CPU sees a "Representor" of the network, but the DPU's Arm OS manages the actual eSwitch, security policies, and storage offloads (like NVMe-oF) independently.
The NCP-AII Context: This mode is the prerequisite for using NVIDIA DOCA (Data Center Infrastructure-on-a-Chip Architecture). To pass the certification, you must know that DPU Mode enables Isolation, allowing infrastructure services to run on the DPU while the host CPU remains dedicated to AI training or inference.
Incorrect: A. Bridge Mode "Bridge Mode" is not a standard operation mode for the BlueField DPU in the context of the NCP-AII curriculum. While a DPU can perform bridging functions via Open vSwitch (OVS), the term "Bridge Mode" does not describe the hardware state required to run a separate internal OS or manage the control plane independently.
B. NIC Mode (Network Interface Card mode) In NIC Mode, the BlueField-3 behaves like a standard ConnectX-7 adapter. Crucially, in this mode on BlueField-3, the internal Arm cores are inactive and the internal OS does not boot. This mode is used when a customer only needs high-speed 400Gb/s connectivity and does not require the programmability or offload capabilities of the DPU. It saves power but removes the "independent management" benefit.
D. Legacy Mode There is no "Legacy Mode" in the BlueField-3 architecture designed for PCIe Gen2 compatibility. BlueField-3 is a PCIe Gen 5.0 device. While it is backward compatible with older PCIe slots through standard bus negotiation, there is no specific "Legacy Mode" setting used during deployment to manage the software control plane.
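As a practical illustration, the operation mode is typically queried (and changed) from the host with the mlxconfig utility from the Mellanox Firmware Tools. This is a minimal sketch only; the MST device path and parameter names vary by BlueField generation and firmware release, so treat them as placeholders and confirm against the BlueField documentation.

    # Sketch: query the current BlueField operation mode from the host
    sudo mst start                                              # load the MST access modules
    sudo mlxconfig -d /dev/mst/mt41692_pciconf0 query | grep -i INTERNAL_CPU
    # A value such as INTERNAL_CPU_MODEL = EMBEDDED_CPU(1) indicates DPU mode;
    # changing modes is done with 'mlxconfig ... set' followed by a cold reboot.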
Question 2 of 60
2. Question
An infrastructure manager is deploying BlueField network platforms across a cluster to implement a Zero Trust security model. Which feature of the BlueField platform facilitates the isolation of the management plane from the tenant data plane?
Correct: C The ability to run a separate Linux distribution on the DPU ARM cores that is independent of the host operating system.
The Technical Reason: The BlueField DPU is essentially a "computer-in-front-of-a-computer." It features its own dedicated ARM processor cores, memory, and a localized operating system (usually a specialized Ubuntu or CentOS-based BlueField image).
Isolation Mechanism: By running management agents, firewalls, and telemetry services on the DPU's ARM cores, the Management Plane is physically and logically separated from the Host (Tenant) Data Plane. Even if a tenant gains root access to the host OS, they cannot access the DPU's OS, which controls the network hardware and security policies.
The NCP-AII Context: This "Air-Gapped" management approach is the fundamental requirement for implementing a Zero Trust architecture in multi-tenant AI clusters, ensuring that the infrastructure control remains secure regardless of the state of the compute nodes.
Incorrect: A. The implementation of a Cooling Control Loop While DPUs have thermal sensors and can throttle performance if they overheat, there is no "Cooling Control Loop" designed to shut down hardware based on "unauthorized network traffic." Security enforcement is handled via DOCA Flow or OVS (Open vSwitch) acceleration within the DPU, not through the cooling system.
B. The use of NVLink cables to encrypt data to local storage This is a technical mismatch. NVLink is a high-speed interconnect used strictly for GPU-to-GPU or GPU-to-CPU communication; it is not used for connecting DPUs to local storage drives. Furthermore, storage encryption is typically handled via OPAL-compliant drives or software-defined encryption, not via the "cables" themselves.
D. The integration of a physical hardware switch to disconnect GPUs NVIDIA infrastructure does not use "kill switches" that physically disconnect GPU hardware upon a breach. Instead, the BlueField DPU uses Micro-segmentation and Virtual Private Clouds (VPC) to logically isolate a compromised node from the rest of the fabric, preventing "lateral movement" of an attacker while keeping the hardware physically connected for forensics or recovery.
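As a hedged sketch of what this isolation looks like in practice, administrators typically reach the DPU's Arm OS over its own management path (for example, the virtual network interface exposed by the RShim driver) rather than through the tenant host OS. The address, user, and service names below are common defaults and are illustrative only.

    # From the host or a management station, log in to the DPU's independent Arm OS
    ssh ubuntu@192.168.100.2        # default address on the rshim tmfifo_net0 interface
    # Inside the DPU OS, infrastructure services run out of the tenant's reach:
    ovs-vsctl show                  # eSwitch/OVS bridges owned by the Arm cores
    systemctl list-units 'doca*'    # illustrative: DOCA service names vary by release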
Question 3 of 60
3. Question
When performing the initial bring-up of an NVIDIA HGX H100 system within a large-scale AI factory, a technician must validate the power and cooling parameters before executing heavy workloads. If the BMC reports that the power supply units are functioning in a non-redundant state despite being fully populated, which sequence of validation steps is most appropriate to ensure the system meets the high-density power requirements for GPU-based servers?
Correct: A Verify the input voltage at the PDU, check the BMC power policy settings for N+N redundancy, and validate that all C19/C20 power cables are properly seated.
The Technical Reason: NVIDIA HGX H100 systems have massive power draws (often exceeding 10kW per server under peak load).
Input Voltage: High-density AI servers typically require 208V–240V or higher. If the PDU is providing insufficient voltage, the PSUs may not provide full output, causing the BMC to flag a lack of redundancy.
BMC Power Policy: The Baseboard Management Controller (BMC) must be explicitly configured for the desired redundancy mode (e.g., N+N or N+1). If set incorrectly, it may report an error even if the hardware is healthy.
Physical Seating: C19/C20 connectors are the industry standard for high-current draw; a loose connection can prevent a PSU from being recognized or delivering its full power budget.
The NCP-AII Context: This sequence follows the standard "Physical to Logical" troubleshooting methodology taught for NVIDIA data center infrastructure bring-up.
Incorrect: B. Disable the Out-of-Band management controller and overclock fans The Flaw: Disabling the Out-of-Band (OOB) management (the BMC) would leave you "blind" to the system's health, which is dangerous for high-density servers. Furthermore, "overclocking" fans via nvidia-smi (which is for GPU management, not system fans) is a cooling fix, not a power redundancy fix. It does nothing to solve why the PSUs are in a non-redundant state.
C. Immediately perform a firmware downgrade on the GPU Baseboard The Flaw: Firmware downgrades are high-risk actions and are almost never the first step in power troubleshooting. This option focuses on the "GPU Baseboard" sensors, but PSU redundancy is managed by the System BMC. Re-enabling a TPM (Trusted Platform Module) is a security task and has zero relationship with the electrical redundancy of the power supply units.
D. Replace existing transceivers with active optical cables (AOCs) The Flaw: This is a technical mismatch. Transceivers and AOCs are Networking components. While EMI (Electromagnetic Interference) is a theoretical concern in data centers, it does not cause a "non-redundant" power state. This option attempts to solve a Power/Electrical problem with a Networking/Fiber optic solution.
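A minimal sketch of this validation sequence from a management station, assuming standard IPMI and Redfish access to the BMC (credentials, sensor names, and the Redfish path are illustrative and vary by vendor):

    # 1. Confirm PSU presence/status and input voltage as reported by the BMC
    ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sdr type "Power Supply"
    ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sensor | grep -i -E 'volt|psu'
    # 2. Inspect the chassis power and redundancy state via Redfish (exact path is vendor-dependent)
    curl -k -u <user>:<password> https://<bmc-ip>/redfish/v1/Chassis/1/Power
    # 3. If a PSU shows as absent or failed, reseat its C19/C20 cable and recheck the readings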
Question 4 of 60
4. Question
When designing the network topology for a large-scale AI factory using NVIDIA Quantum-2 InfiniBand switches, an engineer must ensure non-blocking communication for collective operations. If the server-to-leaf ratio is 1:1 and the leaf-to-spine ratio is also 1:1, but the engineer accidentally uses an incorrect transceiver type that limits the port speed to 100Gbps instead of the expected 400Gbps, what is the primary impact on the system validation phase?
Correct: C The effective bandwidth for E/W (East-West) traffic will be reduced by seventy-five percent, causing the NCCL collective performance to drop significantly below the expected baseline.
The Technical Reason: NVIDIA Quantum-2 InfiniBand is designed for 400Gbps (NDR) per port. If a transceiver limits the speed to 100Gbps (EDR), the link is operating at only 25% of its rated capacity.
The Impact: In an AI Factory, "East-West" traffic refers to the communication between GPU nodes (GPU-to-GPU across the fabric). NCCL (NVIDIA Collective Communications Library) is the standard library used for these operations (All-Reduce, All-Gather). Because NCCL performance is directly tied to the underlying fabric bandwidth, a 75% reduction in link speed will cause NCCL benchmarks to fail their performance validation baselines.
The NCP-AII Context: System validation involves comparing actual results against a "Golden Baseline." A 100G link in a 400G design is a critical configuration error that prevents the cluster from reaching its "Non-Blocking" performance goals.
Incorrect: A. The system will fail the HPL test due to a memory overflow error HPL (High-Performance Linpack) is primarily a test of floating-point computational power. While a slow network will make a multi-node HPL run take much longer, it does not cause a "memory overflow." Memory overflow is typically a software/coding error or a result of trying to fit a dataset into a GPU's VRAM that is too large; it is not triggered by a transceiver speed mismatch.
B. NVIDIA SMI will report a hardware fault on the GPU baseboard nvidia-smi monitors the GPUs, their temperatures, and their internal NVLink status. It does not monitor the external InfiniBand transceiver speeds directly as a "GPU hardware fault." A transceiver speed issue is a Networking (InfiniBand) issue, not a Compute (GPU Baseboard) issue. The driver would still load correctly, though performance would be degraded.
D. The BMC will automatically shut down the server to protect transceivers This is factually incorrect. 400Gbps transceivers (NDR) actually draw more power and generate more heat than 100Gbps (EDR) transceivers. Operating at a lower speed (100G) would reduce the power draw, not increase it. Furthermore, a speed mismatch is not a "critical thermal event" that would trigger an emergency BMC shutdown.
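As an illustrative validation step, the negotiated rate on every HCA port can be read and compared against the expected NDR baseline before rerunning the collective benchmark. The commands assume the standard InfiniBand userspace tools and the nccl-tests suite are installed; flags and expected values are examples, not a prescribed procedure.

    # Expect 'Rate: 400' on every NDR port; a 100 Gb/s transceiver will show 'Rate: 100'
    ibstat | grep -E 'State|Rate'
    # After swapping the transceiver, re-baseline the collective bandwidth with nccl-tests
    ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8    # compare busbw against the golden baseline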
Question 5 of 60
5. Question
During the firmware management phase of an NVIDIA HGX platform deployment, a technician notices a discrepancy between the GPU firmware and NVSwitch firmware versions. To ensure stable hardware operation for workloads and to prevent fault-detection errors, which tool should be used to perform a synchronized firmware upgrade across all components, including the BlueField DPU and the HGX baseboard?
Correct: C The NVIDIA Firmware Update (nvfwupd) tool or the NVIDIA Cluster Management utility should be utilized to orchestrate the updates to ensure compatibility between the GPUs and NVSwitches.
The Technical Reason: Modern HGX systems (like the H100 and B200) require "firmware recipes"—specific sets of versions for the GPU, NVSwitch, BMC, and SBIOS that have been validated to work together. The nvfwupd tool is the primary NVIDIA utility designed to parse these PLDM (Platform Level Data Model) firmware packages. It can update the entire "Motherboard Tray" in a single orchestration, ensuring that the high-speed NVLink fabric doesn't encounter fault detection errors due to version mismatches.
The NCP-AII Context: In larger deployments, NVIDIA Base Command Manager (BCM) uses these same underlying mechanisms to automate "cluster-wide" firmware flashes. This prevents "firmware drift" across hundreds of nodes, which is a common cause of NCCL collective failures in an AI Factory.
Incorrect: A. Using the NVIDIA Container Toolkit to deploy a sidecar container The NVIDIA Container Toolkit is used to expose GPU hardware to containers (libraries, binaries, etc.). It is a User-Space tool and does not have the low-level hardware access required to flash the firmware of the NVSwitch or the HGX baseboard. Firmware updates must occur at the System/Kernel or Out-of-Band (OOB) level.
B. Booting into a specialized DOS environment This is a legacy approach. Modern NVIDIA DGX and HGX systems are designed for "In-Band" updates via Linux (using nvfwupd) or "Out-of-Band" updates via the BMC Redfish API. Standard operating procedures for AI infrastructure prioritize avoiding downtime-heavy DOS-based utilities for sequential flashing.
D. Using standard Linux package managers (apt/yum) Standard package managers are used to update drivers and software libraries (like CUDA or the NVIDIA Driver). While some drivers can trigger a "pending" GPU firmware update, they do not automatically orchestrate the complex, synchronized flash required for the NVSwitches, the Baseboard CPLDs, or the BlueField DPU. These require specialized tools like nvfwupd or mlxfwmanager.
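A hedged example of checking component versions in-band before orchestrating a synchronized update. The first two commands are standard; the nvfwupd invocation at the end is illustrative only, since the exact arguments depend on the platform and the PLDM bundle, and should be taken from the firmware release notes.

    # Compare installed versions against the validated firmware recipe
    nvidia-smi -q | grep -i vbios            # GPU VBIOS versions
    mlxfwmanager --query                     # ConnectX / BlueField NIC firmware
    # Illustrative only: apply a validated PLDM bundle out-of-band through the BMC
    nvfwupd -t ip=<bmc-ip> user=<user> password=<password> update_fw --pkg <hgx_fw_bundle>.fwpkg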
Question 6 of 60
6. Question
After installing a new set of InfiniBand cables in a liquid-cooled AI cluster, the administrator runs an NCCL burn-in test. They notice that while most nodes pass, one specific link consistently fails after 15 minutes of heavy load. What is the most likely cause that should be investigated during the 'Cluster Test and Verification' phase?
Correct: B A thermal issue where the transceiver is overheating under load due to poor airflow or a faulty cooling loop in that rack section.
The Technical Reason: High-speed transceivers (OSFP/QSFP) for InfiniBand NDR (400G) generate significant heat. In a liquid-cooled cluster, while the GPUs and CPUs are managed by cold plates, the networking components often still rely on forced air or specialized cooling manifolds. A link that works initially but fails consistently after 15 minutes of "heavy load" (like an NCCL burn-in) is a classic symptom of thermal saturation. Once the transceiver exceeds its operating temperature, it may throttle or shut down the link to prevent hardware damage.
The NCP-AII Context: This scenario tests your ability to distinguish between static connectivity (cables plugged in correctly) and dynamic stability (cables surviving peak power/heat). Validating "Power and Cooling" parameters is a key domain of the exam.
Incorrect: A. The NGC CLI is using an outdated API key The NGC CLI is used for pulling container images or interacting with the NVIDIA GPU Cloud registry. It is not involved in the actual execution or "heartbeat" of an NCCL test once the container has started. An API key issue would prevent the download of the test tool, not cause a running network link to fail 15 minutes into a hardware stress test.
C. The Slurm scheduler is assigning too many MPI ranks If Slurm assigned too many ranks, the job would typically fail at launch due to insufficient resources (OOM – Out of Memory) or an immediate configuration error. It would not cause a specific physical network link to "fail" after a consistent 15-minute delay. Furthermore, "memory overflow in the BCM database" is a logical impossibility for a network link failure; BCM (Base Command Manager) manages the nodes, but doesn't store the live memory of a running MPI rank.
D. The TPM is locking the PCIe bus The TPM (Trusted Platform Module) is used for secure boot and attestation. While it can prevent a system from booting with unauthorized firmware, it does not monitor individual network packets for "unauthorized" data during a benchmark. Network security is handled by firewalls or DPU-based isolation (DOCA), and a TPM would never "lock the bus" mid-test based on network traffic content.
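As a hedged troubleshooting sketch, the module (transceiver) temperature and physical-layer counters on the suspect port can be sampled while the burn-in is running; the MST device path, port number, and flags below are illustrative and depend on the installed MFT version.

    # Sample the suspect port while the NCCL burn-in is under way
    sudo mlxlink -d /dev/mst/mt4129_pciconf0 -p 1 -m -c
    #   -m  prints module (transceiver) information, including its temperature
    #   -c  prints physical-layer counters; rising symbol/FEC errors indicate a marginal link
    # A module temperature that climbs toward its alarm threshold only under load matches
    # the observed 15-minute failure pattern and points to airflow or cooling-loop issues.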
Question 7 of 60
7. Question
A system administrator is installing the Slurm workload manager on an AI cluster. To support modern AI workflows, they also need to integrate Enroot and Pyxis. What is the primary role of the Pyxis plugin in this specific software stack?
Correct: B Pyxis is a Slurm SPANK plugin that allows users to seamlessly run containerized workloads using the 'srun' command without needing Docker daemon access.
The Technical Reason: Pyxis is a Slurm SPANK (Slurm Plug-in Architecture for Node and job (K)ontrol) plugin developed by NVIDIA. Its primary function is to extend the Slurm command-line interface, enabling flags like --container-image and --container-mounts. It acts as the "glue" that allows Slurm to communicate with Enroot (the container runtime), automating the process of pulling an image from a registry (like NGC), converting it to a usable format, and executing it within a job allocation.
The NCP-AII Context: This stack is critical because it provides a "Docker-like" experience without the security risks of giving every user root-level access to a Docker daemon. It is the standard orchestration method for NVIDIA DGX SuperPOD and Base Command Manager deployments.
Incorrect:
A. Pyxis is a distributed database that stores training weights This is a distractor. Storing training weights is a function of the Application Layer (using checkpoints in frameworks like PyTorch or TensorFlow) and the Storage Layer (using parallel file systems like Lustre or Weka). Pyxis is a job scheduling plugin, not a database for model data.
C. Pyxis is the primary operating system used by the head node Pyxis is a software plugin, not an operating system. The head node of an NVIDIA AI cluster typically runs Base Command Manager OS (which is RHEL or Ubuntu-based) or a standard Linux distribution. While Pyxis is installed on the OS to enhance Slurm, it does not manage power distribution units (PDUs).
D. Pyxis is a hardware validation tool for InfiniBand cables This confuses Pyxis with tools like NVIDIA ClusterKit, CVT (Cable Validation Tool), or mlxlink. While validating the fabric is a key part of the NCP-AII certification, Pyxis operates at the Software/Orchestration layer and does not perform physical-layer signal quality checks.
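A minimal usage sketch of the flags Pyxis adds to srun; the container tag, mount paths, and script name are placeholders.

    # Run a containerized job through Slurm; Pyxis pulls the image and Enroot executes it
    srun --container-image=nvcr.io#nvidia/pytorch:24.05-py3 \
         --container-mounts=/raid/datasets:/data \
         --gpus-per-node=8 \
         python train.py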
Question 8 of 60
8. Question
When configuring the cluster interfaces in Base Command Manager, the administrator must define the category and network settings for the compute nodes. Why is it important to correctly configure the 'category' in BCM, and how does it affect the installation of software like Slurm, Enroot, and Pyxis?
Correct: B The category acts as a template that defines the software packages, kernel parameters, and configuration files that are applied to a group of nodes.
The Technical Reason: BCM uses a hierarchical configuration model. A Category is a logical grouping that allows administrators to manage many nodes as a single entity. It points to a specific Software Image (the OS and installed binaries) and contains Configuration Overlays (files like /etc/slurm/slurm.conf or kernel tweaks for InfiniBand).
Impact on Slurm/Enroot/Pyxis: In an AI factory, you typically have different types of nodes (e.g., "Login Nodes" vs. "GPU Compute Nodes"). By assigning a node to a "GPU_Compute" category, BCM automatically:
Provisions the node with the image containing Enroot and Pyxis.
Pushes the correct Slurm configuration that identifies the node as having 8 GPUs and the Pyxis SPANK plugin enabled.
Applies the necessary Kernel Parameters (like hugepages or pci=realloc) required for high-performance GPU peer-to-peer communication.
Incorrect: A. The category is a billing tag for electricity costs While BCM provides monitoring data that could be exported to a billing system, "Category" is a fundamental functional unit of cluster management. It is used for provisioning and configuration, not just for passive tagging or financial reporting. It has a massive impact on whether software like Slurm or the GPU drivers are even present on the node.
C. The category determines maximum CPU frequency and overclocks processors CPU frequency management and overclocking are typically handled at the BIOS/UEFI level or via low-level Linux power profiles (e.g., cpupower). While a BCM category could run a script to set a power governor, its primary role is much broader—software orchestration and configuration—rather than being a dedicated overclocking tool.
D. The category defines the physical color of the server chassis This is a distractor. Physical attributes like chassis color have no bearing on the logical management of an AI cluster. BCM tracks physical location via "Rack" and "Chassis" objects for mapping, but "Category" remains the logical template for the operating system and software stack.
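For illustration, assigning a node to a category and reviewing what that category applies is typically done from cmsh; the node and category names below are examples only.

    # Assign a compute node to the category carrying the GPU software image and Slurm overlay
    cmsh -c "device; use node001; set category gpu-compute; commit"
    # Review the software image and settings the category applies
    cmsh -c "category; use gpu-compute; show"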
Question 9 of 60
9. Question
When installing NVIDIA Base Command Manager (BCM) for a new AI cluster, the administrator needs to ensure high availability (HA) for the head node. Which configuration step is essential within BCM to verify that the secondary head node can take over the management of the Slurm scheduler and the provisioning interfaces in the event of a primary node failure?
Correct
Correct: B Configure the ‘failover‘ attribute in the node object, synchronize the /cm/shared directory, and verify the status using the ‘cmsh -c “ha; status“‘ command.
The Technical Reason: In NVIDIA Base Command Manager (BCM), HA is achieved by designating a secondary head node that monitors the primary via a heartbeat.
Failover Attribute: The secondary node must be explicitly defined in the BCM database with the failover property pointing to the primary.
Shared Directory: BCM relies on /cm/shared (often backed by a shared disk or DRBD/RSYNC) to keep the cluster database, Slurm configurations, and software images synchronized between both nodes.
Verification: The cmsh command ha; status is the standard administrative tool to check the health of the heartbeat, the synchronization state, and which node currently holds the “Active“ role.
The NCP-AII Context: The exam tests your ability to use the Cluster Management Shell (cmsh) to manage the lifecycle of the head nodes. Verifying HA status is a mandatory “Day 0“ task in the certification blueprint.
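A minimal verification sketch, using only the command referenced above (run on the active head node):
    # one-shot form
    cmsh -c "ha; status"
    # equivalent interactive form: enter ha mode, then query the status
    cmsh
    % ha
    % status
The output should identify which head node currently holds the active role, the health of the heartbeat to the secondary, and the synchronization state of the shared data.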
Incorrect: A. Install a third-party load balancer and point to a Docker socket This is a distractor. BCM HA is a built-in feature that uses a Virtual IP (VIP) managed by the BCM software itself (via keepalived or similar internal mechanisms), not a third-party external load balancer. Furthermore, Slurm and BCM provisioning are system-level services, not Docker-containerized services that rely on a Docker socket for cluster-wide failover.
C. Disable the firewall to allow sharing the same MAC address Nodes in a BCM cluster do not share a MAC address; they share a Virtual IP (VIP). Disabling the firewall is a security risk and is not a configuration requirement for HA. Heartbeat communication happens over standard networking protocols (UDP/TCP), which should be explicitly allowed through the firewall, not bypassed by disabling it entirely.
D. Manually copy the /etc/passwd file using a cron job This is an outdated and manual process. In a BCM-managed cluster, user accounts, groups, and passwords are automatically synchronized across the head nodes and the entire cluster through the LDAP/Active Directory integration or the automated synchronization of the /cm/shared and configuration overlays. Manually managing cron jobs for core system files defeats the purpose of an automated cluster manager.
Question 10 of 60
10. Question
During production, a node in the AI cluster begins reporting ‘Uncorrectable ECC Errors‘ for one of its GPUs. What is the correct troubleshooting and remediation procedure for this hardware fault in an NVIDIA-certified environment?
Correct
Correct: B Record the GPU serial number and error details using ‘nvidia-smi -q‘, and if the errors persist after a reset, replace the faulty GPU hardware.
The Technical Reason: An Uncorrectable ECC (Error Correction Code) error indicates a hardware-level failure in which the memory's ECC bits can no longer recover the corrupted data. The driver reports it as an XID event and terminates the affected application rather than allow silent data corruption to propagate.
The Procedure: The first step is to use nvidia-smi -q (Query) to extract the GPU's unique serial number, UUID, and specific error counts. While a “Soft Reset“ or a full node reboot can sometimes clear transient errors, persistent Uncorrectable ECC errors are a definitive sign of failing HBM (High Bandwidth Memory). In an NVIDIA-certified environment (DGX/HGX), the standard remediation for persistent uncorrectable errors is a hardware replacement.
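A hedged example of this workflow (GPU index 0 is a placeholder; query field names can be confirmed with nvidia-smi --help-query-gpu):
    # dump the ECC mode and aggregate error counts for the suspect GPU
    nvidia-smi -q -d ECC -i 0
    # record the identifiers needed for the replacement/RMA ticket
    nvidia-smi --query-gpu=index,serial,uuid,ecc.errors.uncorrected.aggregate.total --format=csv -i 0
    # after draining workloads, attempt a GPU reset; errors that return afterwards point to failing hardware
    nvidia-smi --gpu-reset -i 0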
Incorrect: A. Switch the GPU into MIG mode to isolate the faulty memory While Multi-Instance GPU (MIG) provides isolation for workloads, it is not a “patch“ for broken hardware. An uncorrectable ECC error occurs at the physical memory layer; partitioning the GPU will not fix the underlying silicon defect. Furthermore, NVIDIA drivers may prevent a GPU with active uncorrectable errors from even entering a healthy MIG state.
C. Downgrade the DOCA drivers to ignore ECC interrupts This is a dangerous and incorrect approach.
Isolation: The BlueField DPU (via DOCA) manages networking and infrastructure, not the internal ECC reporting of the GPU memory.
Data Integrity: Ignoring ECC errors would allow the system to continue running with corrupted data, which would lead to “NaN“ (Not a Number) values during AI training, effectively ruining the model weights.
D. Increase the voltage using ‘ngc config‘ to ‘burn through‘ bits This is factually incorrect and physically destructive.
Tool Mismatch: The NGC CLI (ngc config) is a cloud registry tool for managing API keys and containers; it has no ability to control hardware voltages.
Hardware Risk: Increasing voltage to “burn through“ memory bits is not a valid engineering practice and would likely result in total hardware failure and the voiding of the system warranty.
Question 11 of 60
11. Question
In a scenario where an AI cluster is experiencing high latency during collective communications, an administrator suspects that the MIG configuration on the GPUs is improperly aligned with the network topology. What is the most effective way to verify the current MIG status and its impact on the hardware resources?
Correct
Correct: D Run the command ‘nvidia-smi mig -lgip‘ to list the GPU instance profiles and cross-reference them with the physical PCIe placement of the network cards.
The Technical Reason: To ensure the lowest latency in an AI cluster, traffic should follow a “Rail-Optimized“ path. This means a specific GPU instance should ideally communicate through a network card (HCA) that shares the same PCIe Root Complex or NVSwitch path.
The command nvidia-smi mig -lgip (List GPU Instance Profiles) provides a detailed breakdown of the current Multi-Instance GPU (MIG) geometry, including the placement of compute and memory slices.
By cross-referencing these logical instances with the physical PCIe topology (which can be viewed with nvidia-smi topo -m or lspci -tv), an administrator can verify if the MIG partitions are “aligned“ with the nearest high-speed network interfaces.
The NCP-AII Context: The certification emphasizes that misaligned MIG partitions can force data to cross the CPU socket or unnecessary PCIe switches, drastically increasing latency during NCCL collective operations.
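A hedged set of commands for this check (the MIG listings require MIG mode to be enabled on the GPU):
    # list the available GPU instance profiles and any instances that have been created
    nvidia-smi mig -lgip
    nvidia-smi mig -lgi
    # show the topology matrix, including which NIC and NUMA node each GPU is closest to
    nvidia-smi topo -m
    # raw PCIe tree for cross-checking root complex placement
    lspci -tv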
Incorrect: A. Download the NGC CLI and use ‘ngc config‘ to reset hardware registers The NGC CLI is a tool for interacting with the NVIDIA GPU Cloud registry (pulling containers, models, etc.). It does not have the capability to modify or reset low-level hardware registers on a local GPU. Hardware configuration and MIG management are handled locally via the NVIDIA driver and nvidia-smi.
B. Use ‘ibstatus‘ to check MIG InfiniBand beacons and enable SR-IOV for NVLink This option contains several technical inaccuracies:
ibstatus checks the state of the InfiniBand Host Channel Adapter (HCA), not the MIG status.
MIG partitions do not emit “InfiniBand beacons.“
SR-IOV (Single Root I/O Virtualization) is a networking technology used to virtualize PCIe devices; it is not a protocol for NVLink, which is a proprietary GPU-to-GPU interconnect that does not use SR-IOV for its basic operation.
C. Execute an HPL test and compare thermal output to fan specifications While an HPL (High-Performance Linpack) test validates the computational stability of a node, it is not an effective tool for identifying network topology alignment issues. Furthermore, thermal output is an indicator of cooling efficiency, not logical resource mapping between MIG partitions and network cards.
Question 12 of 60
12. Question
An administrator is configuring the Trusted Platform Module and Out-of-Band management for a new cluster of NVIDIA-Certified servers. What is the primary security benefit of enabling and initializing the TPM 2.0 module during the system bring-up phase, and how does it relate to the integrity of the AI infrastructure software stack?
Correct
Correct: A The TPM provides a hardware root of trust that allows the system to perform measured boots, ensuring that the bootloader and OS kernel have not been tampered with.
The Technical Reason: A Measured Boot uses the TPM to record “measurements“ (hashes) of each component in the boot process, from the UEFI firmware to the bootloader, the OS kernel, and even the NVIDIA drivers. These hashes are stored in the TPM's Platform Configuration Registers (PCRs). If any component is altered (e.g., by rootkits or unauthorized modifications), the measurements will not match the known “good“ state, and the system can be prevented from booting or accessing sensitive keys.
The NCP-AII Context: In a large-scale AI cluster, ensuring that every node is running a “Known Good“ software stack is critical. The TPM 2.0 module works alongside Secure Boot to provide the Hardware Root of Trust required for Zero Trust security models in NVIDIA-certified environments.
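As a hedged illustration (assuming the tpm2-tools package is installed on the node), an administrator can confirm the TPM is active and that measured-boot PCRs are being populated:
    # confirm the kernel sees a TPM 2.0 device
    dmesg | grep -i tpm
    ls /dev/tpm*
    # read the PCRs that typically hold firmware, bootloader, and secure boot measurements
    tpm2_pcrread sha256:0,1,2,3,4,7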
Incorrect: B. The TPM is required to bypass the license check for NVIDIA AI Enterprise This is factually incorrect. NVIDIA AI Enterprise (NAIE) licensing is typically managed via the NVIDIA License System (NLS), which can be hosted locally as a Delegated License Service (DLS) for air-gapped environments or via a cloud-based Cloud License Service (CLS). The TPM is a security and integrity component, not a license-bypass mechanism.
C. The TPM acts as a high-speed cache for GPU kernels This is a technical mismatch. The TPM is a low-speed, secure microcontroller designed for cryptographic operations and secure storage of small secrets (keys, hashes). It does not have the bandwidth or architecture to cache GPU kernels or math functions. GPU kernels are cached in the System RAM or the GPU‘s internal L2/L3 cache.
D. Enabling the TPM automatically encrypts data on third-party storage arrays While a TPM can be used to store encryption keys for local drives (like using BitLocker or LUKS), it does not automatically manage or encrypt data on external, third-party storage arrays (like Lustre, Weka, or NetApp) via the BMC network. Key management for external storage is typically handled by dedicated Key Management Servers (KMS) using protocols like KMIP.
Question 13 of 60
13. Question
To optimize the performance of an AI workload on a server with AMD EPYC processors and NVIDIA GPUs, which specific configuration setting is often adjusted to ensure that the GPU and the local NIC have the fastest possible access to the same CPU memory zone?
Correct
Correct: D Enabling NUMA (Non-Uniform Memory Access) awareness in the application and OS.
The Technical Reason: Modern servers (particularly AMD EPYC with its Multi-Die design) are divided into NUMA nodes. A GPU and a Network Interface Card (NIC) are physically wired to a specific PCIe root complex associated with a specific NUMA node. If a GPU on NUMA Node 0 tries to access data stored in memory assigned to NUMA Node 1, it must cross the CPU interconnect (Infinity Fabric), which introduces latency and reduces bandwidth.
The Solution: By making the OS and the AI application (via libraries like NCCL and UCX) NUMA-aware, memory buffers are allocated on the “local“ NUMA node closest to the GPU and NIC. This keeps Direct Memory Access (DMA) transfers on the shortest physical path and is essential for GPUDirect RDMA performance.
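A hedged example of checking and enforcing this affinity (the NUMA node number and training script are placeholders):
    # show NUMA nodes and which CPUs/memory belong to each
    numactl --hardware
    # show GPU <-> NIC <-> NUMA affinity in one matrix
    nvidia-smi topo -m
    # pin a launcher or data-loading process to the NUMA node local to its GPU and NIC
    numactl --cpunodebind=0 --membind=0 python train.py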
Incorrect: A. Disabling the GPU‘s ECC memory to save power While disabling ECC might slightly increase available memory bandwidth and reduce power consumption, it is never recommended for production AI training. Uncorrected bit flips can lead to “silent data corruption,“ causing models to diverge or produce “NaN“ values. Furthermore, it does nothing to address the memory access latency between the CPU and the NIC.
B. Increasing the size of the Linux swap partition Swap space is used when physical RAM is exhausted. If an AI workload is hitting the swap partition (which resides on much slower NVMe or SSD storage), performance will drop by orders of magnitude. Increasing swap size does not optimize the high-speed data path between the GPU, NIC, and physical RAM.
C. Setting the GPU fans to run at a constant 10% speed This is factually incorrect and dangerous. Under AI training loads, GPUs generate massive amounts of heat. Restricting fans to 10% would lead to thermal throttling (where the GPU slows its clock speed to prevent melting) or an emergency system shutdown. It has no impact on the logical memory zoning or NUMA affinity.
Question 14 of 60
14. Question
A Linux administrator is installing the NVIDIA Container Toolkit on a fresh Ubuntu installation to support Docker-based AI training workloads. After successfully installing the package, what is the mandatory next step to ensure the Docker daemon can utilize the NVIDIA GPU runtime correctly?
Correct
Correct: A Edit the daemon.json file in the docker directory to set the default-runtime to nvidia and restart the Docker service.
The Technical Reason: Installing the nvidia-container-toolkit package provides the necessary binaries (like nvidia-container-runtime), but the Docker daemon does not automatically know how to use them. The administrator must register the NVIDIA runtime within the Docker configuration file (typically /etc/docker/daemon.json) and then restart the docker service to reload that configuration.
The NCP-AII Context: This step “hooks“ the NVIDIA hardware-specific libraries into the standard Docker workflow. Once configured, users can run containers with the --gpus all flag, and the Docker daemon will correctly invoke the NVIDIA runtime to map the GPU device nodes into the container namespace.
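A hedged sketch of the resulting /etc/docker/daemon.json (paths can vary by distribution; newer toolkit versions can generate the runtime entry with nvidia-ctk runtime configure --runtime=docker):
    {
      "default-runtime": "nvidia",
      "runtimes": {
        "nvidia": {
          "path": "nvidia-container-runtime",
          "runtimeArgs": []
        }
      }
    }
    # reload the daemon so the new runtime is registered
    sudo systemctl restart docker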
Incorrect: B. Recompile the Linux kernel with the CUDA-SUPPORT flag This is factually incorrect. NVIDIA drivers and the CUDA toolkit are designed to work as Kernel Modules (via DKMS) on top of standard Linux kernels (like the Generic Ubuntu kernel). There is no “CUDA-SUPPORT“ flag in the upstream Linux kernel source, and recompiling the kernel is not a requirement for any standard NVIDIA-certified AI infrastructure deployment.
C. Install the DOCA SDK and map the GPU via a virtual PCIe switch While the DOCA SDK is critical for programming the BlueField DPU, it is not a requirement for standard GPU-based Docker training on a Linux host. “Mapping a GPU via a virtual PCIe switch“ is a complex virtualization/MIG-vGPU concept that is not part of the standard, mandatory post-install routine for the NVIDIA Container Toolkit on a bare-metal Linux installation.
D. Run the nvidia-smi --factory-reset command The nvidia-smi --factory-reset command is used to return GPU settings (like power limits, clock offsets, or volatile ECC counts) to their original manufacturer defaults. While useful for troubleshooting a “dirty“ system, it is not a mandatory or even a common step for enabling the Docker daemon to utilize the GPU runtime.
Question 15 of 60
15. Question
During management of the physical layer in an AI factory, an engineer notices a high rate of Cyclic Redundancy Check errors on a BlueField-3 network link running at 400Gb/s. What is the most likely cause of these errors when using high-speed optical transceivers, and how should it be addressed according to professional standards?
Correct
Correct: D The optical fiber connectors are likely contaminated with dust or oil and should be cleaned with a specialized fiber cleaner and inspected with a scope.
The Technical Reason: Cyclic Redundancy Check (CRC) errors are a clear indicator of data corruption occurring at the physical layer (Layer 1). At 400Gb/s, the light signals are highly sensitive. A single speck of dust or a fingerprint smudge on an OSFP or QSFP112 connector causes light scattering and attenuation. This leads to bit flips that the BlueField-3 DPU detects as CRC mismatches.
Professional Standard: According to NVIDIA best practices for “AI Factory“ bring-up, the first response to link errors is the “Inspect, Clean, Inspect“ workflow using specialized fiber scopes and dry-cleaners (like IBC cleaners).
The NCP-AII Context: The exam tests your ability to troubleshoot the physical fabric. You are expected to know that software cannot fix a physical signal degradation caused by dirty optics.
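A hedged way to quantify the problem before and after cleaning (the interface name is a placeholder, and exact counter names vary by driver and firmware):
    # overall RX/TX error summary for the port
    ip -s link show dev ens1f0np0
    # driver-level counters; on mlx5-based devices CRC errors typically appear as rx_crc_errors_phy
    ethtool -S ens1f0np0 | grep -i -E "crc|err"
If the counters stop incrementing after the connectors are cleaned and re-inspected, the physical layer was the culprit.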
Incorrect: A. Update the Linux kernel to fix the CRC algorithm CRC calculations for network packets are performed in the hardware logic of the BlueField-3 DPU and the network switch, not by the Linux kernel‘s mathematical functions. While keeping the kernel updated is a general best practice, it will not resolve physical signal degradation or hardware-level bit errors.
B. Slurm sending too many jobs causing the NIC to overheat While heavy congestion can lead to dropped packets (buffer overflows), it does not cause CRC errors. CRC errors imply the packet arrived but was malformed. Furthermore, modern DPUs like the BlueField-3 are designed to handle line-rate traffic (400Gb/s) without “miscalculating“ checksums due to load; if a DPU overheats, it will throttle performance or shut down, not introduce mathematical errors into the packet stream.
C. GPU clock speed causing EMI inside the glass fiber This is a common distractor. Optical fibers use light (photons), which are immune to Electromagnetic Interference (EMI). While EMI can affect copper cables (like DACs), it cannot disrupt the signal inside a glass fiber. High GPU clock speeds might affect nearby unshielded copper management cables, but they have no impact on the 400Gb/s optical fabric.
Question 16 of 60
16. Question
An administrator is setting up a cluster and needs to install the Slurm workload manager along with Enroot and Pyxis. How do these three components interact to enable a user to run a containerized AI training job across multiple nodes using a standard ‘srun‘ command?
Correct
Correct: C Slurm manages the scheduling, Pyxis acts as a Slurm plugin to handle container integration, and Enroot serves as the runtime to execute the containers.
The Technical Reason: This represents the modern “NVIDIA-native“ way to run containers without the overhead or security complexities of a full Docker daemon on every compute node.
Slurm: The orchestrator. It decides which nodes are available and allocates the resources (CPUs, GPUs, Memory).
Pyxis: A Slurm SPANK plugin. It allows the srun command to recognize container-specific flags (like --container-image). It bridges the gap between the scheduler and the container runtime.
Enroot: A simple, unprivileged tool that turns container images into unprivileged sandboxes. It is the “runtime“ that actually executes the code inside the container.
The NCP-AII Context: This stack is the recommended configuration for NVIDIA DGX SuperPOD and Base Command architectures because it allows users to run containerized jobs as easily as standard Linux binaries using the familiar srun command.
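A hedged example of what this looks like to the end user (the image reference, mount path, and training script are placeholders):
    # 2 nodes x 8 GPUs, image pulled from NGC and executed by Enroot via the Pyxis plugin
    srun --nodes=2 --ntasks-per-node=8 --gpus-per-node=8 \
         --container-image=nvcr.io#nvidia/pytorch:24.01-py3 \
         --container-mounts=/datasets:/datasets \
         python train.py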
Incorrect: A. They do not interact; users must manually log into each node This contradicts the purpose of a Workload Manager (WLM). The goal of Slurm is to automate the distribution of tasks. Manually logging into nodes to start containers (like Docker) is inefficient, prone to error, and does not scale in an AI Factory environment.
B. Pyxis is the OS, Slurm is the driver, and Enroot is the hardware tool This is a fundamental misclassification of the software layers:
OS: Typically Ubuntu or Base Command OS (RHEL-based).
GPU Driver: The NVIDIA Datacenter Driver.
Hardware Validation: Usually handled by NVIDIA ClusterKit or the Diagnostic (NVVS) tools. The options listed (Pyxis/Slurm/Enroot) are all User-Space Software components for job management.
D. Enroot schedules the jobs, Slurm provides the images, and Pyxis encrypts traffic This flips the roles of the components entirely:
Enroot does not schedule; it executes.
Slurm does not provide images; images are pulled from a registry like NVIDIA NGC.
Pyxis does not encrypt traffic; network encryption is typically handled at the hardware/fabric level or via specialized DOCA/BlueField services.
Question 17 of 60
17. Question
An administrator is setting up the software stack on an AI cluster and needs to enable GPU support for containerized workloads. After installing the NVIDIA GPU drivers and the Docker engine, which component must be installed and configured to allow Docker containers to access the underlying GPU hardware?
Correct
Correct: A NVIDIA Container Toolkit (nvidia-docker2)
The Technical Reason: While the NVIDIA GPU drivers allow the Host OS to talk to the hardware, Docker containers are isolated by design and cannot “see“ the GPUs. The NVIDIA Container Toolkit provides a specialized library (libnvidia-container) and a CLI wrapper/runtime that “mounts“ the GPU device nodes and driver libraries into the container at runtime.
The NCP-AII Context: This toolkit is the industry-standard way to expose GPUs to container engines (Docker, Podman, LXC). It includes the nvidia-container-runtime, which is configured as a custom runtime in the Docker daemon.json to enable the --gpus flag.
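A hedged post-install sanity check (the CUDA image tag is illustrative):
    # if the runtime is wired up correctly, nvidia-smi inside the container lists the host GPUs
    docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi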
Incorrect: B. The Apache Spark distributed processing framework Apache Spark is a data processing engine used for big data analytics. While Spark can be configured to use GPUs for accelerated data frames (using the NVIDIA RAPIDS accelerator), it is an application-level framework. It does not provide the underlying system-level integration required for Docker to access GPU hardware.
C. Standard Java Runtime Environment (JRE) The JRE is used to run Java-based applications. Most modern AI training workloads (PyTorch, TensorFlow) are based on Python and C++/CUDA. The JRE has no role in hardware abstraction or mapping PCIe GPU devices into a container namespace.
D. The Microsoft DirectX End-User Runtime DirectX is a collection of APIs developed by Microsoft for handling tasks related to multimedia and gaming on Windows. NVIDIA AI Infrastructure (the focus of the NCP-AII exam) is built almost exclusively on Enterprise Linux (Ubuntu, RHEL, or SLES). DirectX is not used in data center AI training environments; CUDA and Vulkan are the relevant APIs.
Question 18 of 60
18. Question
When installing NVIDIA drivers on a cluster node that will utilize BlueField-3 DPUs, which driver suite must be installed to ensure the host can properly communicate with both the GPU and the DPU's network functions?
Correct
Correct: B The NVIDIA Datacenter Driver and DOCA Driver
The Technical Reason: An AI cluster node requires two distinct but interoperable driver stacks on the host OS:
NVIDIA Datacenter Driver: This is the robust, long-term support (LTS) driver required for enterprise-grade GPUs (like H100, A100, or B200). It manages the GPU hardware, CUDA context, and nvidia-smi functionality.
DOCA (Data Center Infrastructure-on-a-Chip Architecture) Driver: For a node equipped with a BlueField-3 DPU, the DOCA-Host package is mandatory. It includes the doca-ofed (InfiniBand/Ethernet) drivers, the rshim driver for communicating with the DPU‘s internal Arm subsystem, and the libraries needed for offloading networking and security tasks to the DPU.
The NCP-AII Context: The exam validates that an administrator can identify the “Validated Recipe“ for an AI node. In an NVIDIA-certified environment, using the standard Datacenter Driver alongside the DOCA suite ensures that the host can perform GPUDirect RDMA (direct communication between the GPU and the DPU) without bottlenecking at the CPU.
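A quick post-install sanity check might look like the following; this is a sketch, and package and device names can differ slightly between DOCA releases:
$ nvidia-smi                  # confirms the Datacenter Driver sees the GPUs
$ ofed_info -s                # reports the installed DOCA-OFED / MLNX_OFED stack version
$ lsmod | grep rshim          # confirms the rshim driver for the DPU's Arm subsystem is loaded
$ ls /dev/rshim0/             # rshim device exposed once the BlueField DPU is detected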
Incorrect: A. The Open Source Nouveau Driver The Nouveau driver is a reverse-engineered, open-source driver for NVIDIA GPUs. While it is included in many Linux distributions for basic display output, it lacks support for CUDA, NVLink, and advanced networking features. It is incompatible with the high-performance requirements of an AI cluster and must typically be blacklisted during the installation of professional NVIDIA drivers.
C. The standard GeForce Game Ready Driver Game Ready Drivers (GRD) are optimized for consumer gaming and creative applications. They lack the data center-specific features—such as peer-to-peer (P2P) memory support, specialized thermals, and long-term stability—required for HGX/DGX systems. Using consumer drivers in an AI factory environment is unsupported and will not integrate correctly with the DOCA management stack.
D. The Legacy 340xx Driver series The 340xx series is an extremely old legacy driver meant for GPUs from the Tesla and Fermi architectures (circa 2010-2014). It does not support modern hardware like the H100 GPU or the BlueField-3 DPU and would be physically unable to initialize or communicate with current PCIe Gen 5 infrastructure.
Question 19 of 60
19. Question
When performing the initial bring-up of an NVIDIA HGX system within a large-scale AI factory, an administrator must ensure that the firmware versions across all components are synchronized for stability. During the firmware upgrade process for the GPU complex, which specific utility should be prioritized to verify the current firmware versions of the NVIDIA GPUs and the NVSwitch fabric before proceeding with a production-level update using the NVIDIA Firmware Update (NVFW) tool?
Correct
Correct: C The NVIDIA System Management Interface (nvidia-smi) command
The Technical Reason: nvidia-smi is the primary and most accessible command-line utility for managing and monitoring NVIDIA GPU devices. For an HGX system (which contains a complex of 8 or more GPUs and NVSwitches), nvidia-smi provides an immediate and authoritative query of the VBIOS (Video BIOS) and firmware versions currently running on each GPU.
By running nvidia-smi -q, an administrator can extract detailed information, including the VBIOS Version, which is essential for confirming if the hardware is at the correct baseline before initiating a cluster-wide update with the NVFW (nvfwupd) tool.
The NCP-AII Context: The blueprint explicitly lists SMI (System Management Interface) as a core tool for installing and validating GPU-based servers. In the sequence of events for deployment, nvidia-smi is used to verify that the OS sees the hardware correctly and that the firmware is compatible with the installed driver before moving to more advanced orchestration tools.
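For example, the relevant fields can be pulled directly from the query output (a sketch; exact field labels can vary slightly between driver releases):
$ nvidia-smi -q | grep -i "vbios version"                          # VBIOS of each GPU in the HGX tray
$ nvidia-smi --query-gpu=index,name,vbios_version --format=csv    # same data in CSV for scripting
$ nvidia-smi -q -i 0 | grep -i inforom                             # InfoROM image version, also useful as a baseline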
Incorrect: A. The standard Linux dmidecode utility dmidecode reads the system‘s DMI (SMBIOS) table to provide information about the server‘s motherboard, CPU, and RAM. While it can identify that a PCIe device is present, it cannot “reach into“ the NVIDIA GPU complex to report specific VBIOS versions or NVSwitch fabric firmware. It is a general-purpose tool, not an NVIDIA-specific hardware validator.
B. The NVIDIA Fabric Manager status dashboard via the web UI While the NVIDIA Fabric Manager is critical for initializing the NVSwitch fabric on HGX systems, it does not typically offer a “web UI“ dashboard for manual firmware verification as part of the initial CLI-based bring-up. The Fabric Manager is a background service; verification of its state and the underlying firmware is still performed via command-line tools like nvidia-smi or nvswitch-cli.
D. The ipmitool sensors command ipmitool interacts with the Baseboard Management Controller (BMC) to report environmental data such as fan speeds, voltages, and chassis temperatures. While the BMC can sometimes report firmware versions via the Redfish API (which the NVFW tool uses), the sensors command specifically is used for thermal and power monitoring, not for querying the specific VBIOS versions of individual GPUs within the HGX tray.
Question 20 of 60
20. Question
During the physical installation of GPU-based servers, a technician must validate that the cooling parameters meet the requirements for NVIDIA H100 GPUs. If the BMC reports that the GPU inlet temperature is nearing the thermal throttle limit despite low ambient room temperatures, what is the most likely physical configuration error within the rack?
Correct
Correct: A The server is missing blanking panels in the rack, causing hot air recirculation into the cold aisle.
The Technical Reason: NVIDIA HGX H100 systems are designed for high-pressure, front-to-back airflow. In a standard “Hot Aisle/Cold Aisle“ data center, blanking panels are mandatory to seal empty rack spaces.
Without them, the high-pressure hot air exhausted from the rear of the servers can leak through the open gaps and circulate back to the front (the cold aisle).
This causes the server to pull in pre-heated air rather than chilled air, leading to high Inlet Temperatures even if the room‘s ambient temperature is low.
The NCP-AII Context: The certification teaches that GPU “thermal throttling“ significantly degrades training performance. Validating the physical rack environment—including blanking panels and proper cable management—is a primary step in the “Infrastructure Validation“ phase.
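As a quick field check, the inlet readings can be compared against the room's ambient temperature from the BMC. This is a sketch; the BMC address, credentials, and sensor names are illustrative and vary by vendor:
$ ipmitool -I lanplus -H 10.0.0.50 -U admin -P '<password>' sdr type temperature
$ ipmitool -I lanplus -H 10.0.0.50 -U admin -P '<password>' sensor | grep -i inlet
A large gap between the cold-aisle ambient and the reported inlet temperature points to recirculation rather than a facility cooling problem.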
Incorrect: B. GPU-based servers are configured with the wrong IP addresses The BMC (Baseboard Management Controller) manages fan speeds autonomously based on internal thermal sensors and pre-defined lookup tables in the firmware. While an incorrect OOB (Out-of-Band) IP address prevents an administrator from viewing the fan status remotely, it does not physically stop the BMC from controlling the fans or cause the hardware to overheat.
C. The TPM is not initialized The TPM (Trusted Platform Module) is a security component used for Measured Boot and hardware attestation. It has no logical or physical connection to the fan controller or the thermal management subsystem. A system with an uninitialized TPM will still cool itself perfectly fine as long as the BMC is functioning.
D. The storage array is connected via SAS instead of NVMe This is a technical mismatch regarding heat density. While NVMe drives are faster, they often generate more heat than SAS drives due to higher controller speeds. Regardless, the thermal profile of a few storage drives is negligible compared to the 700W+ TDP of a single H100 GPU. Switching to SAS would not be the cause of a GPU hitting its thermal throttle limit.
Question 21 of 60
21. Question
During the NCCL (NVIDIA Collective Communications Library) tests, an engineer notices that the East-West fabric bandwidth is significantly lower than expected for a 400Gb/s InfiniBand network. Which tool or diagnostic should be run to specifically isolate whether the issue is with the NVLink Switch inside the node or the external fabric cabling?
Correct
Correct: C Running the 'nccl-tests' with the 'NVLINK_DISABLE=1' environment variable.
The Technical Reason: nccl-tests (like all_reduce_perf) automatically detect the fastest available path, which is typically NVLink for intra-node communication and InfiniBand/RoCE for inter-node communication.
If bandwidth is low, an engineer must determine whether the bottleneck is the internal NVSwitch (NVLink) or the external network interface cards/cabling (InfiniBand).
By setting NVLINK_DISABLE=1 (or NCCL_P2P_DISABLE=1 in some versions), you force NCCL to ignore the NVLink path and use the network fabric (PCIe → NIC → Switch) even for GPUs within the same node.
The Logic: If bandwidth remains low after disabling NVLink, the issue is likely with the external fabric or NICs. If bandwidth significantly improves or matches the expected network line rate (e.g., 400Gb/s) after disabling NVLink, the issue is isolated to the internal NVLink/NVSwitch complex.
The NCP-AII Context: The exam tests your ability to use environment variables to manipulate the NCCL Topology Detection engine for diagnostic purposes.
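A hedged example of the isolation step, using the NCCL_P2P_DISABLE variant mentioned above (the binary path, process count, and message-size sweep are illustrative):
# Baseline: NCCL picks NVLink for intra-node traffic
$ mpirun -np 16 ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
# Re-run with the NVLink/P2P path disabled to force the PCIe -> NIC -> fabric path
$ NCCL_P2P_DISABLE=1 mpirun -x NCCL_P2P_DISABLE -np 16 ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
Comparing the bus-bandwidth columns of the two runs shows which side of the node boundary is underperforming.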
Incorrect: A. Executing a 'burn-in' test on the local NVMe storage drives While storage performance is important, a "burn-in" for NVMe drives tests the I/O throughput and IOPS of the storage subsystem. It has no diagnostic value for a collective communication bandwidth issue between GPUs, which relies on the East-West network fabric and memory-to-memory transfers.
B. Checking the server's BIOS to see if hyper-threading is enabled Hyper-threading (Simultaneous Multithreading) affects CPU core utilization and general compute tasks. While it can have a minor impact on the "control plane" overhead of communication libraries, it is not a setting used to isolate hardware-level bandwidth bottlenecks between NVLink and the 400Gb/s InfiniBand fabric.
D. Using the 'ibstat' command to check the port state on the BMC This option contains a technical error. ibstat is a Linux command used to check the status of local InfiniBand HCAs (Host Channel Adapters) on the host OS, not on the BMC (Baseboard Management Controller). While ibstat can tell you if a link is "Active" or "Down," it cannot help you differentiate between an internal NVLink failure and an external fabric performance degradation.
Question 22 of 60
22. Question
A data center engineer is performing the initial bring-up of a new NVIDIA HGX system. After connecting the power cables and ensuring the cooling systems are operational, the engineer needs to perform a firmware upgrade on the HGX board and its associated components. Which sequence of events is most critical to ensure the integrity of the firmware upgrade process and prevent hardware initialization faults during the subsequent boot cycle?
Correct
Correct: A Validate power and cooling parameters, then use the Baseboard Management Controller (BMC) to verify current firmware versions before applying updates in a specific sequence.
The Technical Reason: An HGX system is not just a collection of GPUs; it is a complex “baseboard assembly“ containing NVSwitches, PCIe Retimers, and Electronic Root of Trust (ERoT) modules.
Environmental Validation: Before flashing firmware, power and cooling must be stable. A power failure during a firmware write can “brick“ the hardware.
The BMC and Redfish: In modern NVIDIA-certified systems (like DGX H100/H200), the BMC acts as the gateway for firmware updates via the Redfish API. The nvfwupd tool (NVIDIA Firmware Update) communicates with the BMC to query the current state before pushing a “Firmware Recipe.“
Sequence Matters: Certain components (like the BMC itself or the CPLD) often need to be updated before the GPU VBIOS or NVSwitch firmware to ensure communication compatibility.
The NCP-AII Context: The exam blueprint explicitly lists “Perform initial configuration of BMC, OOB, and TPM“ and “Perform firmware upgrades (including on HGX™)“ as key skills. The “Correct“ path always prioritizes verification and environmental stability over immediate execution.
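For instance, the pre-update verification can be done against the BMC's Redfish firmware inventory. This is a sketch using standard DMTF endpoints; the BMC address, credentials, and the inventory member name are placeholders that vary by vendor:
$ curl -sk -u admin:'<password>' https://10.0.0.50/redfish/v1/UpdateService/FirmwareInventory
$ curl -sk -u admin:'<password>' https://10.0.0.50/redfish/v1/UpdateService/FirmwareInventory/GPU_SXM_1 | python3 -m json.tool
The reported versions are then compared against the release-note baseline before nvfwupd pushes the new firmware recipe.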
Incorrect: B. Disconnect all InfiniBand cables to prevent network interference This is a common distractor. While InfiniBand is the high-speed “Data Plane,“ firmware updates typically happen over the Management Plane (the OOB/BMC network). Disconnecting InfiniBand cables is unnecessary and does not impact the integrity of the firmware flash process. In fact, for multi-node automated updates via Base Command Manager (BCM), network connectivity is required.
C. Initiate the firmware update directly from the OS using NVIDIA SMI While nvidia-smi can be used to check versions, it is primarily a driver-level tool. For a full HGX system upgrade (which includes NVSwitches, BMC, and BIOS), the industry-standard tool is nvfwupd, which operates out-of-band via the BMC. Updating only via the OS risks skipping critical infrastructure components that the OS cannot “see“ directly.
D. Perform a hard reset immediately after the firmware flash starts This is a “catastrophic“ error. Interrupting a firmware flash (“hard reset“) before it completes will leave the hardware in an inconsistent state, often resulting in a permanent hardware failure that requires a physical board replacement. Firmware updates require a Power Cycle after the successful completion of the flash to activate the new image, never during the process.
Question 23 of 60
23. Question
The ClusterKit tool is being used for a multifaceted node assessment. One of the tests fails because the NVLink Switch cannot be verified. What does this failure imply for the AI workloads intended for that node, and which physical component should be inspected first?
Correct
Correct: D The GPUs cannot communicate with each other at full speed; the physical HGX baseboard or NVSwitch modules should be inspected.
The Technical Reason: NVLink is the high-bandwidth, energy-efficient, low-latency interconnect that allows GPUs within a node (the HGX/DGX tray) to share data. The NVSwitch is the physical ASIC on the HGX baseboard that routes this traffic. If ClusterKit fails to verify the NVLink Switch, it means the peer-to-peer (P2P) communication fabric is broken or degraded.
Impact on Workloads: AI training (especially Large Language Models) relies heavily on Collective Communications (All-Reduce, All-to-All). Without a functional NVLink fabric, the GPUs are forced to communicate over the much slower PCIe bus, leading to a massive drop in training throughput (often >80% performance loss).
Physical Inspection: On an HGX system, the NVSwitch chips are integrated into the baseboard. A failure here requires inspecting the physical seating of the GPU modules on the baseboard or checking for hardware defects in the NVSwitch modules themselves.
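Before pulling the tray, the link state can also be confirmed from the host (a sketch; output formats differ by driver and GPU generation):
$ nvidia-smi nvlink --status    # per-GPU NVLink link state and per-link speed
$ nvidia-smi topo -m            # GPU pairs should show NVLink (NV#) paths rather than PCIe-only
Inactive links or PCIe-only paths in the topology matrix corroborate an NVSwitch or seating fault before any hardware is swapped.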
Incorrect: A. The storage system is too slow; update NVMe firmware This is a distractor. While storage is a critical “North-South“ component, it has no logical or physical connection to the NVLink Switch. NVLink is purely for GPU-to-GPU data transfer. Storage bottlenecks are diagnosed using tools like fio or iozone, not by testing the NVLink fabric.
B. The node cannot connect to the internet; inspect the RJ45 cable Network connectivity to the internet or the management network is handled by the LOM (LAN on Motherboard) or the OOB (Out-of-Band) management port. ClusterKit uses internal PCIe/NVLink queries to test the switch. If the node were completely offline, you wouldn‘t even be able to run ClusterKit or receive the error report, but “internet access“ is irrelevant to internal GPU-to-GPU communication.
C. The CPU cannot access system memory; inspect DDR5 slots This describes a NUMA or Memory Controller issue. While CPU-to-System-Memory performance is important for loading data, NVLink exists specifically to bypass the CPU and System RAM when GPUs talk to each other. A failure in the NVSwitch verification points to the GPU fabric, not the standard CPU memory architecture.
Question 24 of 60
24. Question
During the deployment of an AI factory, the security policy requires the initialization of the Trusted Platform Module (TPM) and the configuration of the Baseboard Management Controller (BMC) to ensure a secure boot process and remote management. If the administrator needs to perform these tasks across 100 nodes simultaneously before the operating system is installed, which methodology is considered best practice for scale and efficiency?
Correct
Correct: B Using a Redfish-compliant API script to push configurations via the OOB network.
The Technical Reason: Modern NVIDIA-Certified Systems (like DGX H100 or HGX-based clusters) utilize the Redfish API, an industry-standard RESTful interface for hardware management.
Scale: Redfish allows an administrator to send JSON-based configuration payloads (for TPM initialization, BMC users, BIOS settings, and Boot Order) to hundreds of nodes simultaneously over the Out-of-Band (OOB) management network.
Pre-OS Execution: Because Redfish operates at the BMC level, it does not require an operating system to be installed. This is the “Day 0“ method for preparing hardware for PXE booting or OS deployment via NVIDIA Base Command Manager (BCM).
The NCP-AII Context: The certification specifically highlights Remote Management and BMC/Redfish as the primary tools for cluster initialization. Utilizing scripts (Python/Curl) or orchestration tools like Ansible with Redfish modules is the only “Best Practice“ for an AI Factory scale.
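A minimal sketch of such a push is shown below; the hostname pattern, credentials, and the TPM-related BIOS attribute name are illustrative, since attribute names are vendor-specific:
for i in $(seq -w 1 100); do
  curl -sk -u admin:"${BMC_PASS}" \
    -H "Content-Type: application/json" \
    -X PATCH "https://bmc-node${i}.oob.example.com/redfish/v1/Systems/1/Bios/Settings" \
    -d '{"Attributes": {"TpmSecurity": "On"}}'
done
The same loop pattern (or an Ansible Redfish module) can push BMC user, boot-order, and BIOS payloads to the entire fleet before any OS is deployed.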
Incorrect: A. Configuring one master node and physically cloning its SSD This is a “Legacy“ approach that fails in a modern AI infrastructure for several reasons:
Hardware Identity: SSD cloning does not configure the BMC or TPM, which reside on the motherboard/baseboard firmware, not the drive.
Security: TPMs are unique to each motherboard; a cloned OS would fail integrity checks because the unique hardware keys wouldn‘t match.
Efficiency: Physically moving 100 SSDs is labor-intensive and error-prone compared to a 10-second API call.
C. Manually logging into each individual BMC web interface While functionally possible, this is the opposite of a “Best Practice“ for an AI Factory. Manual configuration of 100 nodes is:
Inconsistent: High risk of human error (typos in IP addresses or security settings).
Inefficient: It would take hours or days to complete, whereas an API script completes the task in minutes.
Not Scalable: AI Factories are built to scale to thousands of GPUs; manual processes are a bottleneck.
D. Inserting a physical USB boot drive into every server This method requires physical presence in the data center for every node. It is inefficient for 100 nodes and impossible for large-scale “Lights Out“ data centers. Furthermore, running a local script on a USB drive requires the system to boot into an environment, which may be blocked by Secure Boot if the TPM and BIOS are not yet configured—creating a “chicken and egg“ problem that Redfish solves by working out-of-band.
Question 25 of 60
25. Question
An administrator needs to install the NGC CLI on several host machines to allow researchers to pull NVIDIA-optimized containers and models. What is the correct procedure for ensuring the NGC CLI is properly configured for a private organization in an AI factory?
Correct
Correct: D Download the binary, add it to the system PATH, and use the 'ngc config set' command to provide a valid API key from the NGC portal.
The Technical Reason: The NGC CLI is distributed as a standalone binary (available for Linux, Windows, and macOS).
Installation: Unlike standard OS packages, it is typically downloaded directly from the NGC website or via wget. So that it can be invoked from any terminal session, the binary must be placed in a directory on the user‘s $PATH (e.g., /usr/local/bin).
Configuration: To access private organizations or pull specific containers, the CLI must be authenticated. This is done by generating an API Key in the NGC web portal and then running ngc config set on the host. This command prompts for the key and sets the default organization and output format.
The NCP-AII Context: This setup is a prerequisite for researchers to pull optimized frameworks like PyTorch or TensorFlow from the nvcr.io registry.
Incorrect: A. Install the NGC CLI via the BMC‘s virtual media interface This is a distractor that confuses hardware management with cloud software. The BMC (Baseboard Management Controller) and Virtual Media are used for mounting OS ISOs or firmware updates. The NGC CLI is a user-space application used after the OS is running; it does not interact with the TPM for API key storage, as keys are typically stored in a local configuration file (e.g., ~/.ngc/config).
B. Use ‘apt-get install ngc-cli‘ and restart the InfiniBand switches The NGC CLI is not currently available in standard public APT repositories (it is a direct binary download). Furthermore, the InfiniBand switches manage the high-speed data fabric; they have no role in authenticating a cloud-based CLI or managing user credentials for a container registry.
C. The NGC CLI is pre-installed on all BlueField DPUs While the BlueField DPU runs a Linux OS, the NGC CLI is not a standard pre-installed component of the DOCA image. Additionally, if a DPU is in NIC mode, the internal ARM cores are disabled, meaning no software (including a CLI) can run on the DPU itself.
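A quick post-install sanity check can be scripted across hosts; the Python sketch below (with assumed file locations) verifies that the ngc binary is on the PATH, that a configuration file exists, and that the CLI responds. It does not replace running ngc config set interactively with a valid API key.

# Sketch: verify NGC CLI presence and configuration on a host (assumed locations).
import shutil, subprocess
from pathlib import Path

ngc_bin = shutil.which("ngc")                        # binary must be in $PATH, e.g. /usr/local/bin
config_file = Path.home() / ".ngc" / "config"        # written by `ngc config set`

print("ngc on PATH:", ngc_bin or "NOT FOUND")
print("config present:", config_file.exists())

if ngc_bin:
    # Prints the installed CLI version; a failure here usually means a corrupt download or wrong architecture.
    subprocess.run([ngc_bin, "--version"], check=False)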
Question 26 of 60
26. Question
An engineer is conducting a single-node stress test and an HPL benchmark on a new AI server. During the test, the HPL performance starts high but gradually drops by 30 percent over a 15-minute period. Which hardware-related issue is most likely causing this performance degradation, and how can it be verified?
Correct
Correct: A Thermal throttling of the GPUs; it can be verified by checking the Clocks Throttle Reasons in the output of nvidia-smi -q.
The Technical Reason: HPL is an extremely compute-intensive benchmark that pushes GPUs to their thermal and power limits. If performance starts high but drops gradually over 15 minutes, it strongly indicates a thermal saturation issue.
As the GPU temperature approaches the maximum operating threshold, the driver automatically reduces the clock speeds to prevent hardware damage.
Verification: The standard way to confirm this is by querying the GPU‘s internal state using nvidia-smi -q. Under the Clocks Throttle Reasons section, the output lists each possible reason, including Idle, Applications Clocks Setting, and most importantly SW Thermal Slowdown (software-level thermal throttling) and HW Thermal Slowdown (hardware-level thermal throttling), together with whether each is currently Active.
The NCP-AII Context: The certification emphasizes using the NVIDIA System Management Interface (nvidia-smi) as the primary tool for “Day 1“ validation. Identifying “throttling reasons“ is a specific troubleshooting step required to differentiate between power limits, thermal limits, or software configuration bottlenecks.
Incorrect: B. The SSD storage is becoming fragmented HPL is a floating-point computational benchmark that runs primarily in GPU VRAM and System RAM. Once the initial matrices are loaded into memory, the storage subsystem (SSD) has virtually zero impact on the ongoing calculation. Fragmentation on a Linux SSD would affect boot times or initial data loading, but it cannot cause a 30% performance drop in a GPU-resident computational loop.
C. The BIOS is in Power Save mode If the BIOS were in “Power Save“ mode, the performance would be consistently low from the very beginning of the test. It would not start at a high performance level and then “gradually drop“ over 15 minutes. Furthermore, checking a motherboard serial number in a BMC log provides inventory data, but it does not reveal the active power profile or real-time clock speed behavior.
D. The Slurm scheduler is losing its connection to the head node Slurm is used for job orchestration (scheduling and launching). Once a job is running on a node, Slurm‘s connection to the head node is only used for heartbeats and status updates. If the connection were lost, the job might be marked as “failed“ or “drained,“ but it would not cause the running HPL binary to slowly degrade in computational speed. Network “ping latency“ tests the management fabric, not the GPU‘s internal performance.
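The same check can be automated with the NVML Python bindings (the nvidia-ml-py package); the sketch below is an assumption-level example that flags any GPU reporting thermal slowdown while HPL is running.

# Sketch: poll NVML for thermal throttle reasons (assumes the nvidia-ml-py package is installed).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)   # bitmask of active reasons
    sw_thermal = bool(reasons & pynvml.nvmlClocksThrottleReasonSwThermalSlowdown)
    hw_thermal = bool(reasons & pynvml.nvmlClocksThrottleReasonHwThermalSlowdown)
    print(f"GPU {i}: {temp} C, SW thermal slowdown={sw_thermal}, HW thermal slowdown={hw_thermal}")
pynvml.nvmlShutdown()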
Question 27 of 60
27. Question
The ClusterKit tool is used during the verification phase of an AI infrastructure deployment. What is the primary advantage of running ClusterKit compared to manually executing individual stress tests like HPL or NCCL-tests?
Correct
Correct: A It provides a multifaceted assessment by automating a suite of tests that validate everything from node health to fabric performance in a single workflow.
The Technical Reason: While manual tests like HPL (High Performance Linpack) and NCCL-tests (NVIDIA Collective Communications Library) are excellent for specific performance metrics, they are disparate tools. ClusterKit is an orchestration wrapper that automates these tests alongside health checks for GPUs, NVSwitches, InfiniBand links, and CPU-to-GPU bandwidth (bandwidthTest). It correlates these results into a unified report, ensuring that the “AI Factory“ is balanced and production-ready.
The NCP-AII Context: The certification emphasizes efficiency at scale. Manually running five different benchmarks across 64 or 128 nodes is error-prone. ClusterKit ensures a standardized, repeatable validation process that covers the entire hardware and software stack in one pass.
Incorrect: B. It guarantees that the cluster will achieve the top spot on the TOP500 list While ClusterKit helps identify bottlenecks that might prevent a cluster from reaching its theoretical peak performance, it is a diagnostic and validation tool, not a real-time OS kernel optimizer. Achieving a TOP500 ranking requires extensive manual tuning of libraries, MPI ranks, and thermal management beyond what a validation script provides.
C. It automatically repairs hardware faults by re-soldering loose connections This is a physical impossibility for a software tool. ClusterKit can detect a hardware fault (such as an NVSwitch link failure or a GPU “fallen off the bus“), but physical repairs require manual intervention by a technician (e.g., reseating a module or replacing a baseboard).
D. It eliminates the need for any NVIDIA drivers to be installed This is factually incorrect. ClusterKit depends on the NVIDIA Datacenter Driver and the CUDA Toolkit to interact with the hardware. It cannot query the hardware at a low level without the driver‘s kernel module (nvidia.ko) providing the interface to the GPU and NVSwitch registers.
Question 28 of 60
28. Question
A cluster node is reporting a hardware fault where one of the six fans is spinning at 0 RPM, and the corresponding GPU is reporting ‘Thermal Violation‘ in nvidia-smi. What is the correct troubleshooting and remediation path for an NVIDIA-Certified system in a production environment?
Correct
Correct: B Identify the faulty fan module, verify the fault in the BMC logs, and replace the fan unit following the server‘s hot-swap or FRU procedures.
The Technical Reason: In an NVIDIA-Certified system (such as an HGX or DGX platform), the Baseboard Management Controller (BMC) is the authoritative source for hardware health. A fan spinning at 0 RPM is a critical physical failure. Because high-density GPU systems rely on specific static pressure and airflow patterns, a single fan failure can cause an immediate “Hot Spot,“ leading the affected GPU to trigger a Thermal Violation (throttling its clock speed to 300-600MHz to prevent permanent damage).
The Remediation Path: Professional standards require using the BMC (via Web UI, CLI, or Redfish) to confirm which specific fan index (e.g., Fan_4) has failed. Most NVIDIA-Certified servers are designed with Hot-Swap fan modules that can be replaced while the system is powered on, provided the thermal threshold isn‘t exceeded during the brief swap.
The NCP-AII Context: The exam validates your ability to use Out-of-Band (OOB) management tools to diagnose physical faults before they lead to cluster-wide job failures.
Incorrect: A. Overclock the other five fans to 150% speed This is physically impossible and technically unsound. Fans are rated for a maximum RPM defined by their firmware; they cannot be “overclocked“ to 150% via software. Furthermore, air takes the path of least resistance. A dead fan creates a “hole“ in the pressure chamber where hot air can recirculate, and increasing the speed of other fans will not effectively bridge the cooling gap for the specific GPU located behind the failed module.
C. Switch the server to a different Linux distribution The underlying Linux distribution (Ubuntu, RHEL, etc.) has a negligible impact on the thermal output of a GPU running an AI workload like HPL or PyTorch. If the hardware cooling (the fan) has failed, no software-level OS change can compensate for the lack of physical airflow required to dissipate the 700W+ of heat generated by a modern datacenter GPU.
D. Remove the thermal paste from the GPU This is factually dangerous and would result in immediate hardware destruction. Thermal Interface Material (TIM/Paste) is required to fill the microscopic air gaps between the GPU silicon and the heatsink. Removing it would cause the GPU temperature to spike to critical levels (100°C+) within seconds of applying a load, even if the fans were working perfectly.
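To confirm which fan index has failed without walking to the rack, the BMC‘s standard Redfish Thermal resource can be queried; in the sketch below the BMC address, credentials, and chassis ID are assumptions that vary by vendor.

# Sketch: read fan speeds from the BMC's Redfish Thermal resource (assumed address/credentials).
import requests
from requests.auth import HTTPBasicAuth

BMC = "10.0.0.101"                                   # hypothetical OOB address of the affected node
AUTH = HTTPBasicAuth("admin", "password")            # placeholder credentials

url = f"https://{BMC}/redfish/v1/Chassis/1/Thermal"  # chassis ID ("1", "Self", ...) is vendor-specific
thermal = requests.get(url, auth=AUTH, verify=False, timeout=30).json()

for fan in thermal.get("Fans", []):
    # A fan reporting 0 RPM (or a failed Health status) identifies the FRU to replace.
    print(fan.get("Name"), fan.get("Reading"), fan.get("ReadingUnits"),
          fan.get("Status", {}).get("Health"))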
Question 29 of 60
29. Question
A deployment team is configuring the Out-of-Band (OOB) management network for a new NVIDIA-based server farm. They must ensure that the BMC is reachable for remote firmware updates of the HGX baseboard. What is the recommended sequence for performing a firmware upgrade on an HGX system to ensure component compatibility and system stability during the bring-up phase?
Correct
Correct: A Update the BMC firmware first, then the BIOS/UEFI, followed by the HGX baseboard firmware, and then the individual GPU firmware.
The Technical Reason: This is the “Bottom-Up“ approach required for hardware stability.
BMC (Baseboard Management Controller): This is the foundation of the management plane. Since the BMC handles the actual flashing process (often via Redfish) for other components, it must be updated first to ensure it has the latest communication protocols and bug fixes to handle the subsequent payloads.
BIOS/UEFI: This manages the CPU and PCIe initialization. It must be updated to support any new PCIe features or security enhancements required by the HGX baseboard.
HGX Baseboard: This includes the NVSwitch fabric and PCIe retimers. Once the motherboard (BIOS) is ready, the fabric that connects the GPUs can be initialized.
GPU Firmware (VBIOS): These are the final endpoints. They rely on the underlying stability of the NVSwitch fabric and the BIOS to be correctly addressed during the final stage of the boot cycle.
The NCP-AII Context: Following the NVIDIA Validated Recipe is a core exam concept. Deviating from this sequence can lead to “I2C timeout“ errors or the BMC losing visibility of the GPUs during the update process.
Incorrect: B. Install the NVIDIA Container Toolkit and use the update-firmware flag The NVIDIA Container Toolkit is a user-space utility used to expose GPUs to Docker or Podman. It is not a firmware management tool. It does not have the capability to flash low-level hardware components like the BMC or BIOS. Firmware updates are performed out-of-band (via BMC/Redfish) or in-band using the NVIDIA Firmware Update (NVFW/nvfwupd) tool, not the container toolkit.
C. Update the OS drivers first, then the GPU firmware, and finally the BMC This sequence is reversed and dangerous. If you update the OS drivers to a version that requires a newer VBIOS or NVSwitch firmware, the driver may fail to initialize the hardware (“GPU fallen off bus“). Furthermore, the BMC should always be updated before the components it manages, not after, to ensure the management interface remains stable throughout the maintenance window.
D. Flash all components simultaneously using a broadcast script Simultaneous flashing (broadcasting) to multiple different components on a single motherboard is a recipe for a “brick.“ Many components share the same management buses (like I2C or PCIe SMBus). Attempting to flash the BIOS while the BMC is busy flashing an NVSwitch can cause bus contention, resulting in a failed flash and potential hardware corruption. Updates within a single node must be serialized.
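In practice these updates are usually driven by vendor tooling such as nvfwupd, but the underlying mechanism is often the standard Redfish SimpleUpdate action. The sketch below is illustrative only: the image URLs, BMC address, and credentials are assumptions, and it simply serializes the documented order rather than handling task monitoring or reboots between stages.

# Sketch: serialized firmware pushes via the standard Redfish SimpleUpdate action (assumed URIs/credentials).
import requests
from requests.auth import HTTPBasicAuth

BMC = "10.0.0.101"                                  # hypothetical OOB address
AUTH = HTTPBasicAuth("admin", "password")           # placeholder credentials
ACTION = f"https://{BMC}/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate"

# The validated order: BMC -> BIOS/UEFI -> HGX baseboard -> GPU VBIOS (image names are hypothetical).
IMAGES = ["http://repo.local/fw/bmc.fwpkg",
          "http://repo.local/fw/bios.fwpkg",
          "http://repo.local/fw/hgx_baseboard.fwpkg",
          "http://repo.local/fw/gpu_vbios.fwpkg"]

for image in IMAGES:
    resp = requests.post(ACTION, json={"ImageURI": image}, auth=AUTH, verify=False, timeout=60)
    print(image, resp.status_code)
    # A real workflow would poll the returned Task resource and wait for completion before the next image.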
Question 30 of 60
30. Question
When installing NVIDIA drivers on a cluster node that will utilize BlueField-3 DPUs, which driver suite must be installed to ensure the host can properly communicate with both the GPU and the DPU network functions effectively?
Correct
Correct: B The NVIDIA Datacenter Driver and DOCA Driver
The Technical Reason: To manage an AI node equipped with both high-performance GPUs (like H100) and BlueField-3 DPUs, two primary software components must coexist on the Host OS:
NVIDIA Datacenter Driver: This is the specialized driver suite for enterprise-grade GPUs. It manages GPU memory, CUDA contexts, and enables the nvidia-smi monitoring tool.
DOCA (Data Center Infrastructure-on-a-Chip Architecture) Driver: For BlueField-3, the DOCA-Host package is essential. It provides the necessary drivers (like rshim for DPU management and mlx5_core for high-speed networking) and libraries to offload networking, security, and storage tasks to the DPU.
The NCP-AII Context: The exam blueprint (specifically in the Control Plane Installation and Configuration domain) validates that an administrator can identify and install the correct “validated recipe“ of drivers. For a converged node, this means pairing the Datacenter GPU driver with the DOCA framework to enable features like GPUDirect RDMA, allowing the GPU and DPU to communicate directly without CPU bottlenecking.
Incorrect Options: A. The Legacy 340xx Driver series The 340xx series is a legacy driver for much older NVIDIA architectures (Tesla and Fermi) from over a decade ago. It lacks any support for modern hardware like H100 GPUs or BlueField-3 DPUs and is entirely incompatible with current AI infrastructure requirements.
C. The Open Source Nouveau Driver The Nouveau driver is a basic, reverse-engineered open-source driver included with many Linux distributions. It is intended for simple display output and does not support CUDA, high-performance networking offloads, or DPU management. In an AI infrastructure deployment, the Nouveau driver must typically be blacklisted to allow the official NVIDIA drivers to function.
D. The standard GeForce Game Ready Driver Game Ready Drivers (GRD) are optimized for consumer gaming and creative applications on Windows or Linux workstations. They lack the data center-grade stability, long-term support (LTS), and specific enterprise features (such as Peer-to-Peer memory support and advanced thermal management) required for HGX/DGX cluster nodes.
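After installing both stacks, a simple host-side check can confirm that the GPU driver and the DOCA host components are active; the sketch below inspects standard Linux interfaces and assumes the usual module names (on newer DOCA releases rshim may run as a user-space service rather than a kernel module).

# Sketch: confirm the Datacenter Driver and DOCA host components are active (assumed module names).
import subprocess
from pathlib import Path

# GPU driver check: nvidia-smi only succeeds if the nvidia kernel module and driver stack are healthy.
gpu_ok = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).returncode == 0

# DOCA host check: mlx5_core carries the high-speed NIC path, rshim provides the DPU management channel.
modules = Path("/proc/modules").read_text()
doca_ok = "mlx5_core" in modules and "rshim" in modules

print("GPU driver loaded:", gpu_ok)
print("DOCA host modules (mlx5_core, rshim) loaded:", doca_ok)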
Question 31 of 60
31. Question
A researcher wants to run multiple independent AI inference workloads on a single NVIDIA A100 or H100 GPU to maximize resource utilization. They decide to use Multi-Instance GPU (MIG) technology. What is a key requirement and characteristic when configuring MIG for these high-performance computing (HPC) and AI workloads?
Correct
Correct: C. MIG allows the GPU to be partitioned into up to seven independent instances, each with its own dedicated high-bandwidth memory and compute cores.
Physical Partitioning: MIG partitions the GPU along its GPC (Graphics Processing Cluster) and memory-slice boundaries; on the A100 and H100 this allows the GPU to be sliced into up to seven hardware-isolated instances.
Dedicated Resources: Each instance is allocated its own dedicated slice of the GPU‘s Compute (SMs), L2 Cache, and Memory Controllers (VRAM).
Isolation: Because the resources are physically partitioned, a workload on one instance cannot impact the performance or latency of another, making it the “Gold Standard“ for concurrent inference in the NCP-AII curriculum.
Incorrect: A. MIG instances share the same L2 cache and memory controllers, allowing workloads to dynamically steal bandwidth…
The “Noisy Neighbor“ Problem: This describes standard Time-Slicing (software-level partitioning), not MIG. The entire purpose of MIG is to prevent workloads from sharing or “stealing“ cache and bandwidth.
Fixed Allocation: In the NCP-AII framework, MIG is defined by its deterministic performance. Once an instance is created, its bandwidth is guaranteed and isolated from other instances.
B. MIG is only compatible with Windows-based desktop environments…
Platform Target: While some NVIDIA professional workstation GPUs support MIG, the technology is primarily designed for Linux-based AI factory clusters and data center environments (DGX, HGX).
Enterprise Focus: The NCP-AII exam focuses almost exclusively on Linux (Ubuntu/RHEL) environments managed by tools like Base Command Manager (BCM) and Slurm.
D. To enable MIG, the administrator must first disable the NVIDIA GPU driver…
Driver Dependency: MIG is a feature managed by the NVIDIA driver. To enable it, the driver must be active.
Enablement Workflow: The correct procedure taught in the NCP-AII course is to use nvidia-smi -i <GPU_index> -mig 1 (or nvidia-smi -mig 1 for all GPUs). While this may require a GPU reset or a system reboot (especially on A100), the driver must be functional to communicate the change to the GPU‘s firmware.
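As a hedged illustration of that workflow, the sketch below wraps the documented nvidia-smi calls; the GPU index and the 1g profile name are assumptions that depend on the exact A100/H100 SKU.

# Sketch: enable MIG mode and carve seven small instances on GPU 0 (profile name is SKU-dependent).
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvidia-smi", "-i", "0", "-mig", "1"])                  # enable MIG mode (may require a GPU reset)
# Create seven 1-GPC GPU instances plus their compute instances; "1g.10gb" applies to H100 80GB,
# an A100 40GB would use "1g.5gb" instead.
run(["nvidia-smi", "mig", "-i", "0", "-cgi", ",".join(["1g.10gb"] * 7), "-C"])
run(["nvidia-smi", "-L"])                                    # list the resulting MIG devices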
Question 32 of 60
32. Question
An AI infrastructure specialist is optimizing an AMD-based server for maximum GPU throughput. They notice that the PCIe bandwidth between the CPU and the GPUs is lower than expected. Which optimization technique is most likely to resolve this and improve the overall performance for AI training workloads?
Correct
Correct: C. Disable the IOMMU (Input-Output Memory Management Unit) in the BIOS to reduce the overhead of memory address translation for PCIe devices.
Overhead Reduction: The IOMMU is responsible for translating virtual memory addresses to physical ones for I/O devices. In high-performance AI training, where GPUs are constantly fetching massive datasets via PCIe, this translation layer introduces latency and CPU overhead.
Throughput Optimization: In a secure, controlled “AI Factory“ environment (the focus of NCP-AII), the security benefits of IOMMU are often traded for the performance gains of disabling it. This allows the GPU to access memory more directly, maximizing the effective PCIe bandwidth.
AMD Specifics: On AMD-based systems (like those using EPYC CPUs often paired with NVIDIA HGX boards), disabling IOMMU and SR-IOV (if not needed) is a standard “Best Practice“ in the NVIDIA certification prep material to ensure peak throughput.
Incorrect: A. Decrease the MTU size on the management network to sixty-four bytes…
Wrong Network: The management network is for “Out-of-Band“ (OOB) traffic (like IPMI/BMC). It has no impact on the PCIe data path between the CPU and GPU.
Performance Impact: Decreasing MTU to 64 bytes would actually increase the number of interrupts and CPU overhead because the system would have to process many more small packets for the same amount of data.
B. Replace all copper DAC cables with active optical cables (AOC)…
Internal vs. External: DAC and AOC cables are used for external networking (InfiniBand/Ethernet). They do not affect the internal PCIe bandwidth between the CPU and the GPUs residing on the motherboard or baseboard.
EMI Misconception: While AOCs are better for long-distance EMI resistance, they would not resolve a PCIe bandwidth bottleneck occurring inside the server chassis.
D. Enable the ‘Power Saver‘ profile in the OS…
Throttling Risk: In AI infrastructure, you almost always use the “High Performance“ or “Maximum Performance“ profile.
Performance Degradation: A ‘Power Saver‘ profile would likely downclock the CPU and the PCIe controller to save energy, which would decrease GPU throughput and significantly increase training times. NCP-AII focuses on maximizing “Time-to-Solution,“ which is incompatible with power-saving modes.
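Whether the IOMMU is actually disabled (or left in pass-through) can be confirmed from the running host; the sketch below only reads standard kernel interfaces and makes no vendor-specific assumptions.

# Sketch: check IOMMU state on the host after the BIOS change.
from pathlib import Path

cmdline = Path("/proc/cmdline").read_text().strip()
groups_dir = Path("/sys/kernel/iommu_groups")
iommu_groups = list(groups_dir.glob("*")) if groups_dir.exists() else []

print("kernel cmdline:", cmdline)                   # look for amd_iommu=off or iommu=pt overrides
print("active IOMMU groups:", len(iommu_groups))    # zero groups generally means the IOMMU is not in use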
Question 33 of 60
33. Question
As part of the cluster verification process, an engineer runs the High-Performance Linpack (HPL) benchmark. The results show a significantly lower GFLOPS value than the theoretical peak for an NVIDIA H100 cluster. Upon investigation, the engineer notices that the GPU temperatures are spiking and then dropping rapidly. What is the most likely cause of this performance inconsistency during the HPL test?
Correct
Correct: B. The system is experiencing thermal throttling due to insufficient cooling or an incorrect fan policy, causing the GPUs to lower their clock speeds to prevent damage.
Thermal Management Logic: The HPL benchmark is a “power virus“ for GPUs; it pushes the Tensor Cores to their absolute thermal limit. In the NCP-AII curriculum, an inconsistent performance profile (spiking and dropping) is a classic indicator of ClocksThrottleReasons: ThermalSlowdown.
Protective Mechanisms: When a GPU hits its thermal ceiling (typically around 85°C–90°C for H100), the firmware forces the clock speed down to shed heat. Once the temperature drops, the clocks ramp back up, creating the “spiking“ behavior described.
Troubleshooting Step: An engineer would verify this using nvidia-smi -q -d PERFORMANCE, which explicitly lists if “Thermal Slowdown“ is Active.
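A minimal verification sketch (standard nvidia-smi invocations; the exact report field names vary slightly across driver versions):
# Look for "Thermal Slowdown : Active" in the throttle/event reasons section
nvidia-smi -q -d PERFORMANCE
# Sample per-GPU temperature, SM clock and power every 5 seconds while HPL runs
nvidia-smi --query-gpu=index,temperature.gpu,clocks.sm,power.draw --format=csv -l 5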
Incorrect: A. The NVLink Switch is faulty and is dropping fifty percent of the packets…
Latency vs. Throughput: While a faulty NVLink switch or a high Bit Error Rate (BER) on the fabric would lower HPL scores, it would typically result in sustained low performance or a “Bus Error“ / “XID 74“ crash. It would not cause the GPU temperatures to spike and drop rapidly in a rhythmic fashion.
Diagnostic Tool: NVLink issues are typically diagnosed via nvidia-smi nvlink -s or ibdiagnet, not via temperature-related behavior.
C. The Slurm scheduler is over-subscribing the GPUs…
Resource Isolation: Modern NVIDIA clusters use cgroups and Enroot/Pyxis (major NCP-AII topics) to ensure strict resource isolation. If two HPL instances were forced onto one GPU, the job would likely fail immediately due to Out of Memory (OOM) errors, as HPL is configured to fill the available VRAM.
Register Contention: Over-subscription causes “context switching“ latency, which would lower the GFLOPS, but it wouldn‘t explain the rapid temperature cycling.
D. The HPL benchmark is incorrectly configured to use only the CPU cores…
Idle State: If HPL were running only on the CPUs, the GPUs would remain in a P8 (idle) power state. Their temperatures would stay low and stable (ambient + 10°C), contradicting the “spiking“ temperatures mentioned in the scenario.
Theoretical Peak: CPU-only performance is orders of magnitude lower than GPU peak (e.g., ~2 TFLOPS vs. ~60+ TFLOPS). The discrepancy would be so massive that it would be identified as a configuration error rather than a “performance inconsistency.“
Question 34 of 60
34. Question
To optimize a cluster for high-performance computing (HPC) and AI workloads, an administrator decides to use MIG on multiple H100 GPUs. If the administrator creates a 3g.40gb instance on an H100, which of the following statements accurately describes the resource allocation and the remaining capacity of that physical GPU?
Correct
Correct: B. The instance takes 3 GPCs and 40GB of VRAM; due to partitioning rules, the administrator can now only create specific remaining instance types that fit the hardware layout.
Hardware Partitioning: On an H100, MIG is not just a software wrapper; it physically partitions the GPCs (Graphics Processing Clusters), memory controllers, and cache. A “3g“ instance specifically allocates 3 GPCs.
Placement Rules: MIG follows strict “slice“ placement rules. An H100 has 7 compute slices. If you allocate a 3g instance, you are using a specific block of hardware. The remaining 4 compute slices (and the remaining 40GB of memory) can only be partitioned into specific valid combinations (e.g., another 3g.40gb instance, two 2g.20gb instances, a 2g.20gb plus two 1g.10gb instances, or four 1g.10gb instances) as defined by the NVIDIA MIG profiles.
Deterministic Performance: Because these are physical partitions, the 3g.40gb instance has its own dedicated path to memory and its own compute resources, ensuring no “noisy neighbor“ effect.
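A hedged sketch of how this looks in practice (MIG mode already enabled; the available profiles and remaining placements are reported by the tool itself):
# List GPU instance profiles and how many of each still fit on this GPU
sudo nvidia-smi mig -lgip
# Show the valid placements (start slice / size) for each profile
sudo nvidia-smi mig -lgipp
# Create the 3g.40gb GPU instance plus its default compute instance
sudo nvidia-smi mig -cgi 3g.40gb -C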
Incorrect: A. The instance uses 30 percent of the GPU cores… the remaining 70 percent can be used for any other purpose without restriction.
Fixed Profiles: You cannot use the remaining cores “without restriction.“ MIG requires you to select from predefined profiles that align with the hardware‘s crossbar and memory controller layout.
Not Percentage Based: While 3 GPCs is roughly 42% of a 7-GPC H100, NVIDIA uses “slices“ rather than raw percentages to define these boundaries.
C. The 3g.40gb instance is a logical software construct… allowing other users to oversubscribe the same GPCs if needed.
No Oversubscription: This is the most common “trap“ answer on the NCP-AII exam. MIG is designed specifically to prevent oversubscription. Once a GPC is assigned to a MIG instance, it is hardware-isolated. It cannot be shared or oversubscribed by other instances or processes.
D. Creating a 3g instance automatically disables all other MIG capabilities on that card…
Concurrent Instances: One of the main benefits of MIG on the H100 is the ability to run up to 7 independent instances (1g.10gb each) or a mix of sizes simultaneously. Creating a 3g instance does not “lock“ the rest of the card into idleness; it simply leaves the remaining 4 GPCs available for further MIG partitioning.
Question 35 of 60
35. Question
An AI infrastructure team is validating a newly installed GPU-based server using the NVIDIA System Management Interface (nvidia-smi). They notice that one of the GPUs is not listed in the output. After confirming the physical GPU installation and power connections, what should be the next logical step in the system bring-up sequence to diagnose the missing hardware component?
Correct
Correct: C. Check the BMC logs and the PCIe enumeration in the BIOS/UEFI to see if the device is detected at the hardware level.
Hardware Hierarchy: In the NCP-AII “Day 1“ bring-up sequence, diagnostics must move from the bottom up (Physical → Firmware/BIOS → OS → Driver). If nvidia-smi (which depends on the driver) fails to see a GPU, the engineer must verify if the PCIe bus even recognized the device during the Power-On Self-Test (POST).
Out-of-Band (OOB) Diagnostics: The BMC (Baseboard Management Controller) is the “source of truth“ for hardware health. It can report hardware faults, power-good signals, and thermal trips that occur before the OS even boots.
Validation Step: Checking the BIOS/UEFI PCIe Enumeration table confirms if the GPU is communicating with the CPU root complex. If it is missing there, it is a physical or firmware issue; if it is present there but missing in nvidia-smi, it is a driver or OS configuration issue.
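A bottom-up check sequence might look like the following (BMC address and credentials are placeholders):
# Does the OS see the device on the PCIe bus at all?
lspci | grep -i nvidia
# Kernel messages about PCIe link training, enumeration failures, or NVRM/Xid errors
sudo dmesg | grep -iE "pcie|nvrm|xid"
# Query the BMC system event log out-of-band
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist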
Incorrect: A. Install the latest version of Pyxis and Enroot to see if the container runtime can force the GPU to appear.
Layer Mismatch: Pyxis and Enroot are high-level tools used for unprivileged containerization (typically with Slurm). They rely entirely on the underlying NVIDIA driver and kernel modules being functional.
Logical Error: A container runtime cannot “force“ the kernel or hardware to see a device that hasn‘t been enumerated on the PCIe bus. This step is far too late in the deployment sequence.
B. Immediately replace the HGX baseboard, as a missing GPU always indicates a catastrophic failure…
Premature Action: NVIDIA certification emphasizes “fault isolation“ before replacement. A missing GPU could be caused by a simple BIOS setting (e.g., Above 4G Decoding disabled), a loose power cable, or a firmware mismatch. Replacing an entire HGX baseboard is a “last resort“ after all logical checks (like those in Option C) are exhausted.
D. Run a High-Performance Linpack (HPL) test on the remaining GPUs to stress the system…
Safety Risk: Running a high-intensity stress test like HPL on a system with a known hardware anomaly (a missing GPU) is dangerous and counterproductive. It will not “initialize“ a missing device; rather, it could potentially cause further damage if the missing GPU is due to a short circuit or power delivery failure.
Incomplete Data: HPL is a verification tool for known-good hardware, not a discovery tool for missing hardware.
Question 36 of 60
36. Question
An administrator needs to partition an NVIDIA A100 GPU into multiple instances to support concurrent small-scale training jobs and inference services. Which technology should be configured, and what is a key requirement for this configuration to be persistent across system reboots according to professional standards?
Correct
Correct: A. Configure Multi-Instance GPU (MIG) and ensure the MIG mode is enabled via nvidia-smi with the reboot flag.
Enabling MIG Mode: On the NVIDIA A100, MIG is disabled by default. It must be enabled using the command nvidia-smi -i <GPU_index> -mig 1 (omitting -i <GPU_index> applies the change to all GPUs).
Persistence of Mode: For Ampere (A100) GPUs, the status of “MIG Mode: Enabled“ is stored in the GPU‘s non-volatile InfoROM. This means once the mode is turned on, it persists across reboots. (Note: While the mode persists, the specific instances you create—like 1g.5gb—are ephemeral and require automation like mig-parted or a startup script to recreate them after a reboot.)
Reboot Requirement: On A100, changing the MIG mode usually requires a GPU reset or a full system reboot to re-enumerate the PCIe handles for the new virtual instances.
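Because the instances are ephemeral, sites typically re-apply a declarative MIG layout at boot. A sketch using the nvidia-mig-parted tool (assuming it is installed; the config path and label shown are site-specific examples):
# Enable MIG mode on GPU 0 (the mode itself persists in the A100 InfoROM)
sudo nvidia-smi -i 0 -mig 1
# Re-create the desired instances after each reboot, e.g. from a systemd unit
sudo nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config.yaml -c all-1g.5gb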
Incorrect: B. Configure NVLink Bridge settings in the BIOS to split the GPU into virtual lanes…
NVLink vs. MIG: NVLink is a high-speed interconnect for GPU-to-GPU communication. It cannot be used to “split“ a single GPU into smaller instances.
Wrong Driver: DOCA (Data Center Infrastructure-on-a-Chip Architecture) is primarily for BlueField DPUs, not for managing MIG partitioning on an A100 GPU.
C. Use Slurm to partition the GPU logically and save the configuration in the slurm.conf…
Hardware vs. Software: Slurm is a workload manager that can schedule jobs on existing MIG instances, but it does not create the physical hardware partitions.
Logical Partitioning: The question asks for the technology to support concurrent jobs with hardware isolation; Slurm‘s logical partitioning (GRES) without MIG does not provide the hardware-level QoS (Quality of Service) and memory isolation required for professional AI infrastructure standards.
D. Enable NVIDIA vGPU profiles within the VMware ESXi hypervisor…
Technology Overlap: While NVIDIA vGPU is a valid virtualization technology, MIG is the native, preferred way to partition an A100 at the hardware level without requiring a hypervisor.
MAC Addresses: GPUs do not use MAC addresses for slicing; they use UUIDs or Device Nodes. Assigning MAC addresses is a networking concept irrelevant to GPU compute partitioning.
Question 37 of 60
37. Question
An administrator notices that one GPU in an 8-GPU HGX system is consistently reporting higher temperatures and lower clock speeds than the others during training. What is the most likely cause and the appropriate remediation step?
Correct
Correct: B. The GPU is undergoing a ‘thermal throttle‘ due to a failing cooling fan or poor thermal paste contact; the GPU or the fan module should be replaced.
Thermal Throttling Mechanics: In an HGX system, if one GPU shows significantly higher temperatures than its peers under the same workload, it is likely experiencing Thermal Slowdown. When a GPU exceeds its thermal threshold (T-Limit), the firmware automatically reduces the clock speed (frequency) to lower power consumption and prevent permanent silicon damage.
Diagnostic Verification: The NCP-AII curriculum teaches the use of nvidia-smi -q -d PERFORMANCE to check the Clocks Throttle Reasons. If “Thermal Slowdown“ is marked as Active, it confirms the hardware is protecting itself from heat.
Remediation: In enterprise environments, if airflow is verified as clear, the next step is replacing the faulty sub-component—either the specific fan module serving that GPU “bay“ or, if the thermal interface material (TIM) has failed, the GPU/OAM module itself.
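A quick comparison sketch (sensor names differ by chassis vendor):
# Compare temperature, clocks and utilization across all eight GPUs under the same load
nvidia-smi --query-gpu=index,temperature.gpu,clocks.sm,utilization.gpu --format=csv
# Check chassis fan speeds through the BMC
sudo ipmitool sensor | grep -i fan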
Incorrect: A. The administrator should enable MIG on that specific GPU…
Logic Error: Multi-Instance GPU (MIG) is a resource partitioning technology, not a cooling solution. While smaller instances might generate less heat individually, it does not fix the underlying physical cooling failure. Furthermore, applying a software configuration change to a physically failing component violates the “Fault Isolation“ principles of the NCP-AII exam.
C. The GPU has been assigned to a higher priority Slurm queue…
Contradictory Result: A “higher priority“ queue would typically lead to higher clock speeds (as it would be under load) but should not cause a single GPU to behave differently than others in the same node if the cooling is functional. Higher priority does not change the physical thermal envelope of the hardware.
D. The DOCA driver version is too new for the hardware…
Incorrect Scope: DOCA (Data Center Infrastructure-on-a-Chip Architecture) primarily manages the BlueField DPU, not the thermal management of individual HGX GPUs.
System Uniformity: In a professional NVIDIA cluster, all GPUs in a node share the same kernel driver. You cannot have “different“ driver versions for individual GPUs on the same motherboard; a driver update or downgrade affects the entire system.
Question 38 of 60
38. Question
When designing the network topology for a multi-rack AI factory deployment, an architect must select the appropriate transceivers for the East-West compute fabric. The design requires 400Gb/s InfiniBand connectivity between Leaf and Spine switches with a maximum distance of 50 meters. Which transceiver type and cabling combination provides the best balance of signal quality, power efficiency, and cost for this specific distance?
Correct
Correct: B. Active Optical Cables (AOC) or Multimode Fiber with 400G-SR4 transceivers, as they are optimized for high-bandwidth communication within the 30 to 100-meter range common in AI clusters.
Distance Suitability: For 400Gb/s (NDR) InfiniBand, standard Passive Copper DACs are limited to approximately 2–3 meters due to signal attenuation. For a 50-meter run between racks (Leaf-to-Spine), AOCs or SR4 (Short Reach) transceivers over OM4/OM5 Multimode Fiber are the architectural standard.
Balance of Factors: AOCs and SR4 transceivers offer significantly lower cost and power consumption compared to Single-mode (DR4/FR4) solutions, while easily supporting the 50-meter requirement with high signal integrity.
AII Standard: The NCP-AII curriculum defines the “sweet spot“ for SR (Short Reach) optics as anything exceeding the 3-meter copper limit up to 100 meters.
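After cabling, the negotiated rate can be sanity-checked from a host and across the fabric (a sketch; review the generated ibdiagnet report for link-speed and BER warnings):
# Confirm each HCA port is Active at the expected NDR rate (look for "Rate: 400")
ibstat
# Fabric-wide sweep for links that trained below target speed or show high error counters
sudo ibdiagnet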
Incorrect: A. Standard Category 6 Ethernet cables using RJ45 connectors…
Technological Impossibility: Category 6 (or even 6A/7) Ethernet cabling is capped at 10Gb/s (or 40Gb/s at very short distances). It is physically impossible to run 400Gb/s InfiniBand over RJ45 copper cabling.
Protocol Mismatch: InfiniBand NDR uses OSFP or QSFP-DD form factors, which are incompatible with RJ45 enterprise patch panels.
C. Single-mode Fiber (SMF) with 400G-DR4 transceivers…
Over-engineering: While 400G-DR4 over Single-mode fiber would work for 50 meters, it is designed for reaches up to 500m or 2km (inter-row or inter-hall).
Cost/Power Penalty: DR4 transceivers are significantly more expensive and consume more power than SR4 equivalents. In an “AI Factory“ with thousands of links, using DR4 for 50-meter runs is considered a poor design choice in the NCP-AII framework.
D. Passive Copper Direct Attach Cables (DAC)… regardless of the 50-meter distance requirement.
Physical Limit: This is a “trap“ answer. While DACs do provide the lowest latency and zero power consumption, they cannot physically reach 50 meters at 400Gb/s. At NDR speeds, copper loses signal integrity after just a few meters.
Correction: The NCP-AII exam expects you to know that DACs are for “Top-of-Rack“ (Intra-rack) use only, typically limited to 1.5m–3m.
Question 39 of 60
39. Question
An IT professional is setting up the control plane for a new NVIDIA-certified cluster using Base Command Manager (BCM). During the installation, the administrator needs to configure High Availability (HA) for the head nodes. Which of the following is a requirement for a successful BCM HA configuration to ensure the cluster remains operational if the primary node fails?
Correct
Correct: B. A dedicated heartbeat network between the head nodes and a shared storage mechanism or synchronized database for the cluster configuration metadata.
Heartbeat Mechanism: BCM HA requires a low-latency, dedicated connection between the active and passive head nodes. This heartbeat allows the standby node to monitor the health of the primary node in real-time.
State Synchronization: For a failover to be successful, the cluster configuration (nodes, jobs, users, and LDAP/AD data) must be identical on both nodes. BCM achieves this through a synchronized database (typically MariaDB/MySQL) and shared or replicated storage for the /cm/shared and /home directories.
AII Standard: The NCP-AII curriculum emphasizes that without synchronized metadata and a reliable heartbeat, the cluster would experience a “split-brain“ scenario where both nodes attempt to manage the cluster simultaneously.
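A rough operational check, heavily hedged: interface names and addresses are illustrative, and the cmha utility name should be confirmed against your BCM release's administrator manual.
# Verify the dedicated heartbeat/failover link between the two head nodes
ping -I <failover-interface> <secondary-head-ip>
# Report which head node is active and whether database/storage sync is healthy
cmha status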
Incorrect: A. The installation of the NVIDIA Container Toolkit on the BMC…
Location Error: The NVIDIA Container Toolkit is installed on the host Operating System to manage GPU resources for containers; it is never installed on the BMC (Baseboard Management Controller), which runs its own specialized firmware.
Functional Error: The BMC does not handle the failover of the Slurm daemon. While BCM uses the OOB (Out-of-Band) network for power management, the HA logic resides within the BCM software layer on the head nodes themselves.
C. Using the NGC CLI to replicate the entire OS partition… to the NVIDIA GPU Cloud.
Inappropriate Use Case: The NGC (NVIDIA GPU Cloud) CLI is used for pulling container images, models, and datasets. It is not a backup or replication tool for OS partitions or local cluster metadata.
Latency Issues: Real-time failover in a high-performance cluster cannot depend on cloud-based recovery due to the massive latency and bandwidth requirements of replicating a live head node OS to the cloud.
D. Configuring the BlueField-3 DPUs to act as the primary head nodes…
Role Misalignment: While BlueField-3 DPUs can run Linux and offload infrastructure tasks (like networking or storage encryption), they are not designed to serve as the primary cluster head nodes for BCM.
Architectural Standard: In an NVIDIA-certified cluster, the head nodes are standard x86 or Grace-based servers with significant RAM and disk I/O. The DPUs serve as “engines“ within the compute nodes or storage fabric, not as the central management brain for the entire cluster.
Question 40 of 60
40. Question
A research team needs to run multiple small AI inference jobs on a single H100 GPU to maximize resource utilization. The administrator decides to implement Multi-Instance GPU (MIG). Which of the following conditions must be met to successfully configure and partition the GPU into multiple hardware-isolated instances?
Correct
Correct: A. The GPU must be in a specific MIG mode enabled through nvidia-smi.
The “Gatekeeper“ Setting: By default, NVIDIA H100 GPUs ship with MIG mode disabled. To partition the GPU, an administrator must first toggle the mode using the command nvidia-smi -i <GPU_index> -mig 1.
State Change: Enabling MIG mode triggers a re-enumeration of the GPU on the PCIe bus. On the H100 (unlike the older A100), this mode is driver-resident and typically requires a GPU reset or a systemctl restart nvidia-fabricmanager to take effect.
Prerequisite for Slicing: Until this mode is set to “Enabled,“ all commands to create specific GPU instances (e.g., 1g.10gb) will fail.
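A minimal end-to-end sketch for a single H100 (device index and instance mix are illustrative; the GPU must be idle for the reset):
# Enable MIG mode on GPU 0, then reset the GPU so the change takes effect
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi --gpu-reset -i 0
# Carve the GPU into instances for the inference jobs and verify the resulting MIG devices
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,2g.20gb -C
nvidia-smi -L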
Incorrect: B. The server must have at least 1TB of system RAM for MIG to function.
Arbitrary Requirement: MIG resource requirements are focused on the GPU‘s VRAM, not the host‘s system RAM. While AI workloads generally require significant system memory, there is no “1TB minimum“ enforced by the NVIDIA driver for MIG to function. The H100 itself manages the internal partitioning of its 80GB HBM3 memory.
C. The administrator must disable all NVLink connections between GPUs.
Compatibility: MIG and NVLink are not mutually exclusive. While a single MIG instance cannot span across multiple GPUs via NVLink (P2P is restricted within a single GPU‘s slices), you do not need to disable the physical NVLink fabric or the Fabric Manager to use MIG.
Fabric Manager Role: In fact, on HGX™ H100 systems, the NVIDIA Fabric Manager must be running and healthy for the GPUs to initialize correctly, regardless of whether MIG is being used.
D. The GPU must be connected via an external USB-C Thunderbolt cable.
Enterprise Hardware: The H100 is an enterprise-grade data center GPU typically found in SXM5 or PCIe form factors. It communicates via high-bandwidth PCIe Gen5 lanes or the SXM5 proprietary interconnect.
Consumer vs. Professional: Thunderbolt is a consumer/workstation interface and is never used as the primary interconnect for H100-based AI infrastructure in the scope of the NCP-AII certification.
Question 41 of 60
41. Question
During the configuration of the network interfaces for a cluster managed by Base Command Manager, the administrator must define the external and internal networks. Which statement accurately describes the best practice for configuring these interfaces to ensure secure and efficient cluster management and data movement?
Correct: D. Assign a private management network for BCM to communicate with node BMCs and a high-speed fabric for compute/storage traffic, keeping management and data planes separate.
Plane Separation: In a professional AI factory, the Management Plane (used for provisioning, monitoring, and OOB power control) must be physically or logically isolated from the Data Plane (the 400G/800G fabric used for RDMA/NCCL traffic).
The “internalnet“ Concept: BCM defines a primary management network (often called internalnet) to handle DHCP, PXE booting, and node management. This ensures that even if the compute fabric is saturated by an HPL or training job, the administrator maintains control over the hardware.
Reliability: Separating these planes prevents a “broadcast storm“ or network congestion on the compute fabric from locking out the BCM head node‘s ability to manage the cluster.
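Example (illustrative): BCM network objects are inspected and modified through cmsh or the Base View GUI; a hedged cmsh sketch (object and property names vary by BCM version) might look like:
cmsh -c "network; list"
cmsh -c "network; use internalnet; show"
The goal is a private internalnet for management and provisioning that is distinct from the high-speed compute and storage fabric definitions.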
Incorrect: A. Combine all management, compute, and storage traffic onto a single 1Gbps OOB network…
Bandwidth Bottleneck: A 1Gbps network is insufficient for AI data movement (which requires 400Gbps+).
Congestion Risk: Merging these types of traffic violates the core design principle of the NVIDIA DGX SuperPOD and BCM reference architectures, which mandate dedicated high-speed rails for compute and storage to ensure non-blocking performance.
B. Use the NGC CLI to assign static public IP addresses to all compute nodes…
Security Risk: Assigning public IP addresses to compute nodes is a severe security violation for an internal AI cluster.
Tool Misuse: The NGC CLI is a tool for managing container images and datasets; it has no functionality for IP address assignment or network interface configuration. Networking in BCM is handled via the cmsh (Command Management Shell) or the Base View GUI.
C. Configure the BlueField-3 DPU to bridge the management network directly into the NVLink fabric…
Architectural Error: NVLink is a specialized, ultra-high-speed point-to-point interconnect specifically for GPU-to-GPU memory access (e.g., between GPUs in an HGX tray). It is not a general-purpose network fabric and cannot be “bridged“ into a standard Ethernet management network.
DPU Role: While the BlueField-3 DPU can offload management tasks, its role is to accelerate the In-Band data plane (Encryption, NVMe-oF, or SDN), not to bridge OOB management into the NVLink switch fabric.
Question 42 of 60
42. Question
A researcher needs to partition an NVIDIA A100 GPU using Multi-Instance GPU technology to support seven distinct users, each requiring isolated compute and memory resources. Which configuration step and characteristic are essential for ensuring that these users do not interfere with each other‘s performance while running concurrent AI inference tasks?
Correct: A. The administrator must enable MIG mode using nvidia-smi and then create GPU instances based on the 1g.5gb profile to provide hardware-level isolation.
Maximum Density: An NVIDIA A100 (40GB or 80GB) features 7 Compute Slices. The 1g.5gb profile (on the A100-40GB) or the 1g.10gb profile (on the A100-80GB) is the only way to support up to seven distinct users simultaneously.
Hardware Isolation: Unlike software-level partitioning, MIG creates a dedicated execution path through the entire memory system. Each of the seven instances receives its own isolated L2 cache banks, memory controllers, and DRAM address buses. This ensures that one user‘s intensive inference task cannot cause a “noisy neighbor“ effect or latency spikes for the other six users.
AII Standard Workflow: The NCP-AII curriculum defines the mandatory sequence as (a command-level sketch follows the list):
Enable MIG mode (nvidia-smi -mig 1).
Create GPU Instances (GI).
Create Compute Instances (CI).
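Example (illustrative, for GPU 0 on an A100-40GB; available profiles and their IDs can be listed with nvidia-smi mig -lgip):
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -i 0 -cgi 1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb -C
nvidia-smi -L
The first command enables MIG mode, the second creates seven 1g.5gb GPU instances along with their default compute instances (-C), and the last lists the seven resulting MIG devices and their UUIDs for assignment to users.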
Incorrect: B. The administrator should disable the NVIDIA drivers and use the Linux cgroups utility…
Driver Dependency: The NVIDIA driver is required to communicate with the GPU hardware. Disabling it would make the GPU invisible to the OS.
Visibility vs. Isolation: While cgroups (Control Groups) can limit CPU or system RAM, they cannot manage the internal hardware sub-partitions of a GPU. Only the NVIDIA driver/MIG can physically isolate the GPU‘s internal crossbar and cache.
C. MIG instances must be configured to allow memory oversubscription…
QoS Guarantee: One of the defining characteristics of MIG in the NCP-AII framework is that it prohibits oversubscription. Each instance is granted a fixed, dedicated amount of VRAM. If an instance has 5GB, it cannot “borrow“ from an idle neighbor. This is what provides the “deterministic performance“ required for professional AI infrastructure.
D. The users must share the same CUDA context and use software-level time-slicing…
Context Isolation: MIG allows each instance to have its own independent CUDA context. This means if one user's process crashes (e.g., an XID error), it only impacts their specific MIG instance and does not bring down the other six users.
Parallelism vs. Rotation: Time-slicing rotates jobs sequentially (introducing latency), whereas MIG runs them concurrently in parallel on separate hardware slices. NCP-AII emphasizes MIG as the superior solution for low-latency inference compared to legacy time-slicing.
Question 43 of 60
43. Question
When designing the network topology for a large-scale AI factory, an architect must choose between different cabling and transceiver types for the 400Gb/s NDR InfiniBand compute fabric. Given a scenario where the distance between the leaf switches and the spine switches is 45 meters, which cabling solution is the most appropriate to ensure signal integrity and performance?
Correct: B. Active Optical Cables (AOC) or Multimode Fiber with SR4 transceivers to support the distance while maintaining the required 400Gb/s bandwidth.
The 50-Meter Standard: For 400Gb/s (NDR) InfiniBand using 100G-PAM4 modulation, the maximum reach for Multimode Fiber (OM4/OM5) is exactly 50 meters. Since the leaf-to-spine distance is 45 meters, this is the most cost-effective and power-efficient solution.
AOC vs. Transceivers: Both AOCs (which have permanently attached transceivers) and SR4 (Short Reach, 4-channel) transceivers with MPO-12/APC jumpers are validated for this distance.
Efficiency: SR4 transceivers consume approximately 8 Watts per port, whereas long-reach single-mode transceivers consume more power and are significantly more expensive.
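Example (illustrative): once the 45-meter links are cabled, the negotiated rate can be confirmed with standard InfiniBand diagnostics, for example:
ibstat | grep -i rate          # active rate per local HCA port (should report 400 for NDR)
iblinkinfo                     # per-link width and speed across the whole fabric
A port that trains below NDR speed usually points to a contaminated connector, a damaged fiber, or a mismatched transceiver.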
Incorrect: A. Category 6A Ethernet cables with RJ45 connectors…
Throughput Limitation: Cat 6A is physically incapable of supporting 400Gb/s. It is capped at 10Gb/s.
Form Factor Mismatch: 400G NDR InfiniBand utilizes OSFP or QSFP112 connectors. RJ45 connectors are not used in high-speed compute fabrics in the NCP-AII curriculum.
C. Passive Copper Direct Attach Cables (DAC)…
The 3-Meter Limit: At 400Gb/s (NDR), signal attenuation in copper is extreme. Passive DACs are strictly limited to a maximum length of 3 meters (or 5 meters for Active Copper Cables/ACC). Attempting to run a 45-meter link with copper would result in immediate signal failure and is a “trap“ answer on the exam.
D. Single-mode fiber with LR4 transceivers…
Misinformation on Multimode: The claim that multimode cannot exceed 100Gb/s is incorrect; modern SR4 and SR8 transceivers easily handle 400Gb/s and 800Gb/s.
Over-Engineering: While Single-mode Fiber (SMF) with DR4/LR4 optics can reach 500m to 10km, it is unnecessary for a 45-meter run. In the NCP-AII framework, using SMF for short distances is discouraged due to higher costs and power budgets (typically 12W-17W for long-reach optics).
Question 44 of 60
44. Question
To verify the health and performance of the inter-GPU communication within a node, an administrator executes the NVIDIA Collective Communications Library (NCCL) tests. If the all_reduce test shows significantly lower bandwidth than expected on an HGX system, which specific hardware component should be investigated first?
Correct: B. The NVLink Switch and NVLink connections.
Intra-node Bottleneck: In an HGX™ system (like the H100 or A100), GPUs communicate directly with one another through NVLink via on-board NVSwitches. The all_reduce test is a collective operation where every GPU both sends and receives data. If bandwidth is low, the primary hardware suspect is the high-speed fabric responsible for that specific traffic.
Fabric Integrity: Low bandwidth or high latency in NCCL usually indicates a “degraded“ link where one or more NVLink lanes are down or operating at reduced speeds due to signal integrity issues.
NCP-AII Diagnostic Path: The curriculum teaches using nvidia-smi nvlink -s to check for link errors and the NVIDIA Fabric Manager logs to ensure the NVSwitch topology is correctly initialized.
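Example (illustrative; exact sub-flags vary slightly across driver versions):
nvidia-smi nvlink -s                                 # link state and speed for every NVLink on each GPU
nvidia-smi nvlink -e                                 # per-link error counters (replay, recovery, CRC)
journalctl -u nvidia-fabricmanager | tail -n 50      # confirm the NVSwitch topology initialized cleanly
Any link reported as inactive or steadily accumulating errors is the prime suspect for the low all_reduce bandwidth.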
Incorrect: A. The 1GbE management switch.
Purpose Mismatch: The 1GbE (Gigabit Ethernet) network is the Management Plane used for Out-of-Band (OOB) tasks like IPMI and BMC access.
Bandwidth Disparity: NCCL all_reduce operations on an HGX system target bandwidths in the range of hundreds of gigabytes per second (e.g., 900 GB/s for H100). A 1GbE switch would be thousands of times too slow to even participate in this data path, and NCCL does not use the management network for compute collectives.
C. The SATA boot drive.
Data Path Isolation: The boot drive is used for the Operating System and loading the initial application binaries. Once the NCCL test is running, all data movement occurs between GPU memory (HBM) and the NVLink/Network fabric.
Latency vs. Throughput: A slow SATA drive might increase the time it takes to start the test, but it has zero impact on the bandwidth of the GPU-to-GPU communication once the kernels are executing.
D. The TPM 2.0 module.
Security vs. Performance: The Trusted Platform Module (TPM) is a security chip used for hardware-based root of trust, encryption keys, and secure boot.
Irrelevance: The TPM module does not sit on the data path for GPU collectives. It has no functional role in the high-speed signaling or memory copy operations performed during a Linpack or NCCL benchmark.
Question 45 of 60
45. Question
A cluster administrator is running the NVIDIA Collective Communications Library (NCCL) tests to verify the East-West fabric bandwidth. They observe that the all-reduce performance is significantly lower than expected for an NDR InfiniBand network. Which tool should be used to verify if the NVLink Switch fabric is functioning correctly within the nodes?
Correct: C. The nvidia-smi nvlink --status command to check the health and lane activity of the internal GPU interconnects.
Internal Fabric Validation: NCCL all-reduce operations rely heavily on the NVLink fabric for intra-node (inside the box) communication. If performance is low, the first step is to ensure that the physical links between GPUs and the NVSwitches are up and running at full width (e.g., 18 lanes for H100).
Link Health: This command provides real-time status on whether the links are “Active“ or “Inactive.“ It also allows the administrator to see if there are high error counts (symbol errors or recovery counts) on specific lanes, which indicates a hardware fault in the HGX™ tray.
AII Standard: The NCP-AII curriculum identifies nvidia-smi as the primary tool for GPU-level health, while ibdiagnet is used for the external InfiniBand fabric.
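Example (illustrative; the binary and flags come from the open-source nccl-tests suite and may differ per build):
./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8         # measure all-reduce bus bandwidth across 8 GPUs
nvidia-smi nvlink --status                           # confirm every NVLink reports Active at full width
If the measured bus bandwidth is far below the platform target while all links report Active, the investigation moves outward to the NDR fabric with ibdiagnet.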
Incorrect: A. The hpl-burnin script to check if the CPU-to-Memory bandwidth is bottlenecking…
Wrong Data Path: HPL (High-Performance Linpack) focuses on floating-point compute performance and CPU-to-GPU memory copies (PCIe). It does not specifically stress or report on the bandwidth of the NVLink Switch fabric used for the collective operations mentioned in the scenario.
Scope: While HPL is part of the burn-in process, it is a “system-wide“ benchmark rather than a surgical diagnostic tool for interconnect lanes.
B. The ipmitool sensor list command to check if the ambient temperature…
Secondary Indicator: While high temperatures can cause throttling, ipmitool provides chassis-level thermal data. It does not provide any visibility into the logical status or lane activity of the NVLink fabric.
Throttling Mechanics: If a switch throttles, it affects the clock speed, but it wouldn‘t tell you if a specific NVLink lane has failed or is experiencing high Bit Error Rates (BER), which is what nvidia-smi reveals.
D. The ngc registry image list command to ensure the latest version… is available in the cloud.
Cloud vs. Local: This command simply lists available containers on the NVIDIA GPU Cloud (NGC). It does not provide any diagnostic information about the physical hardware currently running in the data center.
Version Check: While having the latest NCCL library is a best practice, a “significantly lower“ bandwidth result on an NDR network is almost always a physical layer or configuration issue, not a cloud-registry-versioning issue.
Question 46 of 60
46. Question
A data center engineer is connecting several NVIDIA DGX nodes to a leaf-and-spine network fabric. To ensure optimal performance and avoid signal degradation, the engineer must validate the cabling and transceivers. Which specific action correctly identifies a fault in the physical layer when the link fails to come up at the expected 400Gbps or 800Gbps speed?
Correct: A. Inspect the optical fiber end-faces for contamination using a digital microscope and verify that the transceiver power levels in the BMC fall within the specified dBm range.
Physical Layer Precision: At 400Gbps and 800Gbps, the signal margin is extremely tight. Even a single speck of dust can cause high Bit Error Rates (BER) or a total link failure. Use of a fiber inspection microscope is a standard “Day 1“ validation step.
BMC Monitoring: The Baseboard Management Controller (BMC) provides Out-of-Band (OOB) access to internal sensor data. Checking the dBm (decibel-milliwatts) levels for Transmit (Tx) and Receive (Rx) power is the primary way to confirm if a transceiver is seated correctly and receiving a clean light signal.
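Example (illustrative): as a host-side complement to the BMC readings, module digital diagnostics (Tx/Rx power in dBm) can usually be read directly; the interface name below is a placeholder:
ethtool -m enp65s0f0np0 | grep -i power      # optical Tx/Rx power for an Ethernet-mode port
For InfiniBand ports, the NVIDIA MFT mlxlink utility exposes equivalent module and signal-integrity data.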
Incorrect: B. Swap the OSFP transceivers with SFP+ equivalents to determine if the high-speed signaling is being restricted by the HGX firmware power limits.
Physical Incompatibility: OSFP (Octal Small Form-factor Pluggable) used for 400G/800G links is physically larger and electrically different from SFP+ (10G/25G). They are not interchangeable.
Logic Error: While HGX firmware does manage power, you cannot diagnose a high-speed signaling issue by downgrading to a completely different hardware standard.
C. Reinstall the NVIDIA Container Toolkit to ensure that the drivers can properly communicate with the physical layer transceivers and reset the link state.
Layer Mismatch: The NVIDIA Container Toolkit is a software utility that allows containers (like Docker) to access GPU resources. It operates at the Application/Runtime layer and has no control over the hardware-level Link Training or physical transceiver negotiation.
Irrelevant to Physical Link: If the “link fails to come up“ at the physical level, the OS/Drivers cannot see the device properly, making a container-level fix useless.
D. Increase the MTU size to 9000 on the BMC management port to force the InfiniBand transceivers to negotiate a higher clock speed via the OOB network.
Path Confusion: The BMC management port is a standard 1GbE Ethernet port used for management. It is electrically isolated from the high-speed InfiniBand/Compute fabric.
Technical Misconception: MTU (Maximum Transmission Unit) dictates the size of data packets, not the physical clock speed or frequency of the transceiver. Furthermore, InfiniBand and Ethernet speeds are determined during the hardware “handshake“ (Link Training), not via management port settings.
Question 47 of 60
47. Question
During the verification of an AI cluster, the administrator executes a High-Performance Linpack (HPL) test. The HPL score is significantly lower than the expected GFLOPS for an H100-based system. Which of the following is the most likely cause of this performance discrepancy?
Correct: C. The GPUs are experiencing thermal throttling due to inadequate cooling, or the CPU‘s power-management settings are limiting the data feed rate.
Thermal Throttling: On H100 systems, a GPU can draw up to 700W. If the data center‘s cooling or the server‘s internal fans cannot dissipate this heat, the GPU firmware triggers a thermal slowdown. This drops the clock speeds significantly to prevent hardware damage, resulting in a direct and massive reduction in GFLOPS during HPL.
CPU Power Management: HPL is not just a GPU test; it is a system-wide stress test. The CPU is responsible for managing the DGEMM (Double-precision General Matrix Multiplication) problem distribution and feeding data to the GPUs. If the CPU is in a “Power Save“ or “Balanced“ profile instead of the required “High Performance“ mode, it creates a bottleneck that starves the GPUs, leading to suboptimal scores.
AII Diagnostic Path: The NCP-AII curriculum instructs administrators to use nvidia-smi -q -d PERFORMANCE to check for Thermal Slowdown status and cpupower frequency-info to verify the CPU governor.
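Example (illustrative):
nvidia-smi -q -d PERFORMANCE | grep -A8 -i throttle     # any active Thermal Slowdown reasons
cpupower frequency-info | grep -i governor              # should report the performance governor, not powersave
If either check fails, the HPL score will stay depressed no matter how the benchmark parameters themselves are tuned.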
Incorrect: A. The NVIDIA Container Toolkit is missing a license key…
Functional Error: The NVIDIA Container Toolkit (which allows Docker to access GPUs) is open-source and does not require a license key to operate.
Performance Logic: Licenses are required for some enterprise software like NVIDIA AI Enterprise (NVAIE) for support and certain proprietary libraries, but there is no mechanism that artificially throttles GPU clock speeds to 10% based on a missing cloud license.
B. The Slurm scheduler has not been configured to use the BlueField-3 DPU…
Architectural Misalignment: HPL math kernels run primarily on the GPU Tensor Cores (and secondarily on the CPU). While BlueField-3 DPUs can offload network and storage tasks, they are not designed to act as “floating-point accelerators“ for the heavy matrix math required by Linpack.
Slurm‘s Role: Slurm is a scheduler that places the job on the nodes; it does not change the internal math kernel execution path between the CPU and DPU.
D. The InfiniBand transceivers are using the wrong version of the TPM protocol…
Security vs. Math: The TPM (Trusted Platform Module) is a security chip for encryption keys and secure boot. It has no “TPM protocol“ associated with InfiniBand transceivers.
Non-existent Feature: HPL results are standard log files; they do not require encryption via a hardware TPM before being saved to disk. Even if they did, this would have no impact on the GFLOPS (computational speed) measured during the test itself.
Question 48 of 60
48. Question
A system administrator is tasked with the initial physical bring-up of an NVIDIA HGX H100 system within a new AI factory deployment. During the validation phase, the administrator must ensure that the power and cooling parameters are within the specific tolerances required for maximum TDP workloads. Which sequence of actions is most critical for validating that the server environment can sustain high-performance AI training without thermal throttling or power delivery failures?
Correct: D. Utilize the NVIDIA System Management Interface (nvidia-smi) to monitor GPU temperatures while running a synthetic workload and cross-reference these with the Out-of-Band (OOB) management thermal sensors.
Active Validation: High-Performance AI training consumes peak power. The NCP-AII curriculum dictates that passive monitoring is insufficient. You must stress the system (e.g., using burn-in scripts or HPL) and use nvidia-smi to check the Clocks Throttle Reasons.
Sensor Correlation: An essential skill for an AII professional is correlating In-Band data (from the driver/GPU) with Out-of-Band data (from the BMC/IPMI). If nvidia-smi reports high temperatures but the BMC shows low fan speeds, it indicates a failure in the cooling policy or thermal communication.
TDP Headroom: By running a workload, you validate that the Power Distribution Units (PDUs) and the server‘s 6x 3.3 kW power supplies can handle the transient loads without triggering a power-cap or an OCP (Over-Current Protection) event.
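Example (illustrative, run while a burn-in or HPL job is active):
nvidia-smi --query-gpu=index,temperature.gpu,power.draw,clocks_throttle_reasons.active --format=csv -l 5
ipmitool sdr type Temperature        # cross-check in-band readings against the BMC/OOB thermal sensors
ipmitool sdr type Fan                # confirm the cooling policy ramps fan speed as the load rises
Sustained agreement between the two data sources, with no throttle reasons asserted, indicates the environment can hold maximum-TDP workloads.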
Incorrect: A. Inspect the physical cabling… and ensure that the TPM is initialized before checking thermal or power parameters.
Logic Gap: While cabling is important, the Trusted Platform Module (TPM) is a security component for root-of-trust and encryption. It has no functional impact on a server‘s ability to sustain thermal or power loads during training. Initializing the TPM is a secondary security step, not a primary thermal validation step.
B. Perform a firmware upgrade… and then manually disable all fan speed controls.
Dangerous Practice: Manually disabling fan controls (locking them at a low speed) in a high-TDP HGX system is a guaranteed way to cause hardware damage or immediate emergency shutdown. The NCP-AII standard practice is to ensure the BMC cooling policy is set to “Performance“ or “Heavy IO“ mode, letting the system manage the fan curve dynamically based on sensor feedback.
C. Check the BMC web interface… and verify that the ambient room temperature is exactly twenty-five degrees Celsius.
Static vs. Dynamic: 25°C is a common target for data centers, but the exam expects you to know that the server must be validated across a range of conditions. Simply checking a static temperature does not validate the system’s behavior under the massive heat output generated by an 8-GPU H100 baseboard during actual AI operations.
Unattempted
Correct: D. Utilize the NVIDIA System Management Interface (nvidia-smi) to monitor GPU temperatures while running a synthetic workload and cross-reference these with the Out-of-Band (OOB) management thermal sensors.
Active Validation: High-Performance AI training consumes peak power. The NCP-AII curriculum dictates that passive monitoring is insufficient. You must stress the system (e.g., using burn-in scripts or HPL) and use nvidia-smi to check the Clocks Throttle Reasons.
Sensor Correlation: An essential skill for an AII professional is correlating In-Band data (from the driver/GPU) with Out-of-Band data (from the BMC/IPMI). If nvidia-smi reports high temperatures but the BMC shows low fan speeds, it indicates a failure in the cooling policy or thermal communication.
TDP Headroom: By running a workload, you validate that the Power Distribution Units (PDUs) and the server‘s 6x 3.3 kW power supplies can handle the transient loads without triggering a power-cap or an OCP (Over-Current Protection) event.
Incorrect: A. Inspect the physical cabling… and ensure that the TPM is initialized before checking thermal or power parameters.
Logic Gap: While cabling is important, the Trusted Platform Module (TPM) is a security component for root-of-trust and encryption. It has no functional impact on a server‘s ability to sustain thermal or power loads during training. Initializing the TPM is a secondary security step, not a primary thermal validation step.
B. Perform a firmware upgrade… and then manually disable all fan speed controls.
Dangerous Practice: Manually disabling fan controls (locking them at a low speed) in a high-TDP HGX system is a guaranteed way to cause hardware damage or immediate emergency shutdown. The NCP-AII standard practice is to ensure the BMC cooling policy is set to “Performance“ or “Heavy IO“ mode, letting the system manage the fan curve dynamically based on sensor feedback.
C. Check the BMC web interface… and verify that the ambient room temperature is exactly twenty-five degrees Celsius.
Static vs. Dynamic: 25°C is a common target for data centers, but the exam expects you to know that the server must be validated across a range of conditions. Simply checking a static temperature does not validate the system’s behavior under the massive heat output generated by an 8-GPU H100 baseboard during actual AI operations.
Question 49 of 60
49. Question
During the verification of a BlueField-3 DPU deployment, a technician must confirm that the firmware and software versions on the DPUs and the associated transceivers are correct. Which tool or command is most appropriate for checking the firmware version of an optical transceiver plugged into a BlueField DPU or an NVIDIA Quantum switch?
Correct
Correct: B. The ‘flint‘ utility from the MFT (Mellanox Firmware Tools) package can be used to query the attributes and firmware versions of the devices and their components.
The Industry Standard: The Mellanox Firmware Tools (MFT) package is the primary suite for managing NVIDIA networking hardware. The flint utility is specifically designed to burn, query, and verify firmware for network adapters (NICs), DPUs, and switches.
Transceiver Firmware (IFFU): Modern 400G (NDR) and 800G (XDR) transceivers are active components that run their own firmware. To query a transceiver plugged into a BlueField DPU or switch, an administrator uses a command of the form flint -d <mst_device> --linkx q (see the sketch below).
LinkX Support: The --linkx flag allows flint to communicate through the DPU‘s management interface to the EEPROM of the optical module, extracting versioning and diagnostic data that standard OS tools cannot see.
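A hedged sketch of the query workflow, assuming the MFT package is installed; the MST device path shown is a placeholder taken from 'mst status', and flag spellings can differ slightly between MFT releases:
# Start the MST driver and list the available devices
mst start
mst status -v
# Query the adapter/DPU firmware itself
flint -d /dev/mst/mt41692_pciconf0 q
# Query the firmware of the plugged-in LinkX transceiver through the same device
flint -d /dev/mst/mt41692_pciconf0 --linkx q
# Alternative: mlxlink can show module (cable/transceiver) information for the port
mlxlink -d /dev/mst/mt41692_pciconf0 -m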
Incorrect: A. The ‘nvidia-smi -q‘ command will provide a detailed report including the firmware versions of all connected network transceivers…
Scope Mismatch: nvidia-smi is the tool for managing GPUs. While it can show some information about the PCIe link to a NIC, it does not have the capability to query the internal firmware of networking transceivers or the BlueField DPU‘s subsystem.
Protocol Gap: Transceiver management uses I2C or specialized vendor-specific registers (MCC/PRM) that are handled by the networking driver stack (MLNX_OFED), not the NVIDIA GPU driver.
C. A physical inspection of the transceiver‘s serial number label is the only way…
Information Inaccuracy: Labels typically contain the Part Number (P/N), Serial Number (S/N), and Hardware Revision. However, firmware is software stored in the module‘s flash memory. Just as a GPU‘s label won‘t tell you its current driver version, a transceiver‘s label won‘t reflect a firmware update performed in the field. Software-based querying (via flint or mlxlink) is the professional standard.
D. The ‘ip link show‘ command provides a verbose output that includes the manufacturer‘s firmware version…
OS Limitation: The ip command (part of the iproute2 package) is a high-level Linux networking tool for managing interfaces and routing tables. It provides MAC addresses and link status but does not interact with the low-level hardware registers required to pull firmware versions from optical modules.
Question 50 of 60
50. Question
During the ‘Cluster Test and Verification‘ phase, an administrator uses NVIDIA ClusterKit. What is the primary function of this tool in the context of a multi-node AI factory assessment?
Correct
Correct: C. To perform a multifaceted assessment of node health, including PCIe bandwidth and GPU connectivity.
End-to-End Validation: ClusterKit is a specialized diagnostic suite designed to test the entire data path of an AI node. It doesn‘t just check if a GPU is “there“; it measures GPU-to-GPU bandwidth (via NVLink), GPU-to-Host bandwidth (via PCIe), and memory copy speeds.
Fabric Verification: It automates tests like NCCL all-reduce and MPI latency benchmarks across multiple nodes to ensure the InfiniBand or Ethernet fabric is correctly configured and free of “silent“ performance degraders like bad transceivers or misconfigured switches (a hedged, hand-run equivalent using nccl-tests is sketched below).
Health Check: In the NCP-AII framework, ClusterKit is the recommended tool to run before moving to heavier burn-in tests like HPL. If ClusterKit reveals that one GPU has only 50% of the expected PCIe bandwidth, the administrator knows to check the physical seating or the BIOS bifurcation settings first.
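For illustration, the kind of multi-node all-reduce check that ClusterKit automates can also be run by hand with the open-source nccl-tests suite. This is a hedged sketch, not the ClusterKit invocation itself; hostnames, GPU counts, and the build path are placeholders:
# Build nccl-tests (requires CUDA, NCCL, and an MPI installation)
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1
# Run an 8-GPU-per-node all-reduce sweep across two nodes and inspect the reported bus bandwidth
mpirun -np 16 -H node01:8,node02:8 ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1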
Incorrect: A. To replace the need for the Linux operating system.
Dependency: ClusterKit is a Linux-based utility. It requires a functional OS (such as Ubuntu or RHEL), a working NVIDIA driver stack, and a communication layer (like SSH or Slurm) to execute its tests across the cluster. It is a management tool, not an operating system.
B. To act as a primary compiler for CUDA C++ code.
Functional Error: While ClusterKit uses pre-compiled CUDA kernels to test GPU performance, it is not a development tool. Compiling CUDA code is the role of the NVCC (NVIDIA CUDA Compiler), which is part of the CUDA Toolkit, not ClusterKit.
D. To design the physical floor plan of the data center.
Misplaced Tooling: Designing floor plans, rack layouts, and power distribution is part of the “Physical Infrastructure“ phase and typically involves CAD software or specialized data center infrastructure management (DCIM) tools. ClusterKit is used once the hardware is already racked, cabled, and powered on to verify its logical and electrical performance.
Question 51 of 60
51. Question
While troubleshooting a performance bottleneck in an AI factory, an engineer discovers that the GPU-to-GPU communication within a single node is significantly slower than the NVLink 4.0 specifications. What is a likely reason for this internal communication slowdown?
Correct
Correct: C. The NVLink Bridge is missing or improperly seated, or the system BIOS has been configured to use the PCIe bus for all peer-to-peer traffic instead of NVLink.
Physical Connectivity: In PCIe-based multi-GPU servers (like a 4-GPU workstation or specific HGX configurations), physical NVLink Bridges must be installed to enable the high-speed path. If they are loose or missing, the GPUs will default to communicating over the PCIe Gen5 bus, which is roughly 14x slower (64 GB/s vs 900 GB/s).
BIOS/Firmware Settings: Even with physical connectivity, the BIOS must have “Above 4G Decoding“ enabled and P2P communication correctly configured. If the BIOS or the NVIDIA Fabric Manager is misconfigured, the software stack (CUDA/NCCL) will report that NVLink is “unavailable“ and use the slower system bus instead.
AII Diagnostic Path: The certification teaches using nvidia-smi nvlink --status to verify link health and nvidia-smi topo -m to see if the “Matrix“ shows NV# (NVLink) or SYS/NODE (PCIe/System bus) between GPU pairs.
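A minimal sketch of that diagnostic path on a multi-GPU Linux node, assuming only a working NVIDIA driver (the Fabric Manager check applies to NVSwitch-based HGX baseboards):
# Check the state and per-link status of every NVLink on every GPU
nvidia-smi nvlink --status
# Show the topology matrix; NV# entries indicate NVLink paths, SYS/NODE/PHB indicate PCIe or system-bus paths
nvidia-smi topo -m
# On NVSwitch systems, confirm the Fabric Manager service is running
systemctl status nvidia-fabricmanager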
Incorrect: A. The BlueField-3 DPU has been set to ‘Promiscuous Mode‘…
Fabric Isolation: NVLink is an internal GPU-to-GPU interconnect that exists on a completely separate physical layer from the networking fabric managed by the BlueField-3 DPU.
Architectural Impossibility: Internal NVLink traffic does not pass through the DPU or the PCIe controller of the DPU. Promiscuous mode affects Ethernet/InfiniBand packets, not high-speed memory-coherence signals on the NVLink fabric.
B. The administrator has enabled ‘Eco-MIG‘…
Fictional Feature: There is no such profile as “Eco-MIG“ in the NVIDIA software stack. While Multi-Instance GPU (MIG) partitions a single GPU into instances, it does not have a mode that intentionally throttles NVLink bandwidth to save electricity.
MIG and NVLink: In fact, enabling MIG on certain older architectures actually disables P2P communication via NVLink between those instances to ensure strict hardware isolation.
D. The Slurm scheduler is running in ‘debug‘ mode…
Scheduler Role: Slurm is a workload manager that handles job placement and resource allocation. Once a job is launched, Slurm stays “out of the way.“
Traffic Interception: Slurm does not possess the hardware hooks to “intercept NVLink packets.“ NVLink is a hardware-level memory interconnect; the packets are moved by the NVLink Switch and GPU hardware logic, which is invisible to the operating system‘s job scheduler.
Question 52 of 60
52. Question
A cluster administrator is optimizing an AI factory with both AMD and Intel-based servers. They notice that the storage throughput is lower than expected on the Intel nodes. Which optimization step should be performed to improve the I/O performance for AI workloads?
Correct
Correct: B. Check the PCIe bifurcation settings in the BIOS and ensure that the ‘Data Direct I/O‘ (DDIO) and ‘IOMMU‘ settings are optimized for high-speed networking.
Intel-Specific Optimization (DDIO): For Intel-based nodes, Intel Data Direct I/O (DDIO) is a critical performance feature. It allows I/O devices (like InfiniBand NICs or NVMe drives) to post data directly to the CPU‘s Last Level Cache (LLC) instead of system memory. In AI workloads, where high-speed storage throughput is required to keep GPUs fed, DDIO significantly reduces latency and memory bandwidth contention.
IOMMU and GPUDirect: The NCP-AII curriculum emphasizes that IOMMU (Input-Output Memory Management Unit) settings can drastically affect GPUDirect Storage (GDS) and RDMA performance. While IOMMU provides security and virtualization benefits, it can introduce a “bounce buffer“ performance penalty. For bare-metal AI training, IOMMU is typically set to “Pass-through“ or disabled to allow direct P2P (Peer-to-Peer) memory access between the storage and the GPU.
PCIe Bifurcation: Proper bifurcation ensures that the PCIe lanes are correctly divided to support high-speed NVMe and NICs at their maximum rated speeds (e.g., Gen5 x16).
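A hedged sketch of the host-side checks that accompany these BIOS changes (the PCIe device address is a placeholder; DDIO itself is toggled in BIOS setup rather than from the OS):
# Confirm how the IOMMU was brought up; 'iommu=pt' (pass-through) is the usual bare-metal AI setting
cat /proc/cmdline
dmesg | grep -i -e DMAR -e IOMMU
# Verify that the NIC/NVMe devices negotiated their full PCIe width and speed (e.g., Gen5 x16)
sudo lspci -vv -s 0000:17:00.0 | grep -i -e LnkCap -e LnkSta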
Incorrect: A. Use the NGC CLI to convert all datasets… into a single TPM-encrypted binary blob.
Tool Misuse: The NGC CLI is used for pulling containers, models, and datasets from the cloud. It is not a data engineering tool for file format conversion or encryption.
Performance Impact: Encrypting datasets into a single “blob“ via the TPM would actually decrease performance by creating a massive CPU bottleneck during the decryption process, and it does not address underlying storage I/O throughput issues.
C. Install the NVIDIA SMI tool on the storage controllers to allow them to offload… to the BlueField-3 DPU.
Purpose Mismatch: NVIDIA-SMI (System Management Interface) is a utility for monitoring and managing GPUs. It has no functionality related to NFS metadata processing or storage controller offloading.
Software Layer: Offloading storage tasks to a BlueField-3 DPU is handled via the NVIDIA DOCA software framework or specialized storage drivers, not a GPU management tool like SMI.
D. Disable the fans on the storage array to reduce the acoustic vibration…
Hardware Danger: Disabling fans on a storage array will lead to immediate thermal failure and potential data loss.
Signal Theory Error: While acoustic vibration can theoretically affect high-density spinning disks (HDD), AI factories primarily use NVMe (SSD) which is unaffected by sound. Furthermore, storage vibrations have no physical mechanism to interfere with the high-frequency electromagnetic signals of NVLink internal to a separate server node.
Question 53 of 60
53. Question
A storage bottleneck is suspected because the GPUs are frequently idling while waiting for data. The storage system is connected via NVMe-over-Fabrics. Which optimization technique would most effectively reduce the CPU overhead on the compute nodes and improve data throughput to the GPUs?
Correct
Correct: C Implementing GPUDirect Storage (GDS) to bypass the CPU and system memory.
The Technical Reason: In a standard data path, data must be copied from storage to the CPU‘s system memory (RAM) and then to the GPU memory. This creates a “bounce buffer“ that consumes CPU cycles and increases latency.
The NCP-AII Context: GPUDirect Storage enables a direct DMA (Direct Memory Access) path between NVMe-oF storage and GPU memory. This significantly reduces CPU overhead and removes the system memory bottleneck, which is the primary goal in high-performance AI infrastructure.
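A minimal sketch of verifying that GDS is actually usable on a compute node, assuming the GDS/cuFile packages that ship with the CUDA toolkit are installed (the tool path, and whether it is gdscheck or gdscheck.py, varies by CUDA release):
# Print the GDS platform support matrix: driver, filesystem, and NIC compatibility
/usr/local/cuda/gds/tools/gdscheck -p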
Incorrect: A. Installing a faster web browser on the head node A web browser is an application-layer tool used for management interfaces (like a GUI for a cluster manager). It has zero impact on the high-speed data plane or the hardware-level throughput between NVMe storage and the GPU. This is a “nonsense“ distractor in the context of infrastructure performance.
B. Switching from NVMe-oF to legacy NFS over a 1GbE network This would actually worsen the problem.
Protocol: NVMe-over-Fabrics is designed for low latency; legacy NFS adds significant overhead.
Bandwidth: A 1GbE management network is roughly 100x–200x slower than the high-speed InfiniBand or Ethernet fabrics (100G/200G/400G) typically used in NVIDIA AI deployments.
D. Reducing the number of GPUs used in the training job While this would technically lower the total data demand, it does not solve the bottleneck or the CPU overhead. It simply reduces the scale of your compute. In an AI infrastructure environment, the goal is to keep all GPUs at 100% utilization; reducing the GPU count is a step backward in productivity and does not optimize the data path.
Question 54 of 60
54. Question
A storage bottleneck is suspected because the GPUs are frequently idling while waiting for data. The storage system is connected via NVMe-over-Fabrics (NVMe-oF). Which optimization technique would most effectively reduce the CPU overhead on the compute nodes and improve data throughput to the GPUs?
Correct
Correct: A Implementing GPUDirect Storage (GDS) to bypass the CPU and system memory.
The Technical Reason: Traditionally, data moving from storage to GPU memory must first be copied into a “bounce buffer“ in the CPU‘s system RAM. This process consumes CPU cycles and creates a latency bottleneck in the PCIe bus and system memory.
The NCP-AII Context: GPUDirect Storage (GDS) enables a Direct Memory Access (DMA) path between the NVMe-oF storage (via the NIC or HBA) and the GPU memory. This effectively “shaves off“ the CPU and system memory overhead, allowing for near-line-rate data transfers directly to the GPU.
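A complementary hedged check to the gdscheck sketch above is whether the GDS kernel driver is loaded and counting I/O on the node (the module name and proc path are as commonly shipped, but may vary by GDS release):
# The nvidia-fs kernel module implements the GDS DMA path
lsmod | grep nvidia_fs
# Per-I/O statistics exposed by the GDS driver, if present
cat /proc/driver/nvidia-fs/stats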
Incorrect: B. Reducing the number of GPUs used in the training job This does not address the efficiency of the data path. While it might slightly lower the aggregate data demand, it leaves the underlying bottleneck intact and results in underutilized hardware. In an AI infrastructure, the goal is to maximize the workload capacity, not shrink the job to fit a poorly optimized storage path.
C. Installing a faster web browser on the head node The performance of a web browser on a management node has no impact on the Data Plane or the hardware-level communication between storage and compute nodes. This is a management-layer distractor that does not affect I/O throughput or GPU idling issues.
D. Switching from NVMe-oF to legacy NFS over a 1GbE network This would significantly degrade performance.
Protocol: NVMe-oF is designed specifically for low-latency, high-speed flash storage. Legacy NFS (Network File System) introduces massive protocol overhead.
Fabric: A 1GbE (Gigabit Ethernet) management network provides roughly 125 MB/s of theoretical bandwidth, which is several orders of magnitude slower than the 100Gbps to 400Gbps fabrics (InfiniBand/Ethernet) required to feed modern NVIDIA GPUs.
Question 55 of 60
55. Question
In the context of AI infrastructure, the configuration of the BlueField network platform often involves the use of DOCA (Data-Center Infrastructure-on-a-Chip Architecture). What is the primary role of the DOCA telemetry service when managing the physical and logical health of an AI factory network?
Correct
Correct: B To provide real-time visibility into network traffic and hardware performance.
The Technical Reason: DOCA Telemetry is a core service within the NVIDIA DOCA framework designed to collect, aggregate, and export high-fidelity data from the BlueField DPU. It monitors hardware counters, network flows, and system utilization without impacting the performance of the host CPU.
The NCP-AII Context: In an “AI Factory“ (large-scale H100/B200 clusters), identifying congestion or hardware degradation is critical. DOCA Telemetry provides the granular visibility needed for tools like NVIDIA NetQ or third-party collectors to ensure the InfiniBand or Ethernet fabric is performing at the required non-blocking speeds.
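A hedged sketch of confirming the service is up on the DPU Arm side, assuming a standard BlueField DOCA image where services run as containers visible to crictl (container names and paths vary by DOCA release):
# List running DOCA service containers on the DPU and look for the telemetry service
crictl ps | grep -i telemetry
# The telemetry service configuration and exporter output typically live under the DOCA services tree
ls /opt/mellanox/doca/services/telemetry/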
Incorrect: A. To replace the need for physical fiber optic cable inspections Telemetric data can indicate a “flapping“ link or high error rates (CRC errors) which suggest a cable issue, but it cannot physically inspect the glass or connectors for dust or damage. Physical maintenance and layer-1 inspections remain a separate manual or specialized hardware process.
C. To automatically overclock the GPU cores when training starts The DOCA framework resides on the BlueField DPU, not the GPU. While the DPU manages the data path to the GPU, it does not control the clock speeds or voltage of the GPU cores. GPU power and clock management are handled by the NVIDIA driver and management tools like nvidia-smi or NVML.
D. To manage the power distribution units in the data center rack Power Distribution Units (PDUs) are managed via standard data center protocols (like SNMP or Redfish) through the BMC (Baseboard Management Controller) or dedicated rack management software. DOCA is focused on the Infrastructure-on-a-Chip (network, storage, and security), not the facility-level power hardware.
Question 56 of 60
56. Question
During the verification phase, a technician runs the High-Performance Linpack (HPL) benchmark on a single node. The results show that the node is achieving only 60 percent of its theoretical peak GFLOPS. What is the most likely cause of this performance discrepancy in a newly installed NVIDIA HGX system?
Correct
Correct: A The GPU power limits are set to their minimum values in the driver, or the system is experiencing thermal throttling due to insufficient data center cooling.
The Technical Reason: HPL is a computationally intensive benchmark that pushes GPUs to their maximum power draw to achieve peak GFLOPS. If the NVIDIA Management Library (NVML) or the driver has capped the power limit (e.g., set to 150W instead of 450W+), the GPU cannot reach the clock speeds necessary for peak performance. Similarly, if the data center‘s air or liquid cooling is inadequate, the GPUs will automatically “thermal throttle“ (reduce clock speeds) to prevent hardware damage, resulting in a significant performance drop.
The NCP-AII Context: In a newly installed HGX system, verifying the Power Limit and monitoring temperatures via nvidia-smi are the first troubleshooting steps taught for sub-par synthetic benchmark results.
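A minimal sketch of those first troubleshooting steps (the wattage shown when re-raising the limit is a placeholder; use the board's rated value reported by power.max_limit):
# Compare the configured power limit against the board defaults and maximum
nvidia-smi --query-gpu=index,power.limit,power.default_limit,power.max_limit --format=csv
# Full power readout for one GPU, including any enforced caps
nvidia-smi -q -d POWER -i 0
# If the limit was capped, restore it (requires root; do not exceed power.max_limit)
sudo nvidia-smi -i 0 -pl 700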
Incorrect: B. The system is using a standard Ethernet cable for management Management network speeds (typically 1GbE) are used for “Out-of-Band“ tasks like IPMI or SSH. HPL on a single node runs entirely within the server‘s internal PCIe/NVLink fabric and system memory. The speed of the external management cable has no impact on the internal floating-point calculation capabilities of the GPUs.
C. HPL is an outdated benchmark that is not supported on NVIDIA hardware This is factually incorrect. NVIDIA provides highly optimized versions of HPL specifically for CUDA-enabled GPUs (often via the NVIDIA HPC Container). HPL remains the industry standard for ranking the TOP500 supercomputers and is the primary tool used to verify the health of new AI infrastructure.
D. The technician forgot to install the sound card drivers This is a “nonsense“ distractor. GPU synchronization (especially in an HGX baseboard) is handled via NVLink and hardware-level clock signals. Acoustic signals or sound card drivers have no role in the synchronization of high-performance compute clusters.
Question 57 of 60
57. Question
During the configuration of an AI cluster using Base Command Manager, the administrator must define 'Categories' for the compute nodes. What is the purpose of using Categories in this context?
Correct
Correct: A To apply consistent software images, kernel parameters, and configuration settings to groups of nodes with similar hardware or roles.
The Technical Reason: In BCM, a "Category" acts as a template or functional blueprint. Instead of configuring 1,000 nodes individually, an administrator defines a Category (e.g., "GPU_Compute_Nodes") and assigns it a specific Software Image, Kernel Parameters, and Role-based services. Any node assigned to that Category automatically inherits those properties.
The NCP-AII Context: This is the foundational method for ensuring "Configuration as Code" and maintaining uniformity across an AI Factory. It allows rapid repurposing of hardware: for example, a group of nodes can be shifted from a "Training" category to an "Inference" category simply by changing their assignment.
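To make the mechanism concrete, here is a minimal cmsh sketch, assuming a stock BCM installation; the category name, software image name, and node range are placeholders, and exact syntax may differ slightly between BCM versions:
# Clone the default category, bind a software image to it, and save (names are illustrative):
cmsh -c "category; clone default gpu-compute; set softwareimage gpu-image; commit"
# Assign a range of nodes to the new category so they inherit its settings:
cmsh -c "device; foreach -n node001..node008 (set category gpu-compute); commit"
Reassigning nodes to a different category later (for example, an inference category) follows the same pattern; the nodes typically pick up the new image and settings after a re-provision or reboot.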
Incorrect: B. To specify which nodes are allowed to connect to the public internet. While Categories can influence network configuration scripts, network isolation (public vs. InfiniBand) is primarily handled through network interfaces, subnets, and firewall/security groups within the cluster's network topology. Categories are broad configuration templates, whereas internet restriction is a specific security and routing policy.
C. To group nodes by their physical color and rack location. This is a distractor. While BCM allows you to track "Racks" and "Chassis" for physical mapping, "Categories" are logical grouping mechanisms for software and OS-level configuration. The physical appearance or color of a node has no bearing on its functional category or cluster management.
D. To determine the pricing tier for each node in a multi-tenant cloud. Base Command Manager is an infrastructure management tool, not a billing or ERP system. While a cloud provider might use the logical groups defined in BCM to help organize billing, the primary role of Categories within the tool itself is technical configuration management, not financial tiering.
Question 58 of 60
58. Question
In the context of AI infrastructure, the configuration of the BlueField network platform often involves the use of DOCA. What is the primary role of the DOCA telemetry service when managing the physical and logical health of an AI factory network?
Correct
Correct: C To provide real-time visibility into network traffic and hardware performance.
The Technical Reason: The DOCA Telemetry Service (DTS) is designed to collect data from various hardware and software sources within the BlueField DPU and the host. It provides high-fidelity, real-time streaming of network statistics, hardware counters, and system health metrics.
The NCP-AII Context: In a massive AI cluster, congestion or a single failing link can degrade the performance of thousands of GPUs. DOCA Telemetry allows administrators to visualize the fabric's health and detect bottlenecks without consuming the host CPU cycles that should be dedicated to AI training.
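As a rough illustration only, a minimal sketch of pulling metrics from a running DTS instance, assuming the service has been deployed on the DPU with a Prometheus-compatible endpoint enabled; the address and port are placeholders to be taken from the local DTS configuration, not defaults:
# Query the DTS metrics endpoint from a management host and show a few counters:
curl -s http://<dpu-oob-ip>:<dts-port>/metrics | grep -iE 'rx|tx|temp' | head
In practice these metrics are scraped by a Prometheus-style collector or forwarded to an aggregation pipeline rather than read by hand, which is what provides the cluster-wide visibility described above.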
Incorrect: A. To replace the need for physical fiber optic cable inspections. While telemetry data can flag high error rates or "link down" events that suggest a cable failure, it cannot physically inspect the hardware. Dirty connectors or cracked glass fibers still require manual inspection and physical maintenance tools. Telemetry identifies the symptom, but it doesn't replace the physical check.
B. To automatically overclock the GPU cores when training starts. DOCA runs on the BlueField DPU's Arm cores, not the GPU. While the DPU manages the data path to the GPU, it does not control GPU clock speeds or voltages; GPU performance tuning is handled by the NVIDIA driver and tools such as nvidia-smi or NVML (NVIDIA Management Library).
D. To manage the power distribution units in the data center rack. Power Distribution Units (PDUs) are facility-level hardware managed over independent management networks (usually via SNMP, Redfish, or IPMI). DOCA is an "Infrastructure-on-a-Chip" framework focused on networking, storage, and security acceleration, not the electrical infrastructure of the rack.
Question 59 of 60
59. Question
A cloud service provider needs to partition a single NVIDIA H100 GPU to serve multiple tenants with guaranteed Quality of Service (QoS) and isolated memory resources. Which technology should be configured, and what is a primary limitation that must be considered during the setup?
Correct
Correct: C Configure MIG (Multi-Instance GPU); the limitation is that once a GPU is partitioned, the individual instances cannot be dynamically resized without resetting the GPU.
The Technical Reason: Multi-Instance GPU (MIG) is the only technology listed that provides hardware-level isolation for both compute (Streaming Multiprocessors) and memory (crossbar, cache, and high-bandwidth memory). This ensures that a "noisy neighbor" on one partition cannot impact the performance or access the data of another.
The Limitation: In the current H100 architecture, MIG configurations are "static" in the sense that if you want to change a 3g.40gb instance to two 1g.10gb instances, you must destroy the existing instances and recreate them. This often requires stopping all active workloads on that physical GPU and, in some driver versions, resetting the GPU state.
The NCP-AII Context: The exam tests your ability to distinguish between logical sharing and physical partitioning. For cloud service providers (CSPs) requiring strict SLA and QoS guarantees, MIG is the standard recommendation.
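As an illustration of the static nature of MIG partitioning, a minimal nvidia-smi sketch for an 80 GB H100; the GPU index and chosen profiles are examples, and the profiles actually available should be confirmed with 'nvidia-smi mig -lgip':
# Enable MIG mode on GPU 0 (the GPU must be idle; a reset may be required):
sudo nvidia-smi -i 0 -mig 1
# Create one 3g.40gb and one 1g.10gb GPU instance, plus their default compute instances:
sudo nvidia-smi mig -i 0 -cgi 3g.40gb,1g.10gb -C
# List the resulting GPU instances:
sudo nvidia-smi mig -i 0 -lgi
# "Resizing" means tearing down and recreating: destroy compute instances, then GPU instances:
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi
Because the destroy steps fail while workloads still hold the instances, tenants on that physical GPU must be drained first, which is exactly the limitation described in the correct answer.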
Incorrect: A. Configure NVLink Switch. NVLink Switch is a fabric technology used to connect multiple physical GPUs so they can act as one giant accelerator. It is not used to partition a single GPU into smaller slices for tenants. While it facilitates communication, it does not provide the memory isolation requested in the scenario.
B. Configure MPS (Multi-Process Service). While MPS allows multiple processes (or tenants) to share a GPU, it is a software-level solution.
The Flaw: MPS processes share the same underlying hardware resources and memory address space. It does not provide the guaranteed QoS or hardware-level memory isolation required for secure multi-tenancy in a cloud environment. If one MPS process crashes or over-utilizes resources, it can impact others.
D. Configure vGPU profiles. NVIDIA vGPU is a virtualization technology that allows virtual machines (VMs) to share a GPU. While vGPU can use MIG as a backend (MIG-backed vGPU), the limitation described (requiring a license for every CPU core) is factually incorrect. NVIDIA vGPU licensing is typically based on the number of concurrent users or concurrent VMs, not the physical CPU core count of the host.
Question 60 of 60
60. Question
A technician is using ClusterKit to perform a multifaceted node assessment on a set of NVIDIA DGX nodes. One of the tests fails with a 'cable signal quality' error. Which physical components and diagnostic tools should be used to resolve this issue and verify the fix?
Correct
Correct: C Inspect the OSFP/QSFP transceivers and fiber for contamination, use 'mlxlink' to check the link health, and re-run the NCCL bandwidth test.
The Technical Reason: A "cable signal quality" error typically points to a Physical Layer (Layer 1) issue. In high-speed optics (200G/400G+), even microscopic dust on a transceiver or fiber end-face can cause signal attenuation or bit errors.
The Tools:
mlxlink: This is the specific NVIDIA tool (part of the MFT – Mellanox Firmware Tools) used to check the physical link status, lane speeds, Bit Error Rate (BER), and eye diagram signal quality.
NCCL Bandwidth Test: Once the physical fix is applied, the NVIDIA Collective Communications Library (NCCL) tests are the standard for verifying that the "East-West" fabric bandwidth has returned to its expected non-blocking performance levels.
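A hedged sketch of that workflow; the MST device name, host names, and nccl-tests binary path are placeholders, and flag spellings can vary between MFT releases:
# Start the MST service, identify the adapter, then dump link, counter, module, and eye data:
sudo mst start && sudo mst status
sudo mlxlink -d /dev/mst/mt4129_pciconf0 -c -m -e
# After cleaning or reseating the optics, re-verify fabric bandwidth with nccl-tests
# (one rank per node, eight GPUs per node in this example):
mpirun -np 2 -H node01,node02 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8
A healthy link reports the negotiated speed with a bit error rate in the expected range and no accumulating symbol errors, and the NCCL bus bandwidth should return to the node pair's expected baseline.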
Incorrect: A. Update the NGC CLI and use 'ngc diag'. The NGC CLI is used for managing containers, models, and registry resources; it is not a hardware diagnostic tool for physical switches. While InfiniBand switches can be managed remotely, 'ngc diag' is not the tool used for signal timing recalibration. Physical signal issues almost always require physical inspection before software adjustments.
B. Re-flash the BMC and use 'nvidia-smi -r'. The BMC (Baseboard Management Controller) manages server power and cooling, and nvidia-smi -r resets the GPU. Neither of these actions addresses a signal quality issue in the external network cabling or transceivers. The NVLink fabric is internal to the HGX/DGX baseboard; "cable signal quality" usually refers to the external InfiniBand/Ethernet fabric.
D. Disable the BlueField-3 DPU and use a standard Cat6 cable. This is factually incorrect and physically unworkable for AI workloads.
Fabric Speed: A Cat6 cable supports 1GbE or 10GbE, which is 20 to 400 times slower than the 200/400 Gb/s links required by DGX nodes.
Architecture: You cannot "bridge" the GPU baseboard to a storage controller with a management-grade Ethernet cable to solve a high-speed fabric bottleneck. The BlueField DPU is a critical component of the accelerated data path and should not be disabled.