Your results for "NVIDIA NCP-AIN Practice Test 2"
Question 1 of 60
What is the primary purpose of monitoring Priority Flow Control (PFC) frames when diagnosing RoCE performance issues in NVIDIA GPU clusters?
Explanation:
Priority Flow Control (PFC) is essential for RoCE performance troubleshooting because it enables lossless Ethernet by pausing traffic when buffers fill. Excessive PFC generation indicates network congestion that degrades RDMA performance. Monitoring PFC frame counts, pause durations, and affected priority classes helps identify bottlenecks in GPU cluster networks using RoCE for inter-node communication, which is critical for distributed training workloads.
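In practice, the PFC counters described above are read from the NIC (for example via `ethtool -S`) and compared over time. A minimal parsing sketch is below; the `rx_prioN_pause`/`tx_prioN_pause` counter names are mlx5-style examples and vary by NIC and driver, so treat them as assumptions to verify on your hardware.

```python
import re

def pfc_pause_counts(ethtool_output: str) -> dict:
    """Extract per-priority PFC pause-frame counters from `ethtool -S` text.

    Counter names (rx_prio3_pause etc.) are illustrative mlx5-style names,
    not guaranteed across NICs or driver versions.
    """
    counts = {}
    for line in ethtool_output.splitlines():
        m = re.match(r"\s*(rx|tx)_prio(\d)_pause:\s+(\d+)", line)
        if m:
            direction, prio, value = m.group(1), int(m.group(2)), int(m.group(3))
            counts[(direction, prio)] = value
    return counts

sample = """
     rx_prio3_pause: 12450
     tx_prio3_pause: 87
     rx_prio0_pause: 0
"""
counts = pfc_pause_counts(sample)
# A rapidly growing rx pause count on the RoCE priority (often priority 3)
# indicates a congested device downstream of this port.
print(counts[("rx", 3)])
```

Sampling these counters at intervals and diffing them is usually more informative than the absolute values, since it localizes when and where congestion occurs.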
Question 2 of 60
Your multi-node H100 cluster experiences degraded AllReduce performance during LLM training, with profiling showing uneven packet arrival times causing idle GPUs. Which configuration technique most effectively addresses head-of-line blocking in this NCCL-based distributed training environment?
Explanation:
Head-of-line blocking in multi-node GPU training occurs when congested network flows block unrelated traffic at switch buffers. Priority Flow Control (PFC) on the Ethernet fabric provides lossless transport by pausing specific priority classes during congestion, preventing packet drops while allowing other flows to proceed. This is critical for NCCL's latency-sensitive collective operations, where GPU synchronization depends on predictable network behavior across distributed training workloads.
Question 3 of 60
What is the primary purpose of deploying NetQ agents in a network infrastructure?
Explanation:
NetQ agents serve as the data collection layer in NetQ architecture. Installed on switches, hosts, and network devices, they gather comprehensive telemetry including interface states, routing information, system resources, and protocol status. This data flows to the NetQ Platform where it enables real-time monitoring, historical analysis, network validation, and troubleshooting capabilities across the entire network infrastructure.
Question 4 of 60
What is the primary purpose of performance counters in UFM (Unified Fabric Manager) monitoring?
Explanation:
UFM performance counters collect and track critical network fabric metrics including throughput statistics (bandwidth utilization, packet rates) and error conditions (CRC errors, symbol errors, link failures) across InfiniBand switches, adapters, and cables. These counters enable real-time monitoring of network health, bottleneck identification, and proactive troubleshooting in AI infrastructure where efficient multi-node GPU communication via InfiniBand is essential for distributed training workloads.
Question 5 of 60
Your team is training a 70B parameter LLM across 64 H100 GPUs distributed over 8 DGX nodes connected via HDR InfiniBand. Communication overhead from gradient synchronization is creating bottlenecks during all-reduce operations. When would SHARP in-network reduction provide the most benefit for this distributed training workload?
Explanation:
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) performs in-network reduction by executing collective operations directly within InfiniBand switches rather than at endpoints. This is most effective for large gradient tensors in multi-node training where network bandwidth becomes the bottleneck. For 64 GPUs across 8 nodes, SHARP reduces all-reduce traffic by performing partial reductions in-flight, minimizing data movement and improving training throughput compared to traditional host-based aggregation.
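A back-of-envelope comparison makes the benefit concrete: a classic ring all-reduce needs 2*(N-1) sequential communication steps, while an in-network tree reduction completes in one up-and-down traversal of the switch hierarchy, whose depth grows only logarithmically. The radix-2 tree below is illustrative, not the actual SHARP tree shape.

```python
import math

def ring_allreduce_steps(n_gpus: int) -> int:
    # Ring all-reduce: (N-1) reduce-scatter steps + (N-1) all-gather steps.
    return 2 * (n_gpus - 1)

def tree_reduction_hops(n_gpus: int) -> int:
    # Idealized in-network reduction: one traversal up the switch tree and
    # one back down; depth is logarithmic (radix-2 assumed for illustration).
    return 2 * math.ceil(math.log2(n_gpus))

print(ring_allreduce_steps(64))   # 126 sequential steps for 64 GPUs
print(tree_reduction_hops(64))    # 12 tree hops under the radix-2 assumption
```

The gap widens with scale, which is why in-network reduction pays off most for large clusters and large gradient tensors where the network, not compute, is the bottleneck.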
Question 6 of 60
You are configuring NCCL 2.20+ for multi-node training on H100 GPUs connected via 400G RoCE v2 Ethernet fabric. Which configuration approach ensures optimal RDMA performance for GPU-to-GPU communication across nodes?
Explanation:
For NCCL over RoCE (RDMA over Converged Ethernet), optimal configuration requires three components: enabling GPUDirect RDMA for direct GPU memory access, configuring lossless Ethernet with Priority Flow Control to prevent packet drops, and using NCCL's IB verbs plugin (auto-detected for RoCE). This bypasses the CPU, achieving 5-10x faster GPU-to-GPU communication compared to socket-based approaches for distributed training workloads.
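These settings are typically applied through NCCL environment variables. The sketch below collects commonly used ones for RoCE v2; the specific values (HCA names, GID index, traffic class) are deployment-specific assumptions that must be matched to your fabric's QoS plan, not universal defaults.

```python
import os

# Hedged sketch: frequently tuned NCCL environment variables for a RoCE v2
# fabric. Values below are placeholders for illustration only.
roce_env = {
    "NCCL_IB_HCA": "mlx5_0,mlx5_1",  # restrict NCCL to the RoCE-capable HCAs (assumed names)
    "NCCL_IB_GID_INDEX": "3",        # GID table entry for RoCE v2 (often 3, verify per host)
    "NCCL_IB_TC": "106",             # traffic class mapping to the lossless priority (assumed)
    "NCCL_NET_GDR_LEVEL": "SYS",     # permit GPUDirect RDMA across the whole system topology
}

def apply_nccl_env(env: dict, target=os.environ) -> None:
    """Apply settings without clobbering values already set by the job launcher."""
    for key, value in env.items():
        target.setdefault(key, value)

apply_nccl_env(roce_env)
```

Using `setdefault` lets cluster-wide launchers (Slurm prolog scripts, container entrypoints) override individual values without the application fighting them.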
Question 7 of 60
What is the primary function of bridge domains in Cumulus Linux when implementing VLAN configurations?
Explanation:
Bridge domains in Cumulus Linux are fundamental Layer 2 constructs that create isolated broadcast domains for VLAN implementations. They associate ports with VLANs, manage MAC address learning and forwarding, and segment network traffic. Understanding bridge domains is essential for configuring VLAN-aware bridges and traditional bridges in Cumulus Linux fabric architectures.
Question 8 of 60
A data center is integrating NVIDIA Cumulus NetQ with their Ethernet fabric running NCCL workloads across 64 H100 GPUs. NetQ telemetry shows intermittent packet loss during AllReduce operations, but link utilization remains below 60%. What is the most critical integration component to diagnose this performance degradation?
Explanation:
Integrating NetQ with Ethernet fabrics supporting NCCL requires event-level visibility into transient congestion patterns. What Just Happened (WJH) provides hardware-accelerated packet drop telemetry correlated with precise timestamps, essential for diagnosing microbursts during synchronized GPU AllReduce operations. Unlike steady-state monitoring (link state, utilization), WJH captures ephemeral events causing packet loss despite low average utilization. This integration enables correlation between NCCL collective timing and switch buffer exhaustion, directing optimization efforts to buffer tuning, ECN configuration, or traffic scheduling adjustments specific to GPU-synchronized traffic patterns.
Question 9 of 60
You are configuring an 8-node H100 cluster with HDR InfiniBand for multi-node LLM training using NCCL. Each node requires low-latency GPU-to-GPU communication across the fabric with GPUDirect RDMA. When would you use Queue Pairs (QP) for connection and communication setup in this scenario?
Explanation:
Queue Pairs (QP) are the fundamental building blocks of RDMA over InfiniBand, establishing dedicated communication channels between endpoints. In multi-node GPU clusters, QPs enable GPUDirect RDMA by creating direct memory access paths between GPUs across nodes, bypassing CPU intervention. NCCL leverages QPs to implement efficient collective operations (all-reduce, all-gather) for distributed training, with each GPU pair maintaining QPs for low-latency data transfers essential for scaling LLM training across multiple nodes.
Question 10 of 60
A data center fabric uses BGP EVPN with multiple spine switches advertising the same VXLAN routes to leaf switches. When would you configure BGP best path selection with the AS-path attribute to optimize traffic forwarding in this multi-path environment?
Explanation:
BGP best path selection using AS-path attribute is most appropriate when requiring deterministic routing decisions based on autonomous system path length. In data center fabrics, AS-path manipulation (like prepending) creates predictable traffic patterns by making certain paths less preferred. This differs from ECMP which requires equal AS-path lengths, local metric-based selection which uses IGP metrics or MED, and origin-based selection which evaluates route source type separately in the best path algorithm.
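The AS-path comparison step can be modeled in a few lines. The sketch below isolates just that one tie-breaker (ignoring the earlier stages of the best-path algorithm such as weight and local preference); the ASNs and next-hop names are illustrative.

```python
def prefer_by_as_path(routes):
    """Pick the route with the shortest AS path -- one step of BGP best-path
    selection, with earlier tie-breakers (weight, local-pref) assumed equal."""
    return min(routes, key=lambda r: len(r["as_path"]))

# Two spines advertise the same VXLAN prefix; spine2 prepends its own ASN
# twice to make its path less preferred (illustrative ASNs).
routes = [
    {"next_hop": "spine1", "as_path": [65001, 65100]},
    {"next_hop": "spine2", "as_path": [65002, 65002, 65002, 65100]},
]
best = prefer_by_as_path(routes)
print(best["next_hop"])  # spine1 wins on the shorter AS path
```

This is exactly why prepending creates deterministic forwarding: once path lengths differ, ECMP no longer applies and all traffic follows the shorter-path spine.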
Question 11 of 60
A distributed training cluster with 16 DGX H100 nodes experiences uneven InfiniBand link utilization, with some paths at 85% while others remain at 30% during NCCL all-reduce operations. Which Adaptive Routing configuration achieves optimal traffic distribution across all available paths?
Explanation:
Adaptive Routing with adaptive-routing linear forwarding tables (AR-LFT) provides dynamic load balancing by monitoring real-time port congestion and selecting optimal paths for each packet flow. During NCCL all-reduce operations in multi-node training, traffic patterns vary significantly, creating hotspots on specific links. AR-LFT's congestion-aware routing distributes traffic across all available paths, achieving balanced utilization. This contrasts with static routing approaches that cannot adapt to runtime conditions.
Question 12 of 60
A network administrator deploys NetQ in a multi-site data center environment with 500 switches. The NetQ agents are collecting telemetry data, but the NetQ collector shows intermittent data gaps and delayed metric updates during peak traffic hours. What is the MOST critical architectural component to analyze for this integration issue?
Explanation:
NetQ architecture relies on agents pushing telemetry data to centralized collectors over the network. With 500 switches generating concurrent telemetry streams, the agent-to-collector communication channel becomes the critical bottleneck during peak hours. Intermittent gaps correlating with traffic peaks indicate insufficient bandwidth or network congestion affecting telemetry transport. Proper NetQ integration requires dedicated management network bandwidth or QoS policies to ensure reliable agent-collector communication. The transport layer between distributed agents and centralized collector is the most critical architectural component to analyze when troubleshooting scale-related telemetry delivery issues.
Question 13 of 60
A 256-GPU AI cluster using fat-tree topology experiences 40% throughput degradation during multi-node all-reduce operations, despite each GPU achieving line-rate on single-node NVLink tests. Spine switches show balanced utilization, but leaf-to-spine links exhibit asymmetric traffic patterns. What is the most likely cause?
Explanation:
Non-blocking fat-tree designs require a 1:1 oversubscription ratio, where leaf-to-spine bandwidth equals total leaf downlink bandwidth. AI workloads generate symmetric all-to-all traffic during NCCL collectives, requiring full bisection bandwidth. Oversubscribed uplinks (e.g., 2:1 or 3:1 ratios) create blocking when multiple leaf switches simultaneously transmit through shared spine capacity. Asymmetric leaf-to-spine patterns with balanced spine utilization confirm bandwidth contention at the aggregation layer, not routing or protocol issues.
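The oversubscription ratio is simple arithmetic: total downlink bandwidth divided by total uplink bandwidth per leaf. The port counts and speeds below are illustrative, not taken from the scenario.

```python
from math import gcd

def oversubscription_ratio(downlinks: int, down_gbps: int,
                           uplinks: int, up_gbps: int) -> str:
    """Leaf oversubscription = total downlink bandwidth : total uplink bandwidth."""
    down_total = downlinks * down_gbps
    up_total = uplinks * up_gbps
    g = gcd(down_total, up_total)
    return f"{down_total // g}:{up_total // g}"

# Illustrative leaf: 32x400G toward GPU servers but only 16x400G toward the
# spines yields 2:1 -- a blocking design for all-to-all NCCL collectives.
print(oversubscription_ratio(32, 400, 16, 400))  # 2:1
print(oversubscription_ratio(16, 400, 16, 400))  # 1:1 (non-blocking)
```

Anything worse than 1:1 means that, under full all-to-all load, some fraction of GPU traffic must queue at the leaf uplinks, which is exactly the contention pattern described above.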
Question 14 of 60
What is a key feature of the NVIDIA Spectrum SN4000 series switches that distinguishes them for AI infrastructure deployments?
Explanation:
The NVIDIA Spectrum SN4000 series switches are designed for AI infrastructure with Ethernet speeds up to 400G and optimized RoCE support. These features enable high-bandwidth, low-latency communication essential for multi-node GPU clusters. The switches facilitate efficient collective operations (AllReduce, AllGather) required in distributed training, making them foundational for scalable AI deployments.
Question 15 of 60
What is the primary purpose of Receive-Side Scaling (RSS) in ConnectX Ethernet adapters?
Explanation:
Receive-Side Scaling (RSS) is a network driver technology in ConnectX Ethernet adapters that distributes network receive processing across multiple CPU cores. By hashing packet headers and directing flows to different receive queues, RSS prevents bottlenecks on a single CPU core and enables parallel packet processing, significantly improving network throughput in multi-core systems.
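The flow-to-queue mapping can be sketched in a few lines. Note the hedge: real ConnectX hardware uses a Toeplitz hash over the packet 5-tuple with a configurable secret key; the plain Python `hash()` below only illustrates the mapping idea, not the actual hardware algorithm.

```python
# Simplified RSS model (illustrative only -- real NICs use a Toeplitz hash):
# hash the flow identifiers, then take the result modulo the queue count so
# every packet of one flow lands on the same receive queue / CPU core.
def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              n_queues: int) -> int:
    flow = (src_ip, dst_ip, src_port, dst_port)
    return hash(flow) % n_queues

# Per-flow affinity: two packets of the same flow map to the same queue,
# while distinct flows spread across the available queues.
q1 = rss_queue("10.0.0.1", "10.0.0.2", 40000, 4791, 8)
q2 = rss_queue("10.0.0.1", "10.0.0.2", 40000, 4791, 8)
print(q1 == q2)  # True: same flow, same queue
```

Keeping a flow pinned to one queue preserves in-order delivery to the stack while still letting unrelated flows use other cores in parallel.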
Question 16 of 60
Your InfiniBand fabric supports both low-latency HPC applications and bulk data transfer workloads. Which Subnet Manager feature should you configure to prioritize real-time traffic while maintaining throughput for background jobs?
Explanation:
The Subnet Manager enables QoS through Virtual Lane configuration with Service Level mappings. This creates distinct traffic classes within the InfiniBand fabric, allowing administrators to assign different priorities and bandwidth guarantees. VLs provide hardware-enforced separation, ensuring real-time HPC traffic receives priority over bulk transfers while maintaining overall fabric efficiency.
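Conceptually, the Subnet Manager programs a Service Level to Virtual Lane (SL-to-VL) table into each switch port. The tiny model below shows the lookup shape; the specific SL and VL assignments are illustrative and not an OpenSM default.

```python
# Illustrative SL -> VL table: assignments are assumptions for this sketch,
# not a real subnet manager configuration.
sl2vl = {
    0: 0,  # bulk data transfer -> low-priority VL
    1: 1,
    4: 4,  # latency-sensitive HPC traffic -> higher-priority VL
}

def vl_for_sl(sl: int, table: dict, default_vl: int = 0) -> int:
    """Resolve the Virtual Lane for a packet's Service Level."""
    return table.get(sl, default_vl)

print(vl_for_sl(4, sl2vl))  # 4
print(vl_for_sl(7, sl2vl))  # unmapped SLs fall back to VL 0 here
```

Because the table is enforced in switch hardware on every hop, a bulk transfer on VL 0 cannot starve real-time traffic mapped to a higher-priority VL.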
Question 17 of 60
A network administrator observes intermittent application timeouts on a GPU cluster running distributed training workloads. Which What Just Happened feature should be used first to identify the root cause of packet drops affecting NCCL collective operations?
Explanation:
Packet drop analysis is the optimal first diagnostic tool because it categorizes drops by specific reasons (buffer congestion, ACL violations, routing failures, TTL expiration), enabling immediate root cause identification. For GPU cluster workloads where NCCL requires low-latency, lossless communication, understanding exact drop mechanisms—not just that drops occurred—is critical for rapid resolution of intermittent timeout issues.
Question 18 of 60
An administrator needs to verify InfiniBand link state on GPU nodes before starting a multi-node LLM training job on an H100 cluster. The verification must confirm active connectivity and identify any degraded links across 16 nodes. Which approach most effectively uses ibstatus for this inspection?
Explanation:
The ibstatus command displays InfiniBand adapter status including link state through its 'state:' (logical state showing Active/Down/Initializing) and 'phys state:' (showing LinkUp/LinkDown) fields. For pre-training verification across multiple nodes, administrators should parse these fields to confirm all ports report an Active logical state and a LinkUp physical state, identifying any degraded links that would impact NCCL communication during distributed training. The command provides point-in-time snapshots without additional flags for continuous monitoring or quality metrics.
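As a sketch of that pre-training sweep (the `check_ib_links` helper, the node01-node16 hostnames, and passwordless SSH are assumptions, not part of ibstatus itself):

```shell
# Flag any InfiniBand port that is not Active/LinkUp.
# Reads `ibstatus` output on stdin; real output looks like:
#   Infiniband device 'mlx5_0' port 1 status:
#           state:           4: ACTIVE
#           phys state:      5: LinkUp
check_ib_links() {
  awk '
    /^Infiniband device/ { dev = $3 " port " $5 }
    /phys state:/        { if (!/LinkUp/) print dev, "phys:", $NF; next }
    /state:/             { if (!/ACTIVE/) print dev, "state:", $NF }
  '
}

# Sweep all 16 nodes (hostnames are placeholders):
#   for h in node{01..16}; do ssh "$h" ibstatus | check_ib_links; done
```

No output means every port is Active/LinkUp; any printed line names a degraded link before the training job starts.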
Question 19 of 60
19. Question
What is RDMA Read/Write in the context of RDMA over InfiniBand?
Correct
RDMA Read and Write are one-sided communication operations fundamental to InfiniBand's performance advantages. Unlike traditional two-sided operations, which require sender and receiver coordination, RDMA Read/Write lets one node directly access remote memory without involving the remote CPU. This enables zero-copy data transfers with sub-microsecond latencies, critical for GPU-to-GPU communication in distributed AI training workloads.
Question 20 of 60
20. Question
What is the primary purpose of baseline establishment in UFM Cyber-AI?
Correct
Baseline establishment in UFM Cyber-AI is the process of learning and profiling normal network behavior patterns. By observing typical operations, traffic flows, and device interactions over time, it creates a behavioral baseline. This learned baseline serves as the reference point for anomaly detection, enabling UFM Cyber-AI to identify deviations that may indicate security threats or operational issues.
Question 21 of 60
21. Question
What is the primary purpose of Virtual Lanes (VLs) in InfiniBand architecture?
Correct
Virtual Lanes (VLs) are traffic separation mechanisms in InfiniBand that create multiple logical channels within a single physical link. Their primary purpose is to prevent head-of-line blocking by isolating different traffic classes (e.g., management, data, GPU communication). This ensures high-priority traffic maintains low latency even when lower-priority traffic is congested, providing Quality of Service guarantees critical for multi-tenant AI training environments.
Question 22 of 60
22. Question
A network administrator configures NVUE settings on a Cumulus Linux switch and needs to ensure configurations persist across unplanned reboots and software upgrades. Which approach provides automatic configuration persistence without manual intervention after each change?
Correct
Configuration persistence in Cumulus Linux NVUE requires explicitly committing changes: 'nv config apply' commits staged changes to the running configuration, and 'nv config save' writes the running configuration to the startup configuration so that applied settings survive reboots and upgrades without further manual intervention. Unlike network operating systems with auto-save features, Cumulus requires intentional configuration commits, giving administrators control over which changes become persistent while preventing accidental configuration loss.
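A minimal sketch of that workflow (a hedged device-CLI fragment, not runnable off-switch; the interface name and MTU value are placeholders):

```shell
nv set interface swp1 link mtu 9000   # stage a change (pending revision)
nv config diff                        # review pending vs. applied config
nv config apply                       # commit the pending revision to the running config
nv config save                        # persist the running config to startup
```

Without the final save step, applied settings live only in the running configuration.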
Question 23 of 60
23. Question
A research team is deploying a multi-node H100 GPU cluster for LLM training with 400GbE RoCE v2 networking. They experience 15% throughput degradation during NCCL AllReduce operations compared to expected performance. Which MTU configuration best addresses this for AI workloads?
Correct
Jumbo frames with 9000-byte MTU are critical for high-speed Ethernet in AI clusters, reducing packet overhead for large tensor transfers during NCCL operations. End-to-end configuration across NICs, switches, and storage ensures no fragmentation occurs. This directly improves multi-GPU training throughput by 10-20% compared to standard MTU, essential for 400GbE RoCE v2 environments with H100 clusters.
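A small sketch of the verification arithmetic (interface and peer address are placeholders; the `ip link` step needs root):

```shell
mtu=9000
icmp_payload=$((mtu - 20 - 8))   # subtract IPv4 (20 B) and ICMP (8 B) headers
echo "verify with: ping -M do -s $icmp_payload -c 3 <peer-ip>"
# Apply per NIC once the whole path (NICs, switches, storage) is at 9000:
#   ip link set dev eth0 mtu 9000
```

The -M do flag sets the don't-fragment bit, so the ping fails loudly if any hop in the path is still at a 1500-byte MTU.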
Question 24 of 60
24. Question
What are the available installation options for deploying Unified Fabric Manager (UFM) in an InfiniBand network infrastructure?
Correct
UFM deployment options include hardware appliance, virtual machine, and containerized installations. These flexible deployment models accommodate different infrastructure requirements while maintaining centralized management of InfiniBand networks. The choice depends on existing infrastructure, scalability needs, and operational preferences for managing high-performance computing and AI fabric environments.
Question 25 of 60
25. Question
A multi-node H100 training cluster experiences congestion on specific InfiniBand paths during large AllReduce operations. Which Adaptive Routing technology should be implemented to dynamically select optimal paths and balance traffic across the fabric?
Correct
InfiniBand Adaptive Routing provides dynamic path selection by monitoring real-time congestion metrics from switch buffer occupancy. This per-packet routing decision enables traffic to avoid congested paths during intensive NCCL AllReduce operations common in multi-node training. Static routing lacks runtime adaptability, NVLink operates only within nodes, and NCCL topology awareness complements but doesn't replace network-layer adaptive routing functionality.
Question 26 of 60
26. Question
A GPU cluster experiences intermittent AllReduce timeouts during multi-node LLM training. What Just Happened logs show "L1_EXCEPTION" events on specific H100 GPUs with dropped packets on NVLink ports. What is the most likely root cause?
Correct
What Just Happened (WJH) telemetry provides real-time root cause analysis for network and interconnect issues. L1_EXCEPTION events specifically indicate physical layer problems on NVLink interconnects—the 900GB/s GPU-to-GPU links critical for multi-GPU training. Dropped packets on NVLink ports during NCCL AllReduce operations point directly to cable degradation or connection issues. This is distinguished from network-level problems (which would show InfiniBand telemetry) or software issues (which wouldn't generate physical layer exceptions). Proper WJH analysis requires correlating event types with affected hardware components.
Question 27 of 60
27. Question
A DGX H100 node experiences intermittent 25% throughput drops during multi-node NCCL AllReduce operations over 400GbE RoCE. Running 'ethtool -S eth0' shows rx_crc_errors incrementing steadily while tx_packets matches the application send rate. What is the most likely cause?
Correct
The incrementing rx_crc_errors counter in ethtool statistics definitively indicates physical layer corruption during packet reception. CRC (Cyclic Redundancy Check) errors occur when the computed checksum doesn't match the transmitted checksum, signaling bit-level corruption during transmission—typically from damaged cables, dirty connectors, signal attenuation, or electromagnetic interference. With tx_packets normal but rx_crc_errors climbing, the issue isolates to the receive path physical infrastructure. This causes the 25% throughput degradation as TCP or RoCE lossless protocols retransmit corrupted packets, consuming bandwidth without productive data transfer during NCCL operations.
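A sketch of how such a counter check might be scripted (the `crc_count` helper is an assumption; it only parses `ethtool -S` output):

```shell
# Print the rx_crc_errors value from `ethtool -S` output on stdin.
crc_count() { awk -F': *' '/rx_crc_errors/ { print $2 }'; }

# Usage (eth0 is a placeholder): sample twice and compare the delta.
#   a=$(ethtool -S eth0 | crc_count); sleep 10; b=$(ethtool -S eth0 | crc_count)
#   echo "new CRC errors in 10 s: $((b - a))"
```

A steadily growing delta points at the receive-path cable, connector, or transceiver rather than at software.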
Question 28 of 60
28. Question
What is the primary purpose of the perfquery tool in InfiniBand fabric management?
Correct
perfquery is the standard InfiniBand diagnostic tool for retrieving hardware performance counters from HCA and switch ports. It displays critical metrics including packet counts, error rates, and bandwidth utilization essential for troubleshooting multi-GPU training clusters. Understanding perfquery output helps identify network bottlenecks affecting NCCL collective operations and GPUDirect RDMA efficiency in distributed AI workloads.
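For example, counters might be pulled and filtered like this (the `pq_errors` helper is an assumption; perfquery itself takes an optional LID and port):

```shell
# Print only nonzero error/discard counters from perfquery output on stdin.
# perfquery pads its lines with dots, e.g.  SymbolErrorCounter:..............45
pq_errors() { awk -F'[:.]+' '/Error|Discard/ && $NF + 0 > 0 { print $1 "=" $NF }'; }

# Usage (LID 12, port 1 are placeholders):
#   perfquery 12 1 | pq_errors     # 32-bit port counters
#   perfquery -x 12 1              # extended 64-bit counters
```

Nonzero symbol or receive-error counters on a port typically correlate with the link problems that degrade NCCL collectives.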
Question 29 of 60
29. Question
A multi-node H100 cluster uses InfiniBand HDR networking for distributed LLM training. The infrastructure team needs centralized visibility into fabric topology, link utilization, and port errors across 128 GPUs to optimize NCCL collective performance. Which approach achieves unified fabric management?
Correct
UFM (Unified Fabric Manager) is NVIDIA's purpose-built solution for centralized InfiniBand fabric management in GPU clusters. It provides topology discovery, real-time health monitoring, performance analytics, and automated issue detection across the entire fabric. For distributed training workloads using NCCL with GPUDirect RDMA, UFM correlates fabric metrics with GPU communication patterns to identify bottlenecks and optimize collective operation performance.
Question 30 of 60
30. Question
What is the primary purpose of 400G/800G Ethernet in AI infrastructure?
Correct
400G/800G Ethernet provides next-generation network bandwidth critical for multi-node AI workloads. As AI models scale beyond single-node capacity (e.g., training LLMs >200B parameters), efficient inter-node communication becomes essential. These high-speed Ethernet standards support distributed training operations like NCCL AllReduce across nodes, reducing network bottlenecks that would otherwise limit cluster scalability and training efficiency.
Question 31 of 60
31. Question
A data center architect is designing a RoCE network for multi-node GPU training with H100 clusters. The network must guarantee zero packet loss during NCCL all-reduce operations to prevent training disruptions. Which technology combination is MOST effective for ensuring lossless Ethernet?
Correct
Lossless Ethernet for RoCE requires PFC and ECN working together. PFC pauses transmission at each hop when buffers approach capacity, preventing packet drops entirely. ECN provides early congestion signaling, allowing endpoints to reduce rates before PFC activation. This combination is critical for RDMA operations in GPU training, where packet loss triggers expensive retransmissions and disrupts NCCL collective operations, significantly impacting training efficiency across multi-node H100 clusters.
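On NVIDIA Spectrum switches running Cumulus Linux, this PFC+ECN combination can be provisioned through a single RoCE knob (a hedged device-CLI sketch; exact syntax varies by release, so verify against the NVUE documentation):

```shell
nv set qos roce mode lossless   # provisions PFC plus ECN marking for RoCE traffic
nv config apply
nv show qos roce                # verify PFC priorities and ECN thresholds
```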
Question 32 of 60
32. Question
Which statement best describes the primary function of ethtool diagnostics when troubleshooting network interface card (NIC) issues in AI fabric deployments?
Correct
Ethtool is the standard Linux utility for querying and configuring Ethernet NIC parameters. It displays critical diagnostic information including link status, speed negotiation, duplex mode, hardware error counters, and driver statistics. For AI fabric troubleshooting, ethtool helps identify physical layer issues, verify proper NIC configuration, and detect packet drops or errors that could impact distributed training performance.
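As a small sketch (the `link_ok` helper is an assumption; it parses the output of `ethtool <dev>`):

```shell
# Report link state and negotiated speed from `ethtool <dev>` output on stdin.
link_ok() { awk -F': ' '/Speed:/ { s = $2 } /Link detected:/ { l = $2 } END { print l, s }'; }

# Usage (eth0 is a placeholder):
#   ethtool eth0 | link_ok      # a healthy 400G port prints: yes 400000Mb/s
```

Anything other than "yes" plus the expected speed flags a physical-layer or autonegotiation problem worth chasing before blaming the training stack.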
Question 33 of 60
33. Question
A network engineer needs to configure OSPF routing on a Cumulus Linux switch. The organization requires all configuration changes to be atomic and easily reversible. Which CLI tool should the engineer use to meet these requirements?
Correct
NVUE is the appropriate CLI tool for this scenario because it provides atomic configuration changes with built-in rollback capabilities. Unlike vtysh, which applies changes immediately, NVUE uses a staged configuration approach where changes are validated and committed as a single transaction. This allows easy reversion to previous configurations using revision control, meeting the organization's requirement for reversible changes while maintaining system consistency.
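A hedged sketch of that transactional workflow for the OSPF case (a device-CLI fragment; the router-id and interface values are placeholders, and the exact rollback syntax should be checked in the NVUE reference):

```shell
nv set router ospf router-id 10.0.0.1
nv set interface swp1 router ospf area 0
nv config diff       # inspect the staged transaction before committing
nv config apply      # validated and committed atomically as one unit
nv config history    # list prior revisions, usable for rollback
```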
Question 34 of 60
34. Question
A network engineer configures a Cumulus Linux switch with VLAN-aware bridge br0 containing ports swp1-swp8. Performance monitoring reveals 40% higher CPU utilization during peak traffic compared to baseline. Which configuration optimization would most effectively reduce CPU overhead while maintaining VLAN segmentation across all ports?
Correct
VLAN-aware bridge mode in Cumulus Linux leverages hardware ASIC capabilities to perform VLAN filtering and forwarding at wire speed, offloading these operations from the CPU. By configuring bridge-vlan-aware yes, the switch programs VLAN tables directly into hardware, eliminating software-based packet processing for VLAN operations. Removing unused VLAN IDs further optimizes hardware table utilization and reduces lookup complexity. This approach provides optimal performance for multi-VLAN environments while maintaining complete layer 2 segmentation across all ports with minimal CPU involvement.
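In ifupdown2 terms (/etc/network/interfaces), the optimized bridge might look like this sketch (the VLAN IDs 100/200/300 are placeholders for the VLANs actually in use):

```
auto br0
iface br0
    bridge-vlan-aware yes
    bridge-ports swp1 swp2 swp3 swp4 swp5 swp6 swp7 swp8
    bridge-vids 100 200 300
    bridge-pvid 1
```

A single VLAN-aware bridge with a pruned bridge-vids list keeps forwarding in the ASIC, in contrast to the traditional one-bridge-per-VLAN model that multiplies kernel state.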
Question 35 of 60
35. Question
A team is configuring multi-node LLM training on an 8-node H100 cluster with InfiniBand HDR networking. They need to optimize NCCL communication for maximum bandwidth while enabling detailed performance profiling during initial runs. Which NCCL environment variable configuration best addresses these requirements?
Correct
For H100 clusters with InfiniBand, optimal NCCL configuration requires enabling InfiniBand transport (NCCL_IB_DISABLE=0), maximizing GPUDirect RDMA capabilities (NCCL_NET_GDR_LEVEL=5), and specifying network adapters (NCCL_IB_HCA). The NCCL_DEBUG=INFO setting provides detailed profiling information during initial runs. This configuration leverages InfiniBand HDR's 200 Gbps bandwidth with GPUDirect RDMA, bypassing the CPU for GPU-to-GPU communication across nodes, critical for efficient multi-node LLM training.
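The variables named above would be exported before launching the job, for example (the mlx5_* HCA list is a placeholder for the node's actual adapters):

```shell
export NCCL_IB_DISABLE=0                      # keep the InfiniBand transport enabled
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
export NCCL_NET_GDR_LEVEL=5                   # permit GPUDirect RDMA at the widest scope
export NCCL_DEBUG=INFO                        # verbose profiling for initial runs
```

Once bandwidth is confirmed, NCCL_DEBUG would normally be dropped back to WARN to avoid log overhead in production runs.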
Question 36 of 60
36. Question
What is the primary purpose of GUIDs and LIDs in InfiniBand fabric addressing?
Correct
InfiniBand uses a two-tier addressing architecture: GUIDs are 64-bit globally unique hardware identifiers permanently assigned by manufacturers, ensuring each port has a unique identity across all fabrics. LIDs are 16-bit local addresses dynamically assigned by the Subnet Manager for efficient routing within a subnet. This combination provides global uniqueness for node identification while enabling fast, scalable local routing through compact LID-based forwarding tables in switches.
Question 37 of 60
37. Question
What is the fundamental architectural design principle of Cumulus Linux that distinguishes it from traditional network operating systems?
Correct
Cumulus Linux revolutionizes network switch architecture by implementing a full Linux distribution on bare-metal switching hardware. This Linux-based NOS design enables network engineers to use familiar Linux commands, automation tools (Ansible, Python), and DevOps workflows. Unlike proprietary systems, it supports open networking with hardware disaggregation across multiple vendors.
Question 38 of 60
38. Question
What is the primary purpose of Subnet Manager (SM) discovery in an InfiniBand fabric?
Correct
SM discovery is a critical fabric initialization process where nodes identify and establish communication with the active Subnet Manager. During fabric bring-up, devices use SM discovery to locate the active SM, enabling them to receive configuration information including Local Identifiers (LIDs), routing tables, and partition membership. This process ensures all fabric components can properly participate in subnet management and establish connectivity.
Question 39 of 60
39. Question
A team is training a 70B parameter LLM across 32 H100 GPUs distributed over 4 DGX nodes connected via InfiniBand. Which technology combination provides the MOST efficient collective communication for gradient synchronization during backpropagation?
Correct
Multi-node distributed training requires efficient all-reduce for gradient synchronization. NCCL 2.20+ provides the optimal solution through hierarchical all-reduce algorithms: NVLink 4.0 for intra-node GPU communication (900 GB/s) and GPUDirect RDMA over InfiniBand for inter-node transfers, bypassing CPU entirely. This topology-aware approach minimizes communication overhead during backpropagation, essential for training large models efficiently.
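In practice, NCCL's transport selection can be inspected and steered through environment variables. A minimal illustrative sketch (variable values such as the HCA name prefix are deployment-specific assumptions):

```shell
# Restrict NCCL to the intended InfiniBand HCAs (name prefix is an assumption)
export NCCL_IB_HCA=mlx5
# Allow GPUDirect RDMA between GPU and NIC regardless of PCIe topology distance
export NCCL_NET_GDR_LEVEL=SYS
# Log transport/algorithm selection so GPUDirect RDMA usage can be confirmed
export NCCL_DEBUG=INFO
```

With NCCL_DEBUG=INFO, the startup log shows whether the IB transport and GDRDMA path were actually chosen for inter-node rings.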
Question 40 of 60
40. Question
A UFM Cyber-AI deployment detects 85% fewer network anomalies after upgrading InfiniBand firmware across the cluster, despite unchanged traffic patterns. Security teams verify no actual reduction in malicious activity occurred. What is the most likely root cause of the false negative rate increase?
Correct
This troubleshooting scenario identifies ML model drift as the root cause. UFM Cyber-AI's anomaly detection relies on baseline behavioral models trained on normal network patterns. InfiniBand firmware upgrades alter protocol timing, packet structures, and flow characteristics. The ML models, trained on pre-upgrade patterns, cannot recognize these new legitimate behaviors as normal baseline. Anomalies now hide within unrecognized traffic patterns, creating detection blind spots. Resolution requires retraining models on post-upgrade baseline data, typically 7-14 days of clean traffic capture to re-establish normal behavior profiles.
Question 41 of 60
41. Question
What is the primary purpose of mlxfwmanager in ConnectX HCA firmware management?
Correct
mlxfwmanager is NVIDIA's centralized firmware management utility for ConnectX HCAs, providing capabilities to query installed firmware versions, perform updates, and manage firmware across multiple adapters. It streamlines firmware lifecycle operations in data centers with numerous InfiniBand adapters, ensuring consistent firmware versions and simplifying maintenance workflows. This tool is essential for maintaining firmware currency and security compliance across large-scale InfiniBand deployments.
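A typical workflow looks like the following sketch (flags assumed from the MFT toolset documentation; the image filename is a placeholder — verify syntax against your installed MFT version):

```shell
# List all detected ConnectX adapters with installed vs. available firmware versions
mlxfwmanager --query
# Burn a specific downloaded firmware image to matching adapters
mlxfwmanager -u -i fw-ConnectX7.bin
```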
Question 42 of 60
42. Question
Your organization is designing a 128-GPU AI training cluster using H100 GPUs across 16 nodes for multi-node LLM training. The architecture team proposes a fat-tree topology with oversubscription ratios of 1:1, 2:1, and 4:1. Which oversubscription ratio achieves non-blocking fabric design for full bisection bandwidth during NCCL all-reduce operations?
Correct
Non-blocking fabric design in fat-tree topology requires 1:1 oversubscription where uplink and downlink bandwidth are equal at all tiers. This ensures full bisection bandwidth, allowing simultaneous communication between any subset of nodes without network congestion. For multi-node LLM training using NCCL all-reduce, non-blocking fabric eliminates network bottlenecks during collective operations. Higher oversubscription ratios (2:1, 4:1) reduce costs but introduce blocking behavior and bandwidth contention.
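The ratio itself is simple arithmetic over leaf-switch port counts; a short sketch (port counts are illustrative, assuming equal link speeds on uplinks and downlinks):

```python
def oversubscription_ratio(downlinks: int, uplinks: int) -> float:
    """Leaf oversubscription = host-facing capacity / uplink capacity,
    assuming all ports run at the same speed."""
    return downlinks / uplinks

# A leaf with 32 host-facing 400G ports needs 32 same-speed uplinks for 1:1
print(oversubscription_ratio(32, 32))  # 1.0 -> non-blocking
print(oversubscription_ratio(32, 16))  # 2.0 -> 2:1, blocking under load
print(oversubscription_ratio(32, 8))   # 4.0 -> 4:1
```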
Question 43 of 60
43. Question
What is Network Functions Virtualization (NFV) in the context of NVIDIA BlueField DPU?
Correct
NFV on BlueField DPU enables running network services (firewalls, routers, load balancers) as software on the DPU's ARM processors and hardware accelerators. This offloads network processing from host CPUs, provides line-rate performance through hardware acceleration, and delivers flexible, software-defined network services without requiring dedicated physical appliances for each function.
Question 44 of 60
44. Question
A data center is deploying ConnectX-7 HCAs to connect H100 GPU nodes for distributed LLM training. The InfiniBand fabric supports both HDR (200 Gbps) and NDR (400 Gbps) speeds. When would you configure port link speed settings to HDR mode instead of using auto-negotiation?
Correct
Manual port configuration to HDR mode is appropriate when ensuring compatibility with legacy infrastructure that doesn't support NDR speeds. This prevents auto-negotiation failures and connection instability. For modern deployments with full NDR support, auto-negotiation or explicit NDR configuration maximizes bandwidth for distributed training. Port speed configuration is a fabric interoperability concern, not a GPU feature enabler.
Question 45 of 60
45. Question
Your organization is deploying a multi-node H100 cluster for distributed LLM training using NCCL over RoCE v2 Ethernet fabric. Network monitoring shows occasional packet drops during AllReduce operations, degrading training performance. When would you implement Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS) for ensuring zero packet loss?
Correct
Lossless Ethernet for RoCE requires PFC (Priority Flow Control) and ETS (Enhanced Transmission Selection) configured on all network elements. PFC provides hop-by-hop backpressure preventing packet drops during congestion, while ETS ensures RoCE traffic receives priority over best-effort traffic. This is critical for RDMA operations used by NCCL in distributed training, as RDMA has no native retransmission mechanism and packet loss causes severe performance degradation.
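On ConnectX hosts, a common configuration sketch uses the mlnx_qos utility from MLNX_OFED (interface name, priority choice, and bandwidth percentages are illustrative assumptions; flag syntax can vary between OFED releases):

```shell
# Enable PFC only on priority 3, the priority conventionally carrying RoCE traffic
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0
# Trust DSCP markings so RoCE packets are classified into the right traffic class
mlnx_qos -i eth0 --trust dscp
# ETS: give the RoCE traffic class (TC3 here) a guaranteed bandwidth share
mlnx_qos -i eth0 --tsa ets,ets,ets,ets,ets,ets,ets,ets --tcbw 10,10,10,50,5,5,5,5
```

The same PFC priority and ETS weights must be configured consistently on every switch hop, or the lossless guarantee breaks at the first unconfigured device.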
Question 46 of 60
46. Question
An AI training cluster with 128 H100 GPUs across 16 nodes requires differentiated QoS for NCCL all-reduce traffic (high priority) versus storage checkpoint traffic (low priority) over InfiniBand. What is the critical component required in the Subnet Manager configuration to enforce traffic class separation and prevent storage traffic from impacting training communication latency?
Correct
InfiniBand QoS via Subnet Manager fundamentally relies on Service Level to Virtual Lane mapping. The SM configures SL-to-VL tables at each port, enabling applications to tag traffic with Service Levels (0-15) that map to hardware-scheduled Virtual Lanes. VL arbitration then enforces bandwidth allocation and priority through configured weights. For AI training, NCCL traffic is assigned high-priority SLs mapping to VLs with aggressive arbitration weights, while checkpoint storage traffic uses low-priority SLs with minimal VL allocation. This hardware-enforced separation prevents cross-interference without requiring physical fabric segmentation or application-layer routing workarounds.
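In OpenSM this is expressed in the QoS options of opensm.conf; the fragment below is an illustrative sketch (SL assignments, VL weights, and the exact option names should be checked against your OpenSM version's documentation):

```
# opensm.conf fragment (illustrative)
qos TRUE
qos_max_vls 8
# VL arbitration: VL0 (training) gets a large weight, VL1 (storage) a small one
qos_vlarb_high 0:192
qos_vlarb_low 1:16
# SL-to-VL map, 16 entries: SL0 -> VL0 for NCCL, all other SLs -> VL1
qos_sl2vl 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
```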
Question 47 of 60
47. Question
A data center operator is deploying BGP route reflectors to scale their EVPN-VXLAN fabric from 32 to 128 leaf switches. They configure two route reflectors as clients of each other to provide redundancy. After deployment, routing instability and control plane loops are observed. What is the critical configuration error?
Correct
Redundant BGP route reflectors serving the same clients must be configured with the same CLUSTER_ID to prevent routing loops. The CLUSTER_LIST attribute enables loop detection by tracking which clusters have processed a route. When route reflectors lack proper CLUSTER_ID configuration, they cannot identify routes already reflected by their redundant peers, causing routes to be reflected repeatedly between route reflectors and creating control plane instability. This is a critical requirement for scaling BGP in data center fabrics with redundant route reflector deployments.
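Since Cumulus Linux uses FRR for BGP, the fix looks roughly like the fragment below, applied identically on both route reflectors (ASN, cluster-id value, and peer-group name are placeholders):

```
! FRR fragment on each redundant route reflector
router bgp 65000
 bgp cluster-id 10.0.0.100
 neighbor LEAF peer-group
 neighbor LEAF route-reflector-client
```

With a shared cluster-id, a route reflected by RR1 carries that cluster-id in its CLUSTER_LIST, so RR2 rejects the re-reflected copy instead of bouncing it back.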
Question 48 of 60
48. Question
An AI research team is deploying a 128-node H100 cluster for LLM training with models exceeding 400B parameters. During multi-node all-reduce operations, they observe significant performance degradation as training scales beyond 64 nodes. What is the critical component limiting cluster communication capacity?
Correct
Network bisection bandwidth is the critical limiting factor for large-scale distributed training. It represents the minimum bandwidth when partitioning the cluster, directly impacting all-reduce efficiency. At 64+ nodes, gradient synchronization requires massive cross-rack communication, exposing bisection bandwidth bottlenecks at spine layer switches. For 128-node H100 clusters training 400B+ parameter models, HDR (200 Gbps) or NDR (400 Gbps) InfiniBand with high bisection bandwidth ratios (1:1 or 2:1) is essential to prevent communication bottlenecks from dominating training time.
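The pressure on bisection bandwidth can be estimated from the standard ring all-reduce cost, where each rank transmits 2(N-1)/N times the buffer size; a sketch (model size and precision are illustrative assumptions):

```python
def allreduce_bytes_per_rank(buffer_bytes: float, n_ranks: int) -> float:
    """Bytes each rank transmits in one ring all-reduce of a buffer:
    2 * (N - 1) / N * buffer size."""
    return 2 * (n_ranks - 1) / n_ranks * buffer_bytes

# Gradients for a 400B-parameter model in fp16 (~800 GB) across 128 GPUs:
per_rank = allreduce_bytes_per_rank(800e9, 128)
print(f"{per_rank / 1e12:.2f} TB transmitted per rank per synchronization")
```

Most of that traffic crosses rack boundaries at scale, which is why the spine-layer bisection, not per-node NIC speed, becomes the binding constraint.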
Question 49 of 60
49. Question
Your multi-node H100 cluster experiences suboptimal NCCL AllReduce performance during LLM training over InfiniBand HDR. Network bandwidth utilization is only 40% despite adequate GPU compute. Which ConnectX HCA optimization parameter should you adjust first to improve collective operation throughput?
Correct
ConnectX HCA adaptive routing is the primary optimization for improving NCCL collective performance on InfiniBand fabrics. When bandwidth utilization is low despite adequate compute, the issue typically stems from traffic concentration on limited fabric paths. Enabling adaptive routing allows the HCA to distribute AllReduce traffic across multiple IB routes dynamically, eliminating congestion hotspots and maximizing effective bandwidth. This is particularly critical for multi-node training where NCCL ring algorithms generate predictable traffic patterns that benefit from path diversity.
Question 50 of 60
50. Question
Your AI cluster requires upgrading from HDR InfiniBand to support 400G connectivity between compute nodes using NVIDIA Quantum-2 switches. The infrastructure team asks which signaling rate to configure for optimal 400G throughput. Which configuration should you implement?
Correct
For 400G InfiniBand connectivity on NVIDIA Quantum-2 switches, NDR (Next Data Rate) is the correct configuration, operating at 100 Gbps per lane with standard 4x lane width. XDR delivers 800G (over-provisioned), while HDR and EDR are previous-generation standards insufficient for native 400G. NDR provides optimal performance for multi-node GPU clusters with GPUDirect RDMA support.
Question 51 of 60
51. Question
Your team is integrating SHARP Protocol with NCCL for multi-node LLM training on an 8-node H100 cluster with HDR InfiniBand. NCCL all-reduce operations show 30% slower performance than expected. What is the critical component missing from your SHARP with NCCL integration?
Correct
SHARP with NCCL integration requires NCCL_COLLNET_ENABLE=1 to activate collective offload operations. SHARP offloads all-reduce aggregation to InfiniBand switches, performing in-network computation that significantly accelerates multi-node training. Without COLLNET enabled, NCCL uses standard host-based reduction algorithms, bypassing SHARP entirely despite having SHARP-capable infrastructure. The 30% performance gap indicates NCCL is working over InfiniBand but not leveraging SHARP acceleration, which is the exact symptom of missing COLLNET enablement in production deployments.
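A minimal enablement sketch for the job environment (this assumes the SHARP-capable NCCL plugin, e.g. from HPC-X, is already installed and the SHARP aggregation manager is running on the fabric):

```shell
# Activate NCCL's CollNet path so all-reduce can be offloaded to SHARP
export NCCL_COLLNET_ENABLE=1
# Log initialization and network selection to confirm CollNet/SHARP is chosen
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
```

If the NCCL init log still reports only the plain IB transport, the plugin or the SHARP daemons, not the environment variable, are the next thing to check.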
Question 52 of 60
52. Question
A multi-tenant AI infrastructure hosts inference workloads for three tenants on H100 GPUs, each requiring guaranteed network bandwidth for model serving. Which resource allocation mechanism ensures bandwidth guarantees while preventing tenant interference during high-traffic periods?
Correct
Multi-tenant bandwidth guarantees require network-level resource allocation mechanisms. SR-IOV with QoS policies and minimum bandwidth reservations provides hardware-enforced network isolation through virtual functions, ensuring each tenant receives guaranteed bandwidth regardless of contention. This prevents tenant interference during high-traffic periods while maintaining predictable performance. MIG addresses GPU isolation, Kubernetes manages compute resources, and NVSwitch handles intra-node GPU communication—none directly guarantee external network bandwidth.
Question 53 of 60
53. Question
A multi-node H100 cluster experiences performance degradation during distributed training over RoCE v2 fabric. Network analysis shows packet drops during all-reduce operations despite 30% average bandwidth utilization. What is the MOST critical ECN configuration optimization to resolve this issue?
Correct
RoCE v2 fabric optimization for NCCL distributed training requires properly tuned ECN thresholds to handle bursty all-reduce traffic patterns. The 40KB/120KB RED_MIN/RED_MAX configuration enables early congestion notification before buffer exhaustion, critical for maintaining lossless operation. Packet drops despite low average utilization indicate microburst congestion that ECN's proactive signaling prevents. This works synergistically with PFC for optimal RoCE performance, allowing senders to adapt rates before reactive flow control becomes necessary.
Question 54 of 60
54. Question
A multi-tenant HPC cluster requires isolating GPU workloads between research teams while maintaining full bandwidth within each team's partition. Which Subnet Manager approach achieves PKey assignment for this isolation requirement?
Correct
InfiniBand partition isolation requires PKey (Partition Key) assignment through the Subnet Manager. Full membership PKeys (0x8000 bit set) allow unrestricted communication within a partition, while limited membership prevents cross-partition traffic. Hardware-enforced PKey checking at the switch level provides true multi-tenant isolation without impacting bandwidth. This differs from QoS mechanisms (VLs) which prioritize traffic, or routing (LIDs) which determines paths but doesn't enforce security boundaries between tenant workloads.
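With OpenSM, partitions are declared in partitions.conf; the fragment below is an illustrative sketch (partition names, PKey values, and port GUIDs are placeholders — consult the OpenSM partition configuration documentation for exact syntax):

```
# partitions.conf fragment (illustrative)
Default=0x7fff, ipoib : ALL=full ;
# Each team's ports get full membership only in their own partition
TeamA=0x0001 : 0x0002c9030001aaaa=full, 0x0002c9030001aaab=full ;
TeamB=0x0002 : 0x0002c9030001bbba=full, 0x0002c9030001bbbb=full ;
```

Because PKey enforcement happens in hardware at every HCA and switch port, TeamA traffic tagged 0x0001 is dropped at the first hop if it targets a TeamB port, with no bandwidth penalty inside either partition.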
Question 55 of 60
55. Question
A multi-node H100 cluster running distributed LLM training experiences unexpectedly high AllReduce latency during gradient synchronization. UFM monitoring shows InfiniBand network health is optimal, but NCCL collective operation traces reveal periodic stalls. What is the most critical integration component to investigate for diagnosing this collective operation performance issue?
Correct
Diagnosing periodic NCCL collective stalls with a healthy InfiniBand fabric requires UFM telemetry integration with NCCL's NVTX instrumentation to correlate network-level events with specific collective operation phases. This reveals whether InfiniBand's adaptive routing decisions create micro-congestion during specific NCCL algorithm phases (ring, tree, hierarchical AllReduce), causing serialization despite overall fabric health. This integration is critical for topology-aware optimization in large H100 clusters where collective algorithm patterns must align with fabric routing for optimal performance.
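The core of the correlation step described above is just a timestamp join: stall start times extracted from NCCL/NVTX traces matched against fabric events from telemetry within a small time window. The sketch below is purely illustrative; the event descriptions and timestamps are made up, and real data would come from UFM telemetry exports and NVTX trace analysis, not from hard-coded lists.

```python
# Hypothetical sketch of correlating NCCL stall times with fabric events.
# All data below is illustrative, not real UFM or NCCL output.
nccl_stalls = [10.02, 24.51, 39.00]          # seconds: start of each stall
fabric_events = [                            # (timestamp, description)
    (9.98, "adaptive-routing change on leaf-3"),
    (24.47, "adaptive-routing change on leaf-3"),
    (31.00, "routine port counter poll"),
]

def correlate(stalls, events, window=0.1):
    """Pair each stall with fabric events within +/- window seconds."""
    return [
        (stall, desc)
        for stall in stalls
        for (t, desc) in events
        if abs(stall - t) <= window
    ]

matches = correlate(nccl_stalls, fabric_events)
```

If most stalls line up with the same category of fabric event (here, two of three stalls coincide with routing changes on the same leaf), that points at routing/algorithm interaction rather than a failing link.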
Question 56 of 60
56. Question
A cloud service provider needs to reduce CPU overhead for network-intensive workloads while maintaining high throughput for tenant VMs. The infrastructure uses NVIDIA BlueField-2 DPUs with standard Ethernet connectivity. When would you use Ethernet offload for network function offloading in this scenario?
Correct
Ethernet offload on BlueField DPUs moves network stack processing (TCP/IP, checksums, segmentation), overlay networking (VXLAN/GENEVE encap/decap), and stateful network functions (firewalls, NAT, load balancing) from host CPU to DPU ARM cores. This reduces host CPU utilization for network-intensive workloads while maintaining high Ethernet throughput through hardware acceleration. The correct use case targets CPU offload for standard Ethernet protocols and network functions, not RDMA alternatives or host-side optimizations.
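To make the division of labor above concrete, the sketch below groups the offload categories named in the explanation and answers, for a given network function, where it would execute once Ethernet offload is enabled. The function names are illustrative labels, not a BlueField API.

```python
# Illustrative grouping of the offload categories from the explanation above.
# Labels are made up for the sketch; this is not a BlueField/DOCA API.
DPU_OFFLOADABLE = {
    "tcp-checksum", "segmentation",            # network stack processing
    "vxlan-encap", "geneve-decap",             # overlay networking
    "firewall", "nat", "load-balancer",        # stateful network functions
}

def runs_on(function: str) -> str:
    """Where a function executes once Ethernet offload is enabled."""
    return "dpu-arm-cores" if function in DPU_OFFLOADABLE else "host-cpu"
```

The design point is that everything in those three categories leaves the host CPU entirely, which is what frees cycles for the tenant VMs' own workloads.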
Question 57 of 60
57. Question
During multi-node H100 training job troubleshooting, ibstat shows CA 'mlx5_0' Port 1 as "State: Active" and "Physical state: LinkUp" but NCCL reports intermittent timeouts. Which ibstat output parameter most effectively identifies the degraded link causing collective operation failures?
Correct
ibstat port status verification for NCCL troubleshooting requires analyzing performance parameters beyond basic connectivity states. The Rate parameter reveals negotiated link-speed degradation (e.g., 100 vs. 200 Gb/sec HDR), the primary cause of collective operation bottlenecks. Active/LinkUp states confirm Layer 2 connectivity but don't expose bandwidth limitations. In H100 multi-node training with GPUDirect RDMA, degraded InfiniBand rates directly throttle NCCL all-reduce operations, manifesting as timeouts during synchronized gradient aggregation across nodes.
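The check described above is easy to automate: extract the Rate field from ibstat output and compare it to the expected speed. The sample text below mimics ibstat's formatting for the scenario in the question (exact layout can vary by driver version), so this is a parsing sketch rather than production tooling.

```python
# Sketch: flag a port whose negotiated Rate is below the expected HDR speed,
# even though State/Physical state look healthy. Sample output mimics ibstat.
import re

SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
"""

def port_rate_gbps(ibstat_output: str) -> int:
    """Pull the negotiated rate (Gb/sec) out of ibstat-style text."""
    match = re.search(r"Rate:\s*(\d+)", ibstat_output)
    if match is None:
        raise ValueError("no Rate field found")
    return int(match.group(1))

def link_degraded(ibstat_output: str, expected_gbps: int = 200) -> bool:
    """Active/LinkUp alone is not enough; compare negotiated rate to expected."""
    return port_rate_gbps(ibstat_output) < expected_gbps
```

Run against the sample, this flags the port: it is Active and LinkUp, yet negotiated at 100 Gb/sec instead of the expected 200 Gb/sec HDR, exactly the silent degradation the explanation describes.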
Question 58 of 60
58. Question
Which statement best describes Subnet Manager (SM) failover in an InfiniBand fabric?
Correct
SM failover is a high availability mechanism where standby Subnet Managers monitor the master SM's health. Upon detecting master SM failure through missed heartbeats, a standby SM automatically assumes the master role, taking over subnet management operations. This automatic failover ensures fabric continuity without manual intervention, preventing network downtime in production InfiniBand environments.
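The heartbeat-driven takeover above can be sketched as a tiny state machine: a standby SM counts consecutive missed heartbeats and promotes itself once a threshold is crossed, resetting whenever the master is heard from. The threshold value here is illustrative, not an OpenSM default.

```python
# Minimal sketch of heartbeat-based SM failover (threshold is illustrative).
class StandbySM:
    def __init__(self, missed_threshold: int = 3):
        self.missed_threshold = missed_threshold
        self.missed = 0
        self.role = "standby"

    def heartbeat_received(self):
        self.missed = 0                      # master is alive; reset counter

    def heartbeat_missed(self):
        self.missed += 1
        if self.role == "standby" and self.missed >= self.missed_threshold:
            self.role = "master"             # automatic takeover, no manual step

sm = StandbySM()
sm.heartbeat_missed(); sm.heartbeat_missed()
sm.heartbeat_received()                      # master recovered in time: no takeover
sm.heartbeat_missed(); sm.heartbeat_missed(); sm.heartbeat_missed()
```

Requiring several consecutive misses before promotion is the usual design choice: it avoids a spurious failover on a single dropped heartbeat while still bounding how long the fabric runs without a master.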
Question 59 of 60
59. Question
A team is deploying a high-throughput AI inference cluster with H100 GPUs using ConnectX-7 Ethernet adapters. Network profiling reveals significant CPU utilization during model serving traffic. Which ConnectX offload configuration would most effectively reduce CPU overhead while maintaining network performance?
Correct
ConnectX hardware offload features like TSO and checksum offload significantly reduce CPU utilization by moving TCP/IP processing to the NIC hardware. TSO handles packet segmentation for large transmissions, while checksum offload calculates and verifies checksums in hardware. These offloads are transparent to applications, work with standard TCP/IP traffic, and directly address the CPU overhead issue described in the scenario.
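In practice you would verify these offloads with `ethtool -k <iface>`. The sketch below parses ethtool-style feature output to confirm TSO and TX checksum offload are enabled; the sample text mimics the real format (feature names match ethtool's), but this is a parsing sketch, not an NVIDIA tool.

```python
# Sketch: confirm TSO and checksum offload are on, from `ethtool -k`-style text.
SAMPLE_ETHTOOL_K = """\
Features for eth0:
tx-checksumming: on
rx-checksumming: on
tcp-segmentation-offload: on
generic-receive-offload: off
"""

def offload_features(output: str) -> dict:
    """Map each feature name to True (on) / False (off)."""
    features = {}
    for line in output.splitlines():
        if ":" in line and not line.startswith("Features"):
            name, _, state = line.partition(":")
            features[name.strip()] = state.strip() == "on"
    return features

feats = offload_features(SAMPLE_ETHTOOL_K)
tso_and_csum_on = feats["tcp-segmentation-offload"] and feats["tx-checksumming"]
```

Because these offloads are transparent to applications, enabling them (e.g., `ethtool -K eth0 tso on tx on`) needs no changes to the model-serving stack itself, which is what makes them the lowest-friction fix for the CPU overhead in the scenario.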
Question 60 of 60
60. Question
Your multi-node GPU cluster experiences intermittent RDMA timeouts during large-scale LLM training. Analysis reveals ConnectX-7 adapters running firmware versions ranging from 28.35.1012 to 28.37.1014 across nodes. What is the MOST effective approach to optimize firmware management using mlxfwmanager?
Correct
Firmware version inconsistency across InfiniBand adapters causes RDMA protocol mismatches, leading to timeout behaviors during collective operations. The optimal approach uses mlxfwmanager's query capabilities to inventory current state, then implements orchestrated rolling updates to standardize on a validated firmware version. This maintains cluster availability, provides rollback safety, and ensures RDMA stability. Mixed firmware versions particularly impact NCCL all-reduce operations in multi-node training, where timing-sensitive GPU-to-GPU communication requires protocol consistency across the fabric.
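The inventory step above amounts to collecting per-node firmware versions and grouping them, so the rolling update can target out-of-date nodes in batches. In the sketch below, hostnames are illustrative and the version dictionary stands in for data you would actually gather by running `mlxfwmanager --query` on each node.

```python
# Sketch of the inventory/grouping step for a rolling firmware update.
# Hostnames are illustrative; versions match the range from the question.
from collections import defaultdict

node_firmware = {
    "gpu-node-01": "28.35.1012",
    "gpu-node-02": "28.37.1014",
    "gpu-node-03": "28.35.1012",
    "gpu-node-04": "28.37.1014",
}

def group_by_version(inventory: dict) -> dict:
    """Cluster nodes by installed firmware version."""
    groups = defaultdict(list)
    for node, version in sorted(inventory.items()):
        groups[version].append(node)
    return dict(groups)

def nodes_needing_update(inventory: dict, target: str = "28.37.1014") -> list:
    """Nodes not on the validated target version, to be updated in batches."""
    return sorted(node for node, ver in inventory.items() if ver != target)
```

Updating in small batches (drain, flash, reboot, re-verify with a query, then re-admit) keeps the rest of the cluster training while converging the fabric onto a single validated version.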