Your results for "NVIDIA NCP-AIN Practice Test 5"
0 of 60 questions answered correctly
Your final score is: 0
You have attempted: 0
Number of correct questions: 0 (scored 0)
Number of incorrect questions: 0 (negative marks 0)
You can review your answers by clicking the "View Answers" option. Important note: open reference documentation links in a new tab (right-click and choose "Open in New Tab").
Question 1 of 60
A multi-node H100 cluster using NCCL for distributed training experiences uneven network utilization across InfiniBand paths, with some links at 80% while others remain at 20%. What is the critical component that enables dynamic traffic redistribution across available paths to balance this load?
Correct
Adaptive routing's hash algorithm is the critical component for dynamic load balancing in InfiniBand fabrics. It monitors real-time congestion metrics and recalculates routing decisions to steer traffic toward underutilized paths, resolving the 80%/20% utilization imbalance. Static approaches lack congestion awareness, while NCCL and GPUDirect RDMA operate at the application and transfer layers respectively, relying on the fabric's adaptive routing for optimal path selection across the network topology.
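As a toy illustration of the contrast drawn in the explanation, the sketch below (plain Python with illustrative names, not switch firmware logic) compares a static hash mapping, which is congestion-blind, with an adaptive choice that steers new traffic to the least-loaded link:

```python
# Static hashing pins each flow to a fixed path regardless of load;
# the adaptive policy picks the currently least-utilized path instead.

def static_path(flow_id: int, paths: list) -> int:
    return flow_id % len(paths)  # fixed mapping, congestion-blind

def adaptive_path(utilization: list) -> int:
    return utilization.index(min(utilization))  # congestion-aware choice

util = [80, 20, 75, 25]  # percent load on four parallel links
print(adaptive_path(util))  # → 1  (the 20%-loaded link)
```

A static hash would keep sending the same flows to the 80%-loaded link no matter how congested it became; the adaptive choice is what rebalances the fabric.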
Question 2 of 60
A network engineer needs to achieve maximum packet processing throughput for a high-frequency trading application running on NVIDIA ConnectX-7 adapters. The application requires sub-microsecond latency and must bypass the kernel network stack. Which technology should be implemented for optimal data plane acceleration?
Correct
DPDK with poll-mode drivers is the optimal choice for data plane acceleration on ConnectX adapters, providing kernel bypass and direct NIC access. This eliminates context switches and interrupt overhead, achieving sub-microsecond latency through zero-copy packet processing. DPDK's architecture allows applications to directly access NIC queues in userspace, maximizing throughput and minimizing latency for performance-critical workloads like high-frequency trading.
Question 3 of 60
Your team is training a 70B parameter LLM across 16 H100 nodes using NCCL for distributed training. Network profiling shows AllReduce operations consume 40% of iteration time on InfiniBand HDR fabric. When would SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) integration with NCCL provide the most benefit?
Correct
SHARP with NCCL integration provides maximum benefit for multi-node distributed training with heavy AllReduce communication patterns on InfiniBand fabrics. By offloading gradient aggregation to network switches, SHARP reduces communication latency and CPU overhead. The 70B LLM scenario with 40% AllReduce time across 16 nodes represents an ideal use case, as SHARP can reduce network tree depth and accelerate collective operations critical for data-parallel gradient synchronization.
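The benefit can be sized with standard algorithmic cost formulas (a back-of-the-envelope sketch, not NCCL's actual accounting): a ring AllReduce moves about 2(N-1)/N times the buffer size per node, while switch-side aggregation needs roughly one buffer-sized send per node (plus the aggregated result back down the tree):

```python
# Per-node bytes sent for an AllReduce of a gradient buffer of size S.
# Ring cost 2(N-1)/N * S is the standard algorithm figure; the SHARP
# cost of ~S per node is an approximation for in-network aggregation.

def ring_bytes_per_node(size_bytes: float, n: int) -> float:
    return 2 * (n - 1) / n * size_bytes

def sharp_bytes_per_node(size_bytes: float) -> float:
    return size_bytes  # one aggregated send up the tree (approximate)

S = 140e9  # ~140 GB of fp16 gradients for a 70B-parameter model (illustrative)
print(ring_bytes_per_node(S, 16) / sharp_bytes_per_node(S))  # → 1.875
```

Even in this crude model, offloading aggregation to the switches nearly halves the per-node traffic for the 16-node case, on top of removing the ring's serialized hops.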
Question 4 of 60
An InfiniBand fabric experiences frequent SM failover events between primary and standby Subnet Managers, causing brief disruptions in NCCL communication during multi-node H100 training. Network logs show both SMs have identical priority values and proper connectivity. What is the most critical misconfiguration preventing stable SM operation?
Correct
Stable SM failover requires properly tuned heartbeat and timeout parameters. Aggressive timeout values cause standby SMs to incorrectly interpret transient delays as primary failures, triggering unnecessary takeovers. Correct configuration uses heartbeat intervals of 3-5 seconds with timeout thresholds set to 3x heartbeat duration, ensuring genuine failure detection while preventing false positives. Priority values and GUID selection provide deterministic election but do not affect operational stability once roles are established.
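The 3x rule of thumb quoted above is easy to encode as a sanity check (a hypothetical helper for review purposes, not an OpenSM configuration knob):

```python
# Flag SM failover settings where the timeout is less than 3x the
# heartbeat interval, i.e. where one or two lost heartbeats could
# trigger an unnecessary takeover.

def sm_timeout_ok(heartbeat_s: float, timeout_s: float) -> bool:
    return timeout_s >= 3 * heartbeat_s

print(sm_timeout_ok(5, 15))  # → True   (tolerates transient delays)
print(sm_timeout_ok(5, 6))   # → False  (aggressive; risks false failover)
```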
Question 5 of 60
A network administrator needs to configure a 100G uplink interface on a Cumulus Linux switch for GPU fabric connectivity. The interface must support RDMA over Converged Ethernet (RoCE) with priority flow control. Which configuration approach correctly enables the interface with necessary RoCE optimizations?
Correct
Cumulus Linux interface configuration for GPU fabric RoCE connectivity requires explicit speed settings and priority flow control through /etc/network/interfaces combined with lossless queue configuration. The link-speed parameter ensures 100G operation, link-pause enables flow control, and additional traffic.conf settings establish priority-based flow control necessary for RDMA workloads. This configuration prevents packet loss critical for NCCL collective operations in multi-GPU training environments using H100 or A100 GPUs.
Question 6 of 60
An AI research team is configuring an 8-GPU DGX H100 system for training a 70B parameter LLM using tensor parallelism. They need to minimize GPU-to-GPU communication latency during all-reduce operations within the node. Which technology should they prioritize for optimal peer-to-peer GPU communication?
Correct
For peer-to-peer GPU communication within a DGX H100 node, NVLink 4.0 with NVSwitch is the optimal technology. It delivers 900 GB/s bidirectional bandwidth per GPU and creates a fully connected fabric enabling direct all-to-all GPU communication without PCIe or CPU traversal. This is essential for tensor parallelism workloads requiring frequent all-reduce operations. Alternative approaches like PCIe, GPUDirect RDMA, or GPUDirect Storage either introduce unnecessary overhead or address different use cases entirely.
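A first-order estimate using the standard ring all-reduce cost model (ignoring launch latency and compute overlap) shows why the per-GPU link bandwidth dominates. The 900 GB/s figure is NVLink 4.0 from the explanation; ~64 GB/s per direction is the commonly cited PCIe Gen5 x16 number, used here only for contrast, and the 10 GB buffer size is illustrative:

```python
# Ring all-reduce time estimate: t ≈ 2(N-1)/N * S / BW.
# A first-order model only; real NCCL performance also depends on
# latency, chunking, and overlap with compute.

def allreduce_seconds(size_gb: float, n_gpus: int, bw_gb_s: float) -> float:
    return 2 * (n_gpus - 1) / n_gpus * size_gb / bw_gb_s

S = 10  # GB exchanged per step (illustrative)
print(round(allreduce_seconds(S, 8, 900), 4))  # NVLink 4.0 + NVSwitch → 0.0194
print(round(allreduce_seconds(S, 8, 64), 4))   # PCIe Gen5 x16 contrast → 0.2734
```

The order-of-magnitude gap is what makes NVLink/NVSwitch the right answer for frequent intra-node all-reduce in tensor parallelism.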
Question 7 of 60
A network administrator configures OSPF routing using NVUE commands on a Cumulus Linux switch. After a reboot, the administrator attempts to verify the configuration using vtysh and finds no OSPF settings displayed. What is the most likely cause of this issue?
Correct
NVUE and vtysh represent different configuration management paradigms in Cumulus Linux. NVUE provides a modern, declarative configuration interface that abstracts underlying services including FRRouting, maintaining its own configuration database. While NVUE properly configures and manages FRRouting daemons, vtysh queries FRRouting's native CLI configuration directly. This architectural separation means configurations made through NVUE may not be visible when querying through vtysh, even though they are active and functional. Administrators should use "nv show" commands to verify NVUE-managed configurations rather than switching to vtysh.
Question 8 of 60
A network engineer needs to calculate maximum theoretical throughput for a 100 Gbps Ethernet link to validate line rate performance of NVIDIA ConnectX-7 adapters in an H100 GPU cluster. Assuming standard 1500-byte MTU frames, which tool or method provides the most accurate wire speed calculation accounting for Ethernet framing overhead?
Correct
Wire speed calculation requires accounting for Ethernet framing overhead: the 14-byte header and 4-byte CRC carried in each frame, plus the 8-byte preamble and 12-byte inter-frame gap on the wire (1538 bytes in total per 1500-byte payload). The formula (Link Speed × Payload / Total Frame) provides theoretical maximum throughput. For 100 Gbps links with standard 1500-byte payload frames, this yields ~97.53 Gbps wire speed. This calculation validates whether ConnectX-7 adapters achieve line rate in H100 clusters.
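The calculation described above can be written out directly (assuming the standard 14-byte header, 4-byte CRC, 8-byte preamble, and 12-byte inter-frame gap):

```python
# Theoretical wire speed for Ethernet, accounting for per-frame overhead.
# For a 1500-byte payload the frame occupies 1538 bytes on the wire.

def wire_speed_gbps(link_gbps: float, payload: int = 1500) -> float:
    header, crc, preamble, ifg = 14, 4, 8, 12
    on_wire = payload + header + crc + preamble + ifg  # 1538 for payload=1500
    return link_gbps * payload / on_wire

print(round(wire_speed_gbps(100), 2))  # → 97.53
```

Raising the MTU to 9000 (jumbo frames) shrinks the relative overhead, which is why jumbo frames are common in RoCE fabrics.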
Question 9 of 60
An administrator is deploying RoCE v2 on a network fabric with NVIDIA Spectrum switches for GPU-to-GPU communication in a distributed training cluster. To prevent packet drops during congestion for lossless Ethernet operation, which PFC configuration step is required on the switch interfaces connecting to the GPU servers?
Correct
RoCE v2 requires Priority Flow Control (PFC) configured on the specific priority class used for RDMA traffic (typically CoS 3) to achieve lossless Ethernet operation. The switch interfaces must enable PFC on this priority class and configure DSCP trust mode to honor the traffic markings from GPU servers. This selective approach prevents packet drops during congestion for RoCE traffic while maintaining normal forwarding behavior for other traffic classes in the converged data center fabric.
Question 10 of 60
What is the primary purpose of addressing head-of-line blocking in AI/HPC network architectures?
Correct
Head-of-line blocking occurs when a blocked packet at the front of a queue prevents transmission of ready packets behind it, causing network congestion. In AI/HPC workloads with intensive multi-GPU communication (NCCL all-reduce), this creates cascading delays across distributed training. Solutions include virtual output queuing, multiple priority queues, and lossless Ethernet with priority flow control to maintain high throughput during collective operations.
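A minimal simulation of the effect (toy Python, not actual switch behavior) shows how a single FIFO stalls behind a blocked head while per-destination virtual output queues keep unblocked traffic moving:

```python
# With one shared FIFO, a blocked packet at the head stalls everything
# behind it; with a virtual output queue (VOQ) per destination, traffic
# to unblocked ports is unaffected.
from collections import deque

def drain(fifo: deque, blocked_dst: str) -> list:
    sent = []
    while fifo and fifo[0][1] != blocked_dst:  # blocked head => all stall
        sent.append(fifo.popleft())
    return sent

packets = [("p1", "B"), ("p2", "A"), ("p3", "A")]  # head goes to blocked port B

# Single FIFO: nothing behind the blocked head gets out.
print(drain(deque(packets), blocked_dst="B"))  # → []

# VOQ: one queue per destination, so port A's traffic still flows.
voq = {"A": deque(p for p in packets if p[1] == "A"),
       "B": deque(p for p in packets if p[1] == "B")}
print(list(voq["A"]))  # → [('p2', 'A'), ('p3', 'A')]
```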
Question 11 of 60
A network administrator needs to monitor fabric-wide EVPN route distribution and BGP neighbor states across 50 switches in real-time. Which NetQ management interface provides the most efficient approach for visualizing this topology-wide data with minimal command iterations?
Correct
NetQ UI provides the most efficient management interface for topology-wide visualization of distributed protocols like BGP and EVPN across large fabrics. Its graphical topology view eliminates repetitive CLI commands by displaying all neighbor relationships, route distributions, and link states simultaneously with interactive filtering. While CLI commands offer detailed per-device data and validation checks identify anomalies, the UI's visual correlation capabilities make it optimal for real-time monitoring of fabric-wide control plane state across 50 switches.
Question 12 of 60
Your data center is deploying ConnectX-7 HCAs for an AI training cluster requiring 400 Gbps InfiniBand connectivity. The network team needs to configure link speed and port modes across 96 HCA ports to ensure optimal performance. Which technology is best for configuring link speed and mode settings at scale?
Correct
UFM (Unified Fabric Manager) is NVIDIA's enterprise-grade solution for managing InfiniBand fabrics at scale, providing centralized configuration, monitoring, and validation of ConnectX HCA parameters. For 96 ports requiring consistent link speed (400 Gbps NDR) and mode settings, UFM enables bulk operations with audit trails and configuration verification. While mlxconfig works for individual HCAs, it doesn't scale efficiently for production deployments requiring orchestrated changes across multiple adapters.
Question 13 of 60
Your distributed training cluster experiences performance degradation due to uneven CPU core utilization when processing incoming RDMA traffic across multiple network flows. Network adapters show some cores overloaded while others remain idle. Which ConnectX feature should you configure to distribute incoming packet processing more evenly across CPU cores?
Correct
Receive-Side Scaling (RSS) is specifically designed to distribute incoming network traffic processing across multiple CPU cores by using hash functions on packet headers to assign flows to different receive queues. Each queue maps to a different CPU core, ensuring balanced interrupt handling and packet processing. This eliminates bottlenecks where single cores become overloaded while others idle, maximizing throughput in high-bandwidth scenarios like distributed training with NCCL over RDMA.
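The mechanism can be sketched as follows. Real ConnectX hardware uses a Toeplitz hash over the packet's 5-tuple; this illustration substitutes SHA-256 purely for simplicity:

```python
# RSS idea in miniature: hash each flow's 5-tuple to pick a receive
# queue, so distinct flows spread across queues/cores while all packets
# of one flow stay on the same queue (preserving in-order delivery).
import hashlib

def rss_queue(src_ip, dst_ip, src_port, dst_port, proto, num_queues=8):
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_queues

# Many flows (here, RoCE v2 traffic to UDP port 4791) fan out over queues.
flows = [("10.0.0.1", "10.0.1.1", 40000 + i, 4791, "udp") for i in range(32)]
print(len({rss_queue(*f) for f in flows}))  # number of distinct queues used
```

Each queue is then bound to a different CPU core (via its interrupt affinity), which is what evens out the per-core load the question describes.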
Question 14 of 60
A multi-node H100 cluster with 64 GPUs across 8 DGX systems uses InfiniBand HDR networking for distributed training. The infrastructure team needs centralized visibility into fabric health, topology changes, and congestion hotspots. Which approach achieves unified fabric management for this NCCL-based training environment?
Correct
NVIDIA UFM (Unified Fabric Manager) is the purpose-built solution for centralized InfiniBand fabric management in AI training clusters. It provides real-time topology discovery, health monitoring, congestion tracking, and performance analytics across all InfiniBand switches. UFM correlates fabric-level events with application performance, enabling proactive identification of issues affecting NCCL collective operations. While DCGM monitors GPUs and NCCL logs track communications, only UFM delivers unified InfiniBand fabric management.
Question 15 of 60
Your distributed training cluster with 16 H100 nodes experiences 23% GPU idle time during RDMA AllReduce operations. Profiling shows Completion Queue (CQ) polling overhead consuming significant CPU cycles. Which optimization approach will MOST effectively reduce CQ processing latency while maintaining high message throughput for sub-millisecond RDMA operations?
Correct
Completion Queue optimization for RDMA requires balancing low latency with CPU efficiency. Adaptive polling provides the best compromise: busy-wait captures immediate completions within microseconds (critical for AllReduce), while interrupt fallback prevents CPU waste on delayed operations. This approach addresses the 23% GPU idle time by reducing CQ processing overhead without sacrificing RDMA responsiveness. Pure polling wastes CPU, pure interrupts add latency, CQ sharing creates contention, and batching delays completions. The hybrid strategy aligns with NCCL's communication patterns in multi-GPU training.
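The adaptive strategy can be sketched as a two-phase receive loop. This is a simulation with a plain Python queue standing in for the completion queue; real verbs code would busy-poll with ibv_poll_cq and arm interrupts with ibv_req_notify_cq:

```python
# Hybrid completion handling: spin briefly to catch fast completions,
# then fall back to a blocking (interrupt-like) wait to free the CPU.
import queue
import threading

def get_completion(cq: queue.Queue, spin_iters: int = 10_000):
    # Phase 1: busy-poll, which catches completions arriving within microseconds.
    for _ in range(spin_iters):
        try:
            return cq.get_nowait()
        except queue.Empty:
            pass
    # Phase 2: blocking wait, the interrupt-driven fallback for slow completions.
    return cq.get()

cq = queue.Queue()
threading.Timer(0.01, lambda: cq.put("work-completion")).start()
print(get_completion(cq))  # → work-completion
```

Tuning spin_iters trades CPU burn against the latency penalty of the blocking path, which is exactly the balance the explanation describes.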
Question 16 of 60
What is the primary purpose of Adaptive Routing (AR) algorithms in InfiniBand fabric architecture?
Correct
Adaptive Routing algorithms enable InfiniBand fabrics to dynamically select optimal network paths based on real-time analysis of congestion, link utilization, and availability. Unlike static routing, AR continuously monitors fabric conditions and redirects traffic away from congested areas, maximizing throughput and minimizing latency. This is critical for GPU clusters running distributed training workloads with NCCL over InfiniBand.
Question 17 of 60
What is database scalability in the context of UFM (Unified Fabric Manager) Architecture when supporting large fabric deployments?
Correct
Database scalability in UFM Architecture refers to the system's ability to efficiently manage increasing amounts of fabric data as network deployments grow. For large-scale AI clusters with hundreds of InfiniBand switches and thousands of ports, UFM's database must handle switch inventory, topology information, performance counters, and telemetry data without degrading monitoring or management capabilities. This scalability is essential for modern GPU clusters.
Question 18 of 60
18. Question
Your InfiniBand fabric experiences intermittent performance degradation affecting distributed training jobs. You need to identify patterns and correlate network issues with specific time periods over the past 30 days. Which UFM capability would be most effective for this analysis?
Correct
Historical analysis with trending and reporting is specifically designed for retrospective investigation of network performance over extended periods. It provides time-series metric visualization, statistical trending, and correlation capabilities essential for identifying patterns in intermittent issues. This approach enables administrators to analyze bandwidth utilization, error rates, and fabric health metrics across the 30-day timeframe, correlating degradation events with specific timestamps to support root cause determination for distributed training performance issues.
Question 19 of 60
19. Question
What is the primary purpose of QoS (Quality of Service) via Subnet Manager in an InfiniBand fabric?
Correct
QoS via Subnet Manager enables administrators to configure traffic classes and service levels (SLs 0-15) that differentiate packet handling in InfiniBand fabrics. This ensures high-priority traffic like GPU-to-GPU communication or storage I/O receives preferential scheduling, lower latency, and dedicated buffer resources compared to lower-priority traffic, optimizing performance for latency-sensitive AI/HPC workloads.
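As a concrete illustration, SL-to-VL mapping and VL arbitration are typically expressed in the subnet manager's options file. The fragment below is a sketch in OpenSM's option syntax; the file path and all values are illustrative, not a recommended policy.

```
# Illustrative OpenSM options-file fragment (often /etc/opensm/opensm.conf).
qos TRUE
# Map the 16 service levels onto virtual lanes: SL0-7 -> VL0, SL8-15 -> VL1.
qos_sl2vl 0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
# Weighted VL arbitration: give VL1 (high-priority traffic) the larger weight.
qos_vlarb_high 1:192
qos_vlarb_low 0:64
```

With a mapping like this, latency-sensitive traffic marked with a high SL lands on VL1 and receives more scheduling weight than bulk traffic on VL0.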
Question 20 of 60
20. Question
An AI infrastructure team is deploying a 64-node H100 cluster with InfiniBand HDR networking for distributed LLM training. During fabric initialization, they need to identify specific nodes for diagnostics and track switch port assignments for topology optimization. When would GUIDs versus LIDs be most appropriate for these node identification tasks?
Correct
GUIDs are permanent 64-bit hardware identifiers ideal for node tracking across fabric changes, while LIDs are dynamic 16-bit addresses optimized for efficient packet routing. For the deployment scenario, GUIDs enable reliable hardware diagnostics and asset tracking, while LIDs support active topology optimization and forwarding table management during training operations.
Question 21 of 60
21. Question
An AI infrastructure team is deploying a 16-node GPU cluster with H100 systems for multi-node LLM training. They need to configure InfiniBand switches running Onyx OS to optimize GPUDirect RDMA performance. Which technology should be implemented on the IB switches to achieve the lowest latency for NCCL collective operations?
Correct
For InfiniBand switch configuration supporting multi-node GPU training, SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) combined with Adaptive Routing is the optimal technology. SHARP offloads NCCL collective operations to InfiniBand switch hardware, performing in-network aggregation that significantly reduces latency and GPU idle time. Adaptive Routing dynamically selects optimal paths to avoid congestion. This combination is specifically designed for GPUDirect RDMA workloads and is the recommended configuration for H100 clusters running distributed LLM training.
Question 22 of 60
22. Question
An AI training cluster uses H100 GPUs across multiple nodes for distributed LLM training. The infrastructure team needs to implement NDR 400 Gbps InfiniBand connectivity to minimize communication latency during all-reduce operations. Which ConnectX-7 configuration best achieves this NDR 400G capability for GPU-to-GPU communication?
Correct
ConnectX-7 HCAs provide NDR 400 Gbps InfiniBand capabilities essential for high-performance AI clusters. The dual-port NDR configuration with GPUDirect RDMA enables direct GPU-to-GPU memory access across nodes, bypassing CPU overhead. This is critical for NCCL-based distributed training where all-reduce operations dominate communication patterns. Native NDR InfiniBand delivers lower latency than RoCE alternatives and significantly higher bandwidth than previous-generation HDR (200G) or EDR (100G) solutions.
Question 23 of 60
23. Question
An AI infrastructure team deploys H100 GPUs with ConnectX-7 adapters for distributed LLM training. Network monitoring shows high CPU utilization during NCCL AllReduce operations despite enabling hardware offload features. Which configuration change would most effectively reduce CPU overhead during multi-GPU communication?
Correct
ConnectX-7 hardware offload features (TSO, LRO, checksum offload) are essential for reducing CPU overhead in high-throughput AI workloads. TSO and LRO together provide the most significant CPU relief by offloading packet segmentation and reassembly—the dominant overhead source during NCCL collective operations. These features allow the NIC to handle TCP protocol processing in hardware, freeing CPU resources for application workloads. Proper configuration requires enabling offload features via ethtool and verifying NCCL uses optimized network paths.
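In practice these offloads are toggled per interface with ethtool. A minimal sketch, assuming a hypothetical netdev name `eth0` (substitute the actual ConnectX-7 interface on your system):

```shell
# Hypothetical interface name; replace with the ConnectX-7 netdev (see `ip link`).
IFACE=eth0

# Enable TCP segmentation offload, large receive offload, and TX/RX checksum offload.
sudo ethtool -K "$IFACE" tso on lro on tx on rx on

# Verify which offloads are now active in hardware.
ethtool -k "$IFACE" | grep -E 'tcp-segmentation-offload|large-receive-offload|checksum'
```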
Question 24 of 60
24. Question
A deep learning team is training a 70B parameter LLM across 64 H100 GPUs distributed over 8 DGX nodes connected via InfiniBand HDR. Gradient synchronization currently consumes 35% of iteration time. Which SHARP Protocol approach achieves optimal all-reduce performance for this multi-node training scenario?
Correct
SHARP Protocol's in-network reduction offloads all-reduce operations from compute resources to InfiniBand switches, performing gradient aggregation during data transit through the fabric. For 64-GPU multi-node training, this eliminates redundant data transfers and reduces synchronization overhead by 30-50% compared to endpoint-based approaches. SHARP integrates with NCCL automatically, providing transparent optimization for distributed training collectives while maintaining full GPUDirect RDMA performance.
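When the SHARP plugin is installed, NCCL's in-network path is selected through its CollNet algorithm. A minimal sketch of the relevant environment settings (NCCL_DEBUG=INFO is only there so the selection can be verified in the startup logs):

```shell
# Enable NCCL's CollNet algorithm, which the SHARP plugin uses for in-network reduction.
export NCCL_COLLNET_ENABLE=1
# Print NCCL's transport/algorithm choices at startup so SHARP usage can be confirmed.
export NCCL_DEBUG=INFO
```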
Question 25 of 60
25. Question
An AI training cluster requires ultra-low latency networking for 64 DGX H100 nodes running distributed LLM training with frequent NCCL AllReduce operations. The network must support 400GbE connectivity with minimal latency overhead. Which Spectrum switch series best addresses these requirements?
Correct
The SN5000 series with Spectrum-4 ASIC is NVIDIA's latest switching platform specifically designed for AI infrastructure, supporting 400/800GbE with ultra-low latency and AI-optimized features. It includes adaptive routing that dynamically optimizes paths for NCCL collective operations, advanced telemetry for congestion management, and sub-microsecond latency critical for distributed training. Previous generations (SN2000-SN4000) lack both the required 400GbE bandwidth and AI-specific optimizations essential for large-scale GPU clusters.
Question 26 of 60
26. Question
An infrastructure team is deploying NVIDIA UFM (Unified Fabric Manager) to manage a 128-node InfiniBand fabric for AI training workloads. The team needs to determine the minimum server hardware requirements for the UFM server. Which configuration meets the baseline prerequisites for this deployment scale?
Correct
UFM server deployment requires specific hardware and software prerequisites to effectively manage InfiniBand fabric infrastructure. The minimum configuration includes 4 CPU cores, 8GB RAM, and 50GB disk space to handle telemetry collection, topology management, and analytics for typical fabric scales. Software requirements mandate supported Linux distributions (Ubuntu 22.04 LTS, RHEL 8.x/9.x) with Python 3.8+ runtime. These prerequisites ensure UFM can perform real-time fabric monitoring, congestion management, and automated remediation tasks across the network.
Question 27 of 60
27. Question
What is the primary architectural characteristic of Cumulus Linux as a network operating system?
Correct
Cumulus Linux's defining architectural characteristic is being a Linux-based NOS built on Debian, running networking protocols in user space. This design allows network switches to be managed like Linux servers using standard tools, automation frameworks, and DevOps practices, fundamentally differentiating it from proprietary monolithic network operating systems.
Question 28 of 60
28. Question
A GPU cluster administrator runs 'ibstat' on a compute node and observes 'State: Down' for port 1, while 'Physical state: LinkUp' shows the link is physically active. The link was functioning 24 hours ago with no hardware changes. What is the critical component to verify first for diagnosing this port status discrepancy?
Correct
The ibstat command reports both physical and logical port states independently. 'Physical state: LinkUp' confirms hardware-level connectivity (cable, transceiver, port electronics), while 'State: Down' indicates the logical port failed Subnet Manager initialization. This specific combination overwhelmingly points to SM communication failure—the port cannot obtain LID assignments or operational parameters. Critical verification involves checking SM status (the 'sminfo' command), verifying the node appears in the SM's discovered topology, and confirming no fabric partitioning occurred. Hardware diagnostics are unnecessary when physical state shows LinkUp.
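A sketch of that verification sequence using the standard infiniband-diags tools (command availability and exact output depend on the installed MLNX_OFED/infiniband-diags packages):

```shell
# 1. Re-check the local port: compare the logical 'State' with 'Physical state'.
ibstat

# 2. Query the master Subnet Manager (its LID, state, and priority).
#    If this times out, the node cannot reach any SM -- the likely root cause.
sminfo

# 3. List fabric links and confirm this node appears in the SM's discovered topology.
iblinkinfo
```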
Question 29 of 60
29. Question
What is the primary role of BGP EVPN in a VXLAN overlay network's control plane?
Correct
BGP EVPN is the control plane protocol for VXLAN overlay networks, using MP-BGP to distribute MAC/IP reachability information between VTEPs. It enables automatic endpoint discovery, reducing flooding and providing efficient Layer 2/Layer 3 connectivity across the fabric. The underlay handles physical connectivity and load balancing, while EVPN focuses purely on reachability advertisement.
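On Cumulus Linux and other FRR-based platforms, the EVPN control plane is enabled per BGP session under the l2vpn evpn address family. A sketch, with the ASNs and neighbor address as placeholder values:

```
router bgp 65001
 neighbor 10.0.0.1 remote-as 65000
 !
 address-family l2vpn evpn
  neighbor 10.0.0.1 activate
  advertise-all-vni
 exit-address-family
```

Here `advertise-all-vni` tells the VTEP to originate EVPN routes for its locally configured VNIs, so remote VTEPs learn MAC/IP reachability without flood-and-learn.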
Question 30 of 60
30. Question
A service provider needs to host multiple customers on a shared EVPN-VXLAN fabric while ensuring complete traffic isolation between tenants. Each customer requires their own independent IP address space that may overlap with other tenants. Which approach achieves VXLAN-based isolation for this multi-tenancy requirement?
Correct
VXLAN-based multi-tenancy is achieved by assigning unique VNIs to each tenant for overlay segmentation and mapping them to separate VRF instances for routing isolation. This combination enables complete traffic separation, supports overlapping IP address spaces, and maintains scalability. VNIs provide Layer 2 isolation within VXLAN tunnels, while VRFs ensure Layer 3 routing independence, creating robust tenant boundaries on shared physical infrastructure.
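At the Linux level, the VNI-to-VRF mapping can be sketched with iproute2 (the names, table number, VNI, and local VTEP address below are all hypothetical; production deployments typically drive this through NVUE or ifupdown2 rather than raw commands):

```shell
# Tenant A: dedicated VRF (routing table 100) for Layer 3 isolation.
ip link add vrf-tenantA type vrf table 100
ip link set vrf-tenantA up

# Tenant A's VXLAN tunnel: VNI 10100, local VTEP address 192.0.2.1, EVPN-learned
# remotes only (nolearning disables flood-and-learn).
ip link add vxlan10100 type vxlan id 10100 local 192.0.2.1 dstport 4789 nolearning
ip link set vxlan10100 master vrf-tenantA up
```

A second tenant would get its own VNI and VRF table, so overlapping prefixes never meet in a shared routing table.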
Question 31 of 60
31. Question
What is the primary method for deploying Cumulus Linux on bare-metal switches in an enterprise data center environment?
Correct
Cumulus Linux deployment on bare-metal switches relies on ONIE (Open Network Install Environment), the pre-installed bootloader that automates network OS discovery and installation. ONIE enables zero-touch provisioning by automatically locating and installing Cumulus Linux images via HTTP, FTP, or local USB, making it the standard deployment method for data center switch infrastructure.
Question 32 of 60
32. Question
Your team is configuring NCCL for multi-node LLM training on an 8-node H100 cluster with InfiniBand HDR (200 Gbps) networking. Which environment variable configuration ensures NCCL uses InfiniBand transport with GPUDirect RDMA for optimal GPU-to-GPU communication across nodes?
Correct
Optimal NCCL configuration for InfiniBand requires enabling IB transport (NCCL_IB_DISABLE=0), maximizing GPUDirect RDMA level (NCCL_NET_GDR_LEVEL=5), and specifying HCAs. This enables direct GPU-to-GPU memory access across nodes, bypassing CPU for minimum latency. Never disable P2P (preserves NVLink for intra-node) or force socket transport (adds CPU overhead). Proper IB configuration is critical for multi-node training efficiency.
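A minimal sketch of such a configuration (the HCA names are hypothetical; use the device names that `ibstat` reports on your nodes):

```shell
export NCCL_IB_DISABLE=0           # keep the InfiniBand transport enabled
export NCCL_NET_GDR_LEVEL=5        # most permissive GPUDirect RDMA distance setting
export NCCL_IB_HCA=mlx5_0,mlx5_1   # hypothetical HCA names; check `ibstat` output
```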
Question 33 of 60
33. Question
What is the primary purpose of DCQCN (Data Center Quantized Congestion Notification) tuning in RoCE networks?
Correct
DCQCN is the congestion control algorithm for RoCE networks that uses Explicit Congestion Notification (ECN) to dynamically adjust transmission rates. Tuning DCQCN parameters optimizes the balance between aggressive bandwidth utilization and congestion avoidance, critical for maintaining low-latency RDMA performance in GPU clusters.
Question 34 of 60
34. Question
A data center architect is designing a leaf-spine network fabric with 64 leaf switches and 16 spine switches. The team wants to minimize configuration complexity while implementing BGP for underlay routing. When would eBGP unnumbered be the most appropriate approach for establishing BGP peering relationships?
Correct
eBGP unnumbered simplifies BGP peering configuration in leaf-spine fabrics by eliminating the need for explicit IP address assignment on point-to-point links. It uses interface references and link-local addresses automatically, reducing configuration complexity in large-scale deployments with hundreds of interconnections. This approach is ideal for modern data center underlay networks where automation and scalability are priorities, but is not suitable for WAN links or environments requiring detailed IP address documentation.
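In FRR (as used on Cumulus Linux), an unnumbered eBGP session is declared directly against the interface, with `remote-as external` accepting whatever ASN the peer presents. A sketch with a placeholder ASN, router ID, and port names:

```
router bgp 65101
 bgp router-id 10.10.10.1
 ! BGP unnumbered: peer over IPv6 link-local addresses on each fabric interface,
 ! so no per-link IP addressing plan is needed.
 neighbor swp1 interface remote-as external
 neighbor swp2 interface remote-as external
```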
Question 35 of 60
35. Question
Your multi-node H100 cluster running distributed LLM training shows inconsistent NCCL AllReduce performance across InfiniBand fabric. UFM telemetry reveals packet drops on specific switch ports during collective operations. What is the most effective optimization approach to monitor and resolve NCCL collective operation bottlenecks using UFM integration?
Correct
UFM integration with NCCL through SHARP provides the most effective optimization for collective operation monitoring by offloading AllReduce to InfiniBand switches and exposing detailed telemetry. This enables correlation between NCCL collective timing and fabric-level congestion, identifying specific switch ports causing packet drops. SHARP's switch-based aggregation reduces GPU traffic while UFM's real-time metrics pinpoint topology bottlenecks, enabling immediate resolution and continuous monitoring of multi-node distributed training performance.
Question 36 of 60
36. Question
A telecom provider is deploying NFV infrastructure using NVIDIA BlueField-3 DPUs to offload virtualized network functions from x86 servers. During integration testing, VNF packet processing latency is 40% higher than expected despite DPU utilization at only 35%. What is the most critical component to optimize for improving NFV performance on the DPU?
Correct
NFV on BlueField DPUs requires optimizing DOCA acceleration engines and zero-copy data paths to achieve expected performance. The combination of high latency and low DPU utilization indicates underutilization of hardware accelerators rather than resource exhaustion. BlueField's architecture provides specialized engines for packet parsing, crypto, and pattern matching that must be explicitly configured through DOCA APIs. Zero-copy mechanisms eliminate CPU intervention by enabling direct memory access between VNF instances, leveraging the DPU's integrated networking and compute fabric for optimal NFV performance.
Question 37 of 60
37. Question
An AI infrastructure team is deploying a multi-node H100 cluster for distributed LLM training using NCCL over 100GbE Ethernet fabric. They need to configure RoCEv2 for optimal RDMA performance with GPU-to-GPU communication across nodes. Which configuration approach ensures proper RoCEv2 protocol operation for RDMA over UDP/IP?
Correct
RoCEv2 protocol requires UDP/IP encapsulation on port 4791 to enable RDMA over routable Ethernet infrastructure. Proper configuration includes ECN for congestion notification, PFC on dedicated priority classes for lossless transport, DSCP-based QoS classification, and jumbo frames (MTU 9000+) for optimal throughput. This ensures GPUDirect RDMA operates efficiently across multi-node GPU clusters with NCCL collective operations during distributed training workloads.
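As a rough back-of-the-envelope sketch (header sizes are the nominal RoCEv2 encapsulation; preamble, inter-frame gap, and optional headers such as VLAN tags are ignored), the jumbo-frame recommendation can be quantified by comparing payload efficiency at MTU 1500 versus 9000:

```python
# Payload efficiency of a RoCEv2 frame at a given IP MTU.
# Nominal sizes: Ethernet header 14 B, FCS 4 B, IPv4 20 B,
# UDP 8 B (destination port 4791), InfiniBand BTH 12 B, iCRC 4 B.

ROCEV2_UDP_PORT = 4791  # IANA-assigned RoCEv2 destination port

def payload_efficiency(mtu: int) -> float:
    """Fraction of each Ethernet frame that carries RDMA payload.

    The IP MTU covers IP + UDP + BTH + payload + iCRC; the Ethernet
    header and FCS sit outside the MTU.
    """
    payload = mtu - (20 + 8 + 12 + 4)  # strip IP/UDP/BTH/iCRC
    frame = mtu + 14 + 4               # add Ethernet header and FCS
    return payload / frame

print(f"MTU 1500: {payload_efficiency(1500):.1%} payload")
print(f"MTU 9000: {payload_efficiency(9000):.1%} payload")
```

At MTU 1500 roughly 96% of each frame is RDMA payload; at MTU 9000 it rises above 99%, with correspondingly fewer per-packet processing events, which is one reason jumbo frames are recommended for RoCEv2 fabrics.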
Question 38 of 60
38. Question
An AI infrastructure team needs to diagnose intermittent training slowdowns in their 16-node H100 cluster running distributed LLM workloads. They suspect network congestion but need quantitative data to identify specific links experiencing issues. Which UFM monitoring approach would most effectively identify the problematic network segments?
Correct
UFM performance counters are the optimal tool for diagnosing network-related training slowdowns because they provide real-time, per-port granular metrics for throughput (bytes transmitted/received) and errors (symbol errors, CRC errors, link downed events). These counters enable precise identification of congested or failing links in the InfiniBand fabric. During distributed training with NCCL collective operations, performance counters correlate training slowdowns with specific network bottlenecks, unlike historical analytics or administrative logs.
Question 39 of 60
39. Question
A network operations team monitors 500 GPU compute nodes running distributed AI training workloads. SNMP polling at 30-second intervals causes CPU spikes on management servers and misses transient network congestion events during collective operations. What is the critical architectural difference that would address these limitations?
Correct
The critical architectural difference is streaming telemetry's push-based continuous data transmission versus SNMP's pull-based polling model. Streaming eliminates management server overhead by having network devices autonomously push metrics at sub-second intervals, capturing transient congestion during GPU collective operations that 30-second SNMP polling misses. This fundamental shift from request-response to event-driven streaming provides both reduced CPU load and microsecond-level visibility essential for monitoring high-speed InfiniBand/RoCE networks supporting NCCL distributed training workloads across hundreds of GPU nodes.
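A toy calculation (the burst duration and sample intervals are illustrative assumptions, not measured values) shows why a short congestion burst is invisible to 30-second polling yet well sampled by sub-second streaming:

```python
# How many telemetry samples land inside a transient event of a given
# duration, for a given sampling interval. Integer milliseconds avoid
# floating-point surprises.

def samples_per_event(event_ms: int, sample_interval_ms: int) -> int:
    """Number of samples that fall within the event window."""
    return event_ms // sample_interval_ms

burst_ms = 200  # hypothetical 200 ms congestion burst during an NCCL collective

print(samples_per_event(burst_ms, 30_000))  # SNMP polled every 30 s -> 0 samples
print(samples_per_event(burst_ms, 50))      # streamed every 50 ms  -> 4 samples
```

With zero samples inside the burst, 30-second polling can only infer the event indirectly from averaged counters, while the streamed series captures its shape directly.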
Question 40 of 60
40. Question
Your team is deploying an 8-GPU H100 DGX system for distributed LLM training using NeMo Framework with NCCL. During multi-GPU all-reduce operations for gradient synchronization, which interconnect technology provides optimal east-west GPU-to-GPU communication bandwidth within the node?
Correct
For east-west GPU-to-GPU traffic within a single node, NVLink 4.0 with NVSwitch is the optimal solution, providing 900 GB/s bidirectional bandwidth per GPU. This fully connected fabric enables direct peer-to-peer memory access for NCCL all-reduce operations during distributed training, bypassing both PCIe and CPU. InfiniBand and GPUDirect RDMA are designed for north-south inter-node communication, while PCIe Gen 5 offers significantly lower bandwidth (128 GB/s) for intra-node GPU transfers.
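To make the bandwidth gap concrete, here is a rough sketch (the 10 GB gradient bucket is a hypothetical workload size) of ring all-reduce time over each intra-node fabric, using the per-GPU figures cited above (900 GB/s NVLink 4.0 bidirectional, 128 GB/s PCIe Gen 5 x16):

```python
# Ring all-reduce moves about 2*(N-1)/N bytes per GPU for every byte of
# gradient data. Estimate the bandwidth-bound transfer time for 8 GPUs
# synchronizing a 10 GB bucket over each fabric (latency ignored).

def ring_allreduce_seconds(data_gb: float, n_gpus: int, bw_gb_s: float) -> float:
    wire_gb = 2 * (n_gpus - 1) / n_gpus * data_gb  # traffic per GPU on the wire
    return wire_gb / bw_gb_s

print(f"NVLink 4.0: {ring_allreduce_seconds(10, 8, 900) * 1e3:.1f} ms")
print(f"PCIe Gen 5: {ring_allreduce_seconds(10, 8, 128) * 1e3:.1f} ms")
```

Under these assumptions the NVLink transfer completes in roughly 19 ms versus about 137 ms over PCIe, a factor of ~7 that compounds on every gradient synchronization step.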
Question 41 of 60
41. Question
A data center architect is evaluating BlueField-3 DPU deployment modes for a multi-tenant HPC cluster requiring both network acceleration and isolated control plane management. The infrastructure team needs to decide between embedded and separated host modes. What is the critical architectural component that differentiates these two modes?
Correct
The critical differentiator between embedded and separated host modes is the control plane execution location. Embedded mode runs the entire DPU control plane (DOCA services, network stack, management functions) on the DPU's ARM cores, providing complete isolation from the host. Separated host mode executes the control plane on the x86 host CPU while the DPU handles data plane acceleration. This distinction impacts security boundaries, resource utilization, and operational independence. Network topology support, NVLink integration, and GPUDirect RDMA capabilities remain consistent across both modes.
Question 42 of 60
42. Question
A data center architect is integrating VXLAN overlay for network virtualization across multiple GPU compute clusters. The VXLAN segments must support multi-tenancy with Layer 2 extension while maintaining optimal east-west traffic flow for distributed AI training workloads. What is the critical component that EVPN-VXLAN provides to enable scalable MAC address learning without flooding in the VXLAN overlay?
Correct
EVPN-VXLAN integration fundamentally transforms VXLAN overlay network virtualization by replacing data plane flooding with BGP-based control plane MAC learning. Type-2 MAC/IP advertisement routes enable VTEPs to proactively distribute MAC reachability information, eliminating unknown unicast flooding and enabling efficient multi-tenant scaling. This architecture is critical for GPU compute clusters where east-west bandwidth efficiency directly impacts distributed training performance across VXLAN segments.
Question 43 of 60
43. Question
A research institute is deploying a 128-node GPU cluster for multi-node LLM training with H100 GPUs. The workload requires frequent all-reduce operations across all nodes with minimal latency. Which InfiniBand technology should be selected to support optimal GPU-to-GPU communication bandwidth?
Correct
NDR InfiniBand at 400 Gbps is the optimal choice for H100 multi-node clusters. It provides sufficient bandwidth for GPUDirect RDMA and NCCL all-reduce operations across 128 nodes. HDR at 200 Gbps would bottleneck performance, while XDR at 800 Gbps targets newer adapter generations than the ConnectX-7 NICs paired with H100 systems. EDR is legacy technology insufficient for modern GPU clusters requiring high-bandwidth, low-latency inter-node communication.
Question 44 of 60
44. Question
Your 8-node DGX H100 cluster using HDR InfiniBand shows degraded NCCL AllReduce performance. Running 'ibdiagnet' reveals LinkDowned counters incrementing on ports connected to spine switches, while SymbolErrorCounter remains zero. Physical inspections show no visible cable damage. What is the MOST effective optimization strategy to diagnose this physical layer issue?
Correct
LinkDowned counters incrementing with zero SymbolErrors indicates physical links establish successfully but drop intermittently due to environmental factors rather than signal integrity problems. Temperature-induced transceiver throttling or power delivery instability causes transceivers to cycle links without corrupting data streams. Effective diagnosis requires correlating LinkDowned timestamps with switch environmental telemetry (temperature sensors, power supply metrics) to identify thermal hotspots or power issues. This targeted approach isolates root causes efficiently compared to replacing functional cables or misconfiguring network parameters.
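The correlation step described above can be sketched as follows; the timestamps, temperatures, 70 deg C threshold, and 60 s window are hypothetical illustration values, not UFM output:

```python
# Flag LinkDowned events that coincide (within a time window) with a
# switch temperature reading above a thermal threshold.

def correlate(link_down_ts, temp_samples, threshold_c=70.0, window_s=60.0):
    """Return link-down timestamps that have a hot reading nearby.

    link_down_ts: list of event timestamps (seconds).
    temp_samples: list of (timestamp, deg_c) tuples from environmental telemetry.
    """
    flagged = []
    for t in link_down_ts:
        for ts, temp in temp_samples:
            if abs(ts - t) <= window_s and temp >= threshold_c:
                flagged.append(t)
                break
    return flagged

events = [100.0, 5000.0]                  # two LinkDowned events
temps = [(80.0, 72.5), (4000.0, 55.0)]    # (timestamp, deg C) readings

print(correlate(events, temps))           # -> [100.0]
```

Only the event that falls within 60 s of a reading above threshold is flagged, pointing the investigation at a specific thermal hotspot rather than at the cables.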
Question 45 of 60
45. Question
A distributed AI training cluster requires low-latency GPU-to-GPU communication across nodes using InfiniBand. When configuring RDMA Queue Pairs (QPs) for multi-node NCCL communication with GPUDirect RDMA, which QP connection type should be established for optimal point-to-point data transfers?
Correct
Reliable Connection (RC) Queue Pairs are the standard for RDMA-enabled distributed training over InfiniBand. RC QPs provide connection-oriented, reliable, ordered delivery essential for NCCL's all-reduce and collective operations. Each GPU establishes dedicated RC QPs to peer GPUs, enabling GPUDirect RDMA for direct GPU memory access across nodes. This configuration ensures data integrity while maximizing bandwidth utilization (200-400 Gbps on HDR/NDR InfiniBand) for multi-node training workloads.
Question 46 of 60
46. Question
A network team is deploying NetQ agents across 500 switches in their data center fabric. After installation, 150 agents fail to report telemetry data to the NetQ server, despite successful installation. Network connectivity tests show all switches can reach the NetQ server IP. What is the most critical component to verify for proper NetQ agent operation?
Correct
NetQ agent deployment requires two distinct phases: installation (package deployment) and configuration (server registration). The critical component for operational agents is the server configuration via 'netq config add server <netq-server-ip> port 31980', which points the agent at the telemetry streaming endpoint. Without this configuration, agents remain idle despite successful installation and network connectivity. This is the most common deployment failure pattern, where installation automation succeeds but configuration steps are missed, resulting in silent agents that cannot report data.
Question 47 of 60
47. Question
A network team is implementing NetQ validation for datacenter fabric upgrades affecting 120 switches. Before executing BGP configuration changes, they must establish a baseline for pre/post change verification. What is the critical component that ensures accurate change impact assessment?
Correct
Pre-change snapshot creation is the critical component for accurate change validation, establishing an immutable baseline of complete network state before modifications. NetQ snapshots capture routing protocols, adjacencies, forwarding tables, and configurations, enabling precise post-change comparison to quantify impact and verify intended outcomes. Without snapshots, validation relies on subjective observations rather than objective state comparison, making it impossible to definitively assess change success or identify unintended consequences during large-scale fabric upgrades.
Question 48 of 60
48. Question
A data center network architect needs to isolate GPU training traffic from storage traffic on an InfiniBand fabric managed by OpenSM. The architecture requires strict access control where compute nodes in partition 0x8001 cannot communicate with storage nodes in partition 0x8002. What is the critical component that must be correctly configured to enforce this partition isolation?
Correct
InfiniBand partition isolation is enforced through PKey table entries with membership types configured on each HCA port. The Subnet Manager programs these tables based on the partition policy, assigning full membership (high bit set, e.g. 0x8001) or limited membership (high bit clear, e.g. 0x0001). Limited members can only communicate with full members in the same partition, creating strict access control. This hardware-level enforcement at the HCA prevents unauthorized inter-partition communication, making PKey table configuration the critical component for partition isolation in multi-tenant or segmented fabric architectures.
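The membership rule can be sketched as a small check (a simplification for illustration; real enforcement happens in the HCA against SM-programmed PKey tables):

```python
# A 16-bit PKey encodes membership in its top bit (1 = full, 0 = limited)
# and the partition identity in the low 15 bits. Two ports may communicate
# only if they share a partition and at least one is a full member.

FULL_BIT = 0x8000

def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    same_partition = (pkey_a & 0x7FFF) == (pkey_b & 0x7FFF)
    at_least_one_full = bool((pkey_a | pkey_b) & FULL_BIT)
    return same_partition and at_least_one_full

print(can_communicate(0x8001, 0x0001))  # full <-> limited, same partition: True
print(can_communicate(0x0001, 0x0001))  # limited <-> limited: False
print(can_communicate(0x8001, 0x8002))  # different partitions: False
```

This is why the scenario's compute partition 0x8001 and storage partition 0x8002 stay mutually unreachable: their base PKeys differ, so no membership combination permits traffic between them.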
Question 49 of 60
49. Question
An AI research team is designing a 32-node GPU cluster for distributed LLM training with 8x H100 GPUs per node. They require full bisection bandwidth to eliminate communication bottlenecks during all-reduce operations across all 256 GPUs. Which network fabric technology should they implement to achieve non-blocking, full bisection bandwidth across the entire cluster?
Correct
Full bisection bandwidth requires non-blocking fabric architecture where aggregate bandwidth between any two halves of the network equals the sum of all host connections. InfiniBand NDR with fat-tree topology and 1:1 subscription ratio achieves this by ensuring sufficient spine-to-leaf bandwidth to support concurrent communication across all nodes. This is critical for distributed LLM training where NCCL performs frequent all-reduce operations requiring simultaneous multi-node communication without contention.
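As a rough sketch (the leaf port split is illustrative, not a validated switch design), the 1:1 non-blocking condition and the resulting bisection bandwidth for this 32-node cluster can be checked numerically:

```python
# In a two-tier fat-tree, a leaf is non-blocking (1:1) when its uplink
# capacity matches its downlink capacity. With that condition met, the
# bisection bandwidth equals half the nodes' aggregate injection bandwidth.

def is_nonblocking(downlinks_per_leaf: int, uplinks_per_leaf: int) -> bool:
    """1:1 subscription: uplink bandwidth >= host-facing bandwidth."""
    return uplinks_per_leaf >= downlinks_per_leaf

def bisection_bw_gbps(n_nodes: int, link_gbps: int) -> float:
    """Full bisection: half the nodes can talk to the other half at line rate."""
    return n_nodes / 2 * link_gbps

print(is_nonblocking(16, 16))      # hypothetical 32-port leaf: 16 down, 16 up
print(bisection_bw_gbps(32, 400))  # 32 nodes at NDR 400 Gbps -> 6400.0 Gbps
```

Any leaf configured with fewer uplinks than host ports introduces oversubscription, and concurrent all-reduce traffic across the spine would then contend for the reduced bisection.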
Question 50 of 60
50. Question
A team is deploying a 32-node cluster with 8x H100 GPUs per node for training a 400B parameter LLM using 3D parallelism. Each node has NVLink for intra-node communication. Which approach correctly sizes the inter-node network bandwidth for efficient all-reduce operations during training?
Correct
Inter-node network sizing for multi-GPU LLM training requires calculating bandwidth based on gradient synchronization patterns during all-reduce operations. For 32-node H100 clusters, 400 Gbps InfiniBand NDR per node provides sufficient bandwidth for NCCL collectives with GPUDirect RDMA, considering aggregate gradient size and synchronization frequency. This prevents network bottlenecks that would stall GPU computation during distributed training with 3D parallelism.
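An order-of-magnitude sizing check can be sketched as follows; the 25 GB per-step gradient shard is a hypothetical figure, since 3D parallelism leaves each data-parallel group only a slice of the full 400B-parameter gradient:

```python
# Bandwidth-bound estimate of one ring all-reduce over the inter-node
# fabric: 2*(N-1)/N bytes on the wire per rank for each byte of gradient.

def allreduce_seconds(grad_gb: float, n_ranks: int, link_gbps: float) -> float:
    wire_gb = 2 * (n_ranks - 1) / n_ranks * grad_gb  # ring all-reduce traffic
    return wire_gb * 8 / link_gbps                   # GB -> Gb, divide by rate

t = allreduce_seconds(grad_gb=25, n_ranks=32, link_gbps=400)
print(f"{t:.2f} s per synchronization at NDR")  # HDR (200 Gbps) would double it
```

Under these assumptions a synchronization costs just under one second at 400 Gbps; whether that stalls the GPUs then depends on how much of it overlaps with backward-pass computation.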
Question 51 of 60
51. Question
What is the primary purpose of Cumulus Linux integration with NetQ in network management environments?
Correct
Cumulus Linux integration with NetQ enables comprehensive network telemetry and monitoring by deploying NetQ agents on Cumulus switches. These agents collect real-time data about network state, configurations, and performance metrics, providing operators with unified visibility across the network fabric for proactive issue detection and validation.
Question 52 of 60
52. Question
You are configuring an InfiniBand subnet manager for a 128-node AI training cluster with multiple paths between switches. The workload requires consistent low-latency routing with automatic failover. Which path computation algorithm should you configure to balance traffic across available links while maintaining deterministic routing?
Correct
The optimal configuration combines MinHop path computation with LMC to create multiple shortest paths per destination. The Subnet Manager assigns multiple LIDs (based on the LMC value) to each port, enabling traffic distribution while maintaining deterministic per-QP routing. This approach leverages InfiniBand’s path affinity mechanism, where each Queue Pair consistently uses the same path, preventing packet reordering while balancing load across the fabric’s multiple physical links, which is critical for NCCL collective communication in distributed training.
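The LMC mechanism is simple enough to sketch: a port configured with LMC=n answers to 2**n consecutive LIDs, and each LID gives the Subnet Manager an independent route to that port. The base LID below is a hypothetical value for illustration.

```python
# LMC expands addressing: a port with LMC=n is assigned 2**n consecutive
# LIDs, each of which can be routed differently through the fabric.

def lid_range(base_lid: int, lmc: int) -> range:
    """LIDs assigned to a port: base_lid .. base_lid + 2**lmc - 1."""
    return range(base_lid, base_lid + 2 ** lmc)

lids = list(lid_range(0x10, 3))   # LMC=3 -> 8 LIDs, 8 candidate paths
print([hex(l) for l in lids])
```

Each Queue Pair picks one destination LID and sticks with it, which is the path-affinity behavior the explanation describes.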
Question 53 of 60
53. Question
A multi-tenant AI cluster with 128 H100 GPUs uses InfiniBand HDR for distributed training. Different research teams require isolated network fabrics to prevent cross-tenant traffic interference and ensure secure communication. What is the critical component for implementing fabric isolation and multi-tenancy in this InfiniBand environment?
Correct
Partition keys (PKeys) are the critical InfiniBand component for fabric isolation and multi-tenancy. They create virtual network partitions within the physical fabric, enforced at the hardware level by InfiniBand adapters and switches. Only nodes with matching PKeys can communicate, providing security isolation between tenants. Unlike Ethernet VLANs or QoS mechanisms, PKeys are native to InfiniBand architecture and provide hardware-enforced access control. The subnet manager configures PKey assignments, but PKeys themselves are the fundamental technology enabling secure multi-tenant InfiniBand deployments in shared AI infrastructure.
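PKey assignments are typically expressed in the subnet manager's partition configuration. The following is a minimal OpenSM partitions.conf sketch, assuming two tenants identified by port GUIDs; the partition names, PKey values, and GUIDs are hypothetical placeholders, and 0x7fff is the conventional default partition.

```
# /etc/opensm/partitions.conf (sketch; GUIDs are placeholders)
Default=0x7fff, ipoib : ALL=full ;
tenant_a=0x8001, ipoib : 0x0002c90300001111=full, 0x0002c90300002222=full ;
tenant_b=0x8002, ipoib : 0x0002c90300003333=full, 0x0002c90300004444=full ;
```

Only ports listed under a partition receive its PKey, so tenant_a and tenant_b hosts cannot address each other even though they share the physical fabric.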
Question 54 of 60
54. Question
A research team is training a 175B parameter LLM across 64 H100 GPUs distributed over 8 DGX nodes. Network bandwidth becomes a bottleneck during gradient synchronization, limiting training throughput to 35% of expected performance. Which network scaling approach would most effectively address this bottleneck?
Correct
Network scaling for large models requires high-bandwidth, low-latency interconnects optimized for GPU communication. InfiniBand NDR with GPUDirect RDMA enables direct GPU-to-GPU transfers across nodes, eliminating CPU bottlenecks. NCCL’s hierarchical all-reduce efficiently combines NVLink (900 GB/s within nodes) and InfiniBand (400 Gbps across nodes) for optimal gradient synchronization at scale, essential for training models exceeding 100B parameters across multiple nodes.
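The bandwidth impact can be estimated with a simple model. This sketch assumes the 175B-parameter model from the question with fp16 gradients and an 8-node ring for the inter-node all-reduce phase, comparing HDR (200 Gbps) against NDR (400 Gbps); the numbers are illustrative lower bounds, ignoring overlap with computation.

```python
# Ideal inter-node gradient sync time for a ring all-reduce at two
# link speeds (illustrative; real NCCL overlaps this with compute).

def ring_allreduce_seconds(grad_bytes: float, nodes: int, link_bytes_per_s: float) -> float:
    """Ring all-reduce transfers 2*(N-1)/N of the buffer over the link."""
    return 2 * (nodes - 1) / nodes * grad_bytes / link_bytes_per_s

grads = 175e9 * 2             # fp16 gradients, ~350 GB
hdr = 200e9 / 8               # 200 Gbps HDR, ~25 GB/s
ndr = 400e9 / 8               # 400 Gbps NDR, ~50 GB/s

print(f"HDR: {ring_allreduce_seconds(grads, 8, hdr):.2f} s per sync")
print(f"NDR: {ring_allreduce_seconds(grads, 8, ndr):.2f} s per sync")
```

Doubling link bandwidth halves the ideal sync time, which is why upgrading the inter-node fabric directly attacks the 35%-of-expected-throughput symptom.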
Question 55 of 60
55. Question
What is the primary purpose of EVPN (Ethernet VPN) in modern data center networks?
Correct
EVPN is a standards-based control plane technology that uses BGP to distribute MAC and IP reachability information for Layer 2 and Layer 3 VPN services. It eliminates traditional flood-and-learn mechanisms, provides efficient multitenant segmentation, and is commonly used with VXLAN encapsulation in modern data center fabrics to create scalable overlay networks.
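On a leaf switch, the EVPN control plane described above is typically enabled inside FRR's BGP configuration. This is a minimal sketch assuming BGP unnumbered peering toward a spine; the ASN and interface name are hypothetical.

```
router bgp 65101
 neighbor swp51 interface remote-as external
 address-family l2vpn evpn
  neighbor swp51 activate
  advertise-all-vni
```

Activating the l2vpn evpn address family lets BGP carry MAC/IP routes, and advertise-all-vni originates EVPN routes for every locally configured VXLAN VNI, replacing flood-and-learn.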
Question 56 of 60
56. Question
A data center team is deploying BGP on Cumulus Linux switches to establish routing between leaf and spine layers. They need to configure FRRouting to enable BGP and begin advertising networks. Which command sequence correctly enables BGP configuration in FRRouting?
Correct
FRRouting on Cumulus Linux requires accessing the vtysh shell for BGP configuration. The standard workflow involves entering vtysh, accessing configuration mode, and initiating BGP with the router bgp command specifying the ASN. This provides validated configuration with syntax checking and proper integration with FRRouting’s routing daemon, ensuring reliable BGP deployment in data center fabric architectures.
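The workflow described above looks like the following command sequence; the ASN, router ID, and advertised prefix are hypothetical placeholders.

```
sudo vtysh                      # enter the FRRouting shell
configure terminal              # enter configuration mode
router bgp 65101                # start BGP with the local ASN
 bgp router-id 10.10.10.1      # hypothetical router ID
 network 10.10.10.0/24         # advertise a connected prefix
end
write memory                    # persist the configuration
```

Because vtysh validates each line against the running daemons, syntax errors are rejected at entry time rather than discovered after deployment.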
Question 57 of 60
57. Question
What is the primary purpose of Zero Touch Provisioning (ZTP) in Cumulus Linux?
Correct
Zero Touch Provisioning (ZTP) automates the initial configuration of Cumulus Linux switches during first boot. When a switch powers on without configuration, ZTP uses DHCP to obtain network parameters and a script URL, then automatically downloads and executes provisioning scripts that install software, apply configurations, and prepare the device for production use without manual intervention.
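A ZTP provisioning script is an ordinary shell script fetched from the URL supplied via DHCP. The sketch below is a minimal example under two assumptions: that the CUMULUS-AUTOPROVISIONING marker comment is required for the switch to execute a downloaded script, and that NVUE commands are used for configuration; the hostname is a placeholder.

```
#!/bin/bash
# CUMULUS-AUTOPROVISIONING
# Minimal ZTP sketch (hypothetical values). The marker comment above
# identifies this file as a valid provisioning script.
nv set system hostname leaf01   # example NVUE command (assumed)
nv config apply -y              # apply without interactive confirmation
exit 0
```

A zero exit status tells the ZTP process that provisioning succeeded, so the switch boots into production with no manual intervention.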
Question 58 of 60
58. Question
A team is deploying distributed training for a 70B parameter LLM across 64 H100 GPUs spanning 8 DGX nodes connected via InfiniBand. The training uses NCCL 2.20 for gradient synchronization with AllReduce operations. Which collective communication algorithm should NCCL automatically select to optimize bandwidth utilization for this multi-node configuration?
Correct
NCCL automatically selects tree (hierarchical) algorithms for multi-node distributed training because they optimize for network topology structure. Tree algorithms perform local reductions within each node using fast NVLink, then reduce across nodes using InfiniBand, requiring only O(log N) inter-node hops versus O(N) for ring algorithms. For 8 DGX nodes, this means 3 inter-node hops instead of 7, significantly reducing communication time while maximizing use of high-bandwidth intra-node connections.
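The hop counts quoted above follow directly from the two communication patterns: a binary tree needs roughly log2(N) inter-node steps, while a flat ring needs N-1. A quick sketch of that arithmetic:

```python
import math

# Inter-node step counts for the two NCCL strategies compared above:
# tree reductions scale as O(log N), ring reductions as O(N).

def tree_hops(nodes: int) -> int:
    """Depth of a binary reduction tree over the given node count."""
    return math.ceil(math.log2(nodes))

def ring_hops(nodes: int) -> int:
    """Steps for one pass of a flat ring across all nodes."""
    return nodes - 1

print(tree_hops(8), ring_hops(8))  # 3 vs 7, matching the explanation
```

The gap widens quickly: at 64 nodes the tree needs 6 inter-node steps against the ring's 63, which is why NCCL prefers hierarchical algorithms once latency across nodes dominates.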
Question 59 of 60
59. Question
What is the QM8700 architecture in the context of NVIDIA Quantum switches?
Correct
The QM8700 is NVIDIA’s Quantum switch based on HDR InfiniBand technology, delivering 64 ports of 200 Gb/s connectivity. It serves as the fabric backbone for large-scale AI training clusters, enabling efficient multi-node communication through GPUDirect RDMA and supporting NCCL collective operations essential for distributed deep learning workloads on DGX systems and GPU clusters.
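The port count and per-port speed above imply the switch's unidirectional aggregate capacity, a quick sanity check worth doing when sizing a fabric:

```python
# Unidirectional aggregate capacity implied by 64 ports at 200 Gb/s each.
ports, gbps_per_port = 64, 200
aggregate_tbps = ports * gbps_per_port / 1000
print(f"{aggregate_tbps} Tb/s unidirectional")
```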
Question 60 of 60
60. Question
A network administrator needs immediate notifications when InfiniBand link degradation occurs in a multi-tenant GPU cluster to prevent training job failures. Which UFM alerting configuration approach ensures real-time incident response while minimizing false positives?
Correct
Effective UFM alerting for InfiniBand networks requires threshold-based event detection with real-time notification delivery through SMTP or similar protocols. Configuring severity levels prevents alert fatigue by escalating only critical incidents requiring immediate action. For GPU training clusters, link degradation detection must occur within seconds to enable proactive response before distributed training jobs experience communication failures and resource waste.