NVIDIA Spectrum-X Networking Powers xAI’s Colossus Supercomputer

by Harold Fritts

NVIDIA has revealed that xAI’s Colossus supercomputer, boasting a remarkable 100,000 NVIDIA Hopper Tensor Core GPUs, is now fully operational in Memphis, TN. This achievement was made possible through NVIDIA’s Spectrum-X™ Ethernet networking platform, designed to deliver robust performance for hyperscale, multi-tenant AI data centers. Spectrum-X uses standards-based Ethernet with RDMA networking to ensure efficient communication and optimized data handling within these large-scale environments.

As the world’s largest AI supercomputer, Colossus currently powers the training of xAI’s Grok language model family, which includes chatbot functionalities for X Premium subscribers. xAI plans to expand Colossus to 200,000 NVIDIA Hopper GPUs, reinforcing its status as a premier AI computing resource. xAI and NVIDIA built the facility and its computing infrastructure in a record 122 days, whereas systems of this size typically take many months to years. Colossus began training operations 19 days after the first rack was installed.

Colossus is achieving exceptional network performance while training large-scale models, benefiting from Spectrum-X’s congestion control and flow handling. The system has experienced zero latency degradation or packet loss due to flow collisions and has maintained 95% data throughput, a significant improvement over traditional Ethernet, which typically delivers only 60% data throughput and suffers frequent flow collisions.
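To put the 95% and 60% figures in perspective, the short calculation below converts them into usable bandwidth on a single link. It is only an illustrative sketch: the 400 Gb/s line rate is an assumed value picked for the arithmetic, not a number reported for Colossus, and the function name is hypothetical.

```python
def effective_gbps(line_rate_gbps: float, utilization: float) -> float:
    """Usable bandwidth on a link running at a given fraction of line rate."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return line_rate_gbps * utilization


# Assumed per-link line rate, chosen only for illustration.
LINE_RATE_GBPS = 400.0

spectrum_x = effective_gbps(LINE_RATE_GBPS, 0.95)   # 95% figure from the article
traditional = effective_gbps(LINE_RATE_GBPS, 0.60)  # 60% figure from the article
ratio = spectrum_x / traditional                    # independent of the line rate

print(f"Spectrum-X:  {spectrum_x:.0f} Gb/s usable")
print(f"Traditional: {traditional:.0f} Gb/s usable")
print(f"Ratio:       {ratio:.2f}x")
```

Whatever line rate is assumed, the ratio stays at 95/60, or roughly 1.58 times the usable bandwidth per link.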

The key advance in NVIDIA’s Spectrum-X implementation is how it handles network congestion in this massive GPU cluster. Traditional Ethernet networks struggle with the “incast” problem, in which many senders transmit to the same receiver at once and overflow the switch buffer in front of it, leading to packet drops and significant performance degradation. While InfiniBand traditionally solved this with its built-in credit-based link-level flow control and hardware-level congestion management, Spectrum-X achieves similar results using RoCE v2 with enhanced congestion control mechanisms. This allows xAI to maintain InfiniBand-like performance characteristics while retaining the cost benefits and flexibility of standard Ethernet infrastructure.
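The incast problem can be illustrated with a toy queue model. The sketch below is not Spectrum-X’s congestion control algorithm; it only shows why a lossy Ethernet port with a finite buffer drops packets when many senders burst toward one receiver with no congestion control at all. The function name and every parameter value are hypothetical.

```python
def simulate_incast(senders: int, burst_pkts: int,
                    buffer_pkts: int, drain_per_tick: int) -> tuple[int, int]:
    """Return (delivered, dropped) for synchronized bursts into one egress port.

    Each tick, every sender that still has data offers one packet. The port
    queues packets while buffer space remains, drops the rest (lossy Ethernet,
    no congestion control), then forwards up to drain_per_tick packets.
    """
    remaining = [burst_pkts] * senders
    queue = delivered = dropped = 0
    while any(remaining) or queue:
        for i in range(senders):
            if remaining[i]:
                remaining[i] -= 1
                if queue < buffer_pkts:
                    queue += 1      # packet fits in the buffer
                else:
                    dropped += 1    # buffer full: tail drop
        sent = min(queue, drain_per_tick)
        queue -= sent
        delivered += sent
    return delivered, dropped


# One sender never overflows the buffer; 32 synchronized senders do.
for n in (1, 32):
    delivered, dropped = simulate_incast(
        senders=n, burst_pkts=10, buffer_pkts=64, drain_per_tick=1)
    print(f"{n:>2} senders: delivered={delivered}, dropped={dropped}")
```

In a real fabric, each dropped packet must be retransmitted, which stalls the collective operation that was waiting on it. Congestion control aims to slow the senders before the buffer fills rather than after.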

Spectrum-X’s adaptive routing and Direct Data Placement capabilities create a resilient network fabric that can handle the massive east-west traffic patterns typical in distributed AI training workloads. The result is a system that maintains consistent low latency and high throughput even when all 100,000 GPUs actively participate in collective operations.
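The flow collisions that adaptive routing is meant to avoid can be sketched with a toy load-balancing comparison. The code below contrasts static hash-based path selection, where every packet of a flow is pinned to one uplink, with a simplified per-packet policy that always picks the least-loaded uplink. It is a minimal model under stated assumptions, not NVIDIA’s algorithm: the flow names, packet counts, link count, and the use of CRC32 as the hash are all hypothetical.

```python
import zlib


def static_hash_loads(flows: dict[str, int], num_links: int) -> list[int]:
    """Pin each flow to one uplink by hashing its ID (ECMP-style)."""
    loads = [0] * num_links
    for flow_id, packets in flows.items():
        link = zlib.crc32(flow_id.encode()) % num_links
        loads[link] += packets
    return loads


def adaptive_loads(flows: dict[str, int], num_links: int) -> list[int]:
    """Send each packet to whichever uplink currently carries the least load."""
    loads = [0] * num_links
    for packets in flows.values():
        for _ in range(packets):
            link = loads.index(min(loads))
            loads[link] += 1
    return loads


# Eight equally large flows over four uplinks (illustrative values).
flows = {f"gpu{i}->gpu{i + 8}": 1000 for i in range(8)}

static = static_hash_loads(flows, num_links=4)
adaptive = adaptive_loads(flows, num_links=4)

print("static hash:", static, "busiest link:", max(static))
print("adaptive:   ", adaptive, "busiest link:", max(adaptive))
```

With only a handful of large flows, a static hash can easily place two of them on the same uplink while another sits idle, whereas per-packet selection keeps the links even. The trade-off is that packets from one flow now travel different paths and can arrive out of order, which is the problem Direct Data Placement addresses at the receiver.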

Gilad Shainer, NVIDIA’s senior vice president of networking, emphasized that “AI is mission-critical” and requires a combination of performance, security, scalability, and cost-efficiency. He highlighted how NVIDIA’s Spectrum-X platform enables companies like xAI to accelerate processing, analysis, and execution for AI workloads, resulting in faster development and deployment of AI solutions.

An xAI spokesperson credited NVIDIA’s Hopper GPUs and Spectrum-X technology, citing the system’s scale and performance as critical to enabling an optimized AI “factory” based on Ethernet standards.

Central to Spectrum-X is the Spectrum SN5600 Ethernet switch, which is built on the Spectrum-4 switch ASIC and supports port speeds up to 800Gb/s. xAI paired this switch with NVIDIA’s BlueField-3® SuperNICs, achieving performance levels previously exclusive to InfiniBand. Spectrum-X Ethernet networking introduces features such as adaptive routing with Direct Data Placement, sophisticated congestion control, and improved AI fabric visibility and performance isolation, meeting the demanding requirements of multi-tenant AI environments and enterprise-level AI deployments.
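The idea behind Direct Data Placement can be shown in a few lines. When packets of one message take different paths, they may arrive out of order; if each packet identifies where its payload belongs, the receiver can write it straight to that position instead of buffering and re-sorting. The sketch below is a simplified illustration, not the RoCE wire format or the SuperNIC’s implementation: the `(offset, payload)` tuples, the 64-byte chunk size, and the function name are hypothetical.

```python
import random


def place_packets(packets: list[tuple[int, bytes]], total_len: int) -> bytes:
    """Write each payload at its stated offset, regardless of arrival order."""
    buf = bytearray(total_len)
    for offset, payload in packets:
        buf[offset:offset + len(payload)] = payload
    return bytes(buf)


# Split a 1 KiB message into 64-byte chunks tagged with their offsets.
message = bytes(range(256)) * 4
CHUNK = 64
packets = [(off, message[off:off + CHUNK]) for off in range(0, len(message), CHUNK)]

# Simulate out-of-order arrival caused by per-packet routing.
random.Random(0).shuffle(packets)

received = place_packets(packets, len(message))
print("message intact:", received == message)
```

Because placement depends only on the offset, the receiver needs no reordering queue, which is what lets the switch spread packets of a single flow across many paths.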
