xAI Colossus: 100,000 NVIDIA GPUs Transforming Hyperscale AI Performance
xAI’s Colossus supercomputer in Memphis now has 100,000 NVIDIA Hopper GPUs, networked with the NVIDIA Spectrum-X™ Ethernet platform, which improves performance for multi-tenant, hyperscale AI.
NVIDIA announced today that xAI’s Colossus supercomputer cluster in Memphis, Tennessee, comprising 100,000 NVIDIA Hopper GPUs, reached this massive scale by using the NVIDIA Spectrum-X™ Ethernet networking platform for its Remote Direct Memory Access (RDMA) network.
The platform is designed to improve the performance of multi-tenant, hyperscale AI factories using standards-based Ethernet.
Colossus, the world’s largest AI supercomputer, is training xAI’s Grok family of large language models, with chatbots offered as a feature for X Premium subscribers. xAI is in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs.
xAI and NVIDIA built the supporting facility and the supercomputer in just 122 days. Systems of this size typically take many months or even years to build. Training began 19 days after the first rack was placed on the floor.
While training the very large Grok model, Colossus achieves unprecedented network performance. Across all three tiers of the network fabric, the system has experienced no application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput, enabled by Spectrum-X congestion control.
Standard Ethernet cannot deliver this level of performance at scale: it creates thousands of flow collisions while delivering only 60% data throughput.
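To put the two throughput figures side by side, here is a minimal illustrative sketch, not a measurement. The 95% and 60% figures come from the text above. The 400 Gb/s line rate and the function name are assumptions chosen only to make the arithmetic concrete.

```python
def effective_throughput_gbps(line_rate_gbps: float, efficiency: float) -> float:
    """Return the bandwidth actually delivered, in Gb/s, at a given efficiency."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return line_rate_gbps * efficiency


# Hypothetical per-link line rate, used only for illustration.
LINE_RATE_GBPS = 400.0

# Throughput fractions quoted in the article.
SPECTRUM_X_EFFICIENCY = 0.95
STANDARD_ETHERNET_EFFICIENCY = 0.60

spectrum_x = effective_throughput_gbps(LINE_RATE_GBPS, SPECTRUM_X_EFFICIENCY)
standard = effective_throughput_gbps(LINE_RATE_GBPS, STANDARD_ETHERNET_EFFICIENCY)
ratio = spectrum_x / standard

print(f"Spectrum-X:        {spectrum_x:.0f} Gb/s delivered")
print(f"Standard Ethernet: {standard:.0f} Gb/s delivered")
print(f"Ratio:             {ratio:.2f}x")
```

Whatever line rate is assumed, the ratio depends only on the two percentages: 0.95 / 0.60 is roughly 1.58, so a link sustaining 95% throughput delivers about 58% more data than the same link at 60%.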
Gilad Shainer, senior vice president of networking at NVIDIA, said, “AI is becoming mission-critical and needs better performance, security, scalability, and cost-efficiency.”
“We designed the NVIDIA Spectrum-X Ethernet networking platform to help AI innovators such as xAI process, analyze, and execute AI workloads faster. This speeds up the creation, deployment, and time to market of AI solutions.”
Elon Musk said on X, “Colossus is the most powerful training system in the world. The xAI team, NVIDIA, and all of our partners and suppliers did a fantastic job.”
A representative for xAI said, “We have built the world’s biggest and most powerful supercomputer. NVIDIA’s Hopper GPUs and Spectrum-X let us push the limits of training AI models on a huge scale, making an Ethernet-based AI factory that is super-fast and optimized.”
High-Speed AI Ethernet Networking with NVIDIA
At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which is based on the Spectrum-4 switch ASIC and supports port speeds of up to 800 Gb/s. To achieve unprecedented performance, xAI paired the Spectrum-X SN5600 switch with NVIDIA BlueField-3® SuperNICs.
Spectrum-X Ethernet networking for AI brings advanced features that deliver effective, scalable bandwidth with low latency and short tail latency. These features were previously exclusive to InfiniBand.
They include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, enhanced AI fabric visibility, and performance isolation, all key requirements for multi-tenant generative AI clouds and large enterprise environments.