NVIDIA AI unveils nGPT: Hypersphere-Based Transformer Boosts AI Training 20x Faster

NVIDIA AI presents the Normalized Transformer (nGPT), a hypersphere-based transformer that improves LLM training stability and makes training 4 to 20 times faster.

NVIDIA researchers have created the Normalized Transformer (nGPT), a new architecture that makes training Transformer-based models more efficient. By building normalization directly into the model's structure, nGPT substantially cuts the time and compute needed for training without sacrificing performance.
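As an illustration of what "building normalization into the model's structure" can mean in practice, here is a minimal PyTorch sketch that re-projects weight matrices onto the unit hypersphere after each optimizer step. The function name, the choice of which matrices to normalize, and the normalization dimension are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def renormalize_weights(model: torch.nn.Module, dim: int = 1) -> None:
    # Re-project every 2-D weight matrix onto the unit hypersphere by
    # rescaling each slice along `dim` to unit L2 norm. Which matrices
    # to touch, and along which dimension, are simplifying assumptions.
    with torch.no_grad():
        for p in model.parameters():
            if p.ndim == 2:
                p.div_(p.norm(dim=dim, keepdim=True).clamp_min(1e-8))

# Hypothetical use inside a training loop, after each update:
#   loss.backward(); optimizer.step(); renormalize_weights(model)
```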

Key Innovation: Hyperspherical Representation Learning

The core idea behind nGPT is to represent every vector in the model on a hypersphere, including embeddings, the attention and MLP matrices, and the hidden states. Each layer then moves the hidden state across the surface of that hypersphere, so all of the network's representations live on the same normalized manifold. Treating training as a sequence of small steps on this hypersphere makes learning faster and more stable, as sketched below.
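A minimal sketch of one such step, in PyTorch with illustrative names: a sub-layer's output is first normalized onto the sphere, the hidden state takes a small step toward it, and the result is re-projected onto the sphere. This interpolate-then-renormalize form is a first-order approximation of moving along the sphere's surface.

```python
import torch
import torch.nn.functional as F

def hypersphere_step(h: torch.Tensor, update: torch.Tensor,
                     alpha: float = 0.05) -> torch.Tensor:
    # `h` is the current unit-norm hidden state; `update` is a
    # sub-layer's raw output; `alpha` is an assumed step size.
    target = F.normalize(update, dim=-1)   # put the layer output on the sphere
    stepped = h + alpha * (target - h)     # small step toward the target
    return F.normalize(stepped, dim=-1)    # re-project onto the hypersphere
```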

Compared to a vanilla GPT, nGPT reaches the same loss in 4 to 20 times fewer training steps. Instead of the classical approach of applying weight decay directly to the weights, the paper introduces scaling parameters that are learned during training, and the proposed architecture removes the need for additional techniques such as LayerNorm or RMSNorm. During optimization, learnable per-layer eigen learning rates control how strongly each layer's output contributes to the result, as in the sketch below.
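A hedged sketch of how such a block might look in PyTorch. The module and parameter names are illustrative, and the attention and MLP sub-modules are assumed to be given; the point is that learnable per-dimension step sizes take over the role of LayerNorm/RMSNorm by deciding how far the hidden state moves toward each sub-layer's normalized output.

```python
import torch
import torch.nn.functional as F
from torch import nn

class NGPTBlock(nn.Module):
    # Illustrative nGPT-style block: no LayerNorm or RMSNorm. Learnable
    # per-dimension step sizes (the paper's "eigen learning rates") control
    # how far the hidden state moves toward each sub-layer's normalized
    # output. `attn` and `mlp` are assumed to map
    # (batch, seq, d_model) -> (batch, seq, d_model).
    def __init__(self, attn: nn.Module, mlp: nn.Module, d_model: int):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.alpha_attn = nn.Parameter(torch.full((d_model,), 0.05))
        self.alpha_mlp = nn.Parameter(torch.full((d_model,), 0.05))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Step toward the normalized attention output, then renormalize.
        h = F.normalize(h + self.alpha_attn * (F.normalize(self.attn(h), dim=-1) - h), dim=-1)
        # Step toward the normalized MLP output, then renormalize.
        h = F.normalize(h + self.alpha_mlp * (F.normalize(self.mlp(h), dim=-1) - h), dim=-1)
        return h
```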

Experimental Results

In experiments on the OpenWebText dataset, nGPT models outperform standard GPT models, particularly in reaching lower validation loss. At longer context lengths, such as 4k tokens, nGPT matched the baseline's validation loss with far fewer training iterations. nGPT also performed well on downstream tasks, converging faster and generalizing better. The hyperspherical representation additionally improved embedding separability, which raised accuracy on standard benchmarks.

This is a significant step forward in training Transformer models: by unifying normalization and representation learning in a single framework, nGPT achieves strong results on less hardware. The design also clarifies how normalization and representation can be improved together in Transformer models, and it could be extended to larger hybrid systems with more components.
