Nvidia's Cosmos 3 Teaches Robots to Predict the Future

Nvidia has launched Cosmos 3, described as the first open omni-model for physical AI. The system, announced on May 31, integrates reasoning, world generation, and action into a single framework, allowing robots and autonomous vehicles to anticipate events in unpredictable environments.

Cosmos 3 can generate predictive video sequences up to 30 seconds long from text, image, or video inputs. This lets a robot effectively visualize the consequences of its actions before moving.

Built on a Mixture of Transformers architecture, the model processes sight, sound, and action simultaneously. It serves as the backbone for Nvidia's World Action Models, enabling machines to operate in unfamiliar environments.

Early adopters include Figure AI and Agility Robotics for humanoid robots, and Uber, Waabi, and Wayve for self-driving technology. The open-source nature of Cosmos 3 could accelerate development for smaller firms, reducing the need for costly real-world training data.