World Models: the Next Phase of Physical AI

How simulators, robots, and cryptography will reshape the real-world economy

Nov 24, 2025

To paraphrase Moravec’s paradox, winning a game of chess or discovering a new drug is an “easy” problem for AI, but folding a shirt, clearing a table, or baking a cake requires solving some of the messiest problems in perception, dexterous control, and common sense about the physical world.

Most of today’s AI is essentially a very well-read intern.

It has digested our code, contracts, PDFs, and slide decks. It can summarize, refactor, and draft faster than any human. That intern is now extremely good at “desk work”. But it is useless at unloading a truck, driving a semi through a blizzard, or cleaning a stranger’s kitchen.

Transformers act as a universal office copilot, predicting the next word in a sentence with superhuman accuracy. But they have no stable notion of 3D space, physics, or causality. They hallucinate because they are juggling fuzzy patterns rather than an explicit state of the world.

As Christopher Mims at the Wall Street Journal recently noted, a 1979 chess program on an Atari 2600 can still beat many frontier chatbots at legal chess. The Atari code keeps a tiny, perfect internal world model. The chatbot is just guessing based on vibes.

We believe the next decade of promising companies will come from Physical AI and systems powered by World Models.

From Predicting Tokens to Predicting Physics

If an LLM asks, “What symbol comes next?”, a World Model asks, “What happens if I take this action?”

A World Model is effectively a physics engine for intelligence. It maintains an internal simulation of how the environment evolves. It allows an agent to simulate a future, test a plan, and then act.
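
To make the "simulate a future, test a plan, then act" loop concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `world_model` is a hand-written 1-D cart rather than a learned model, and the planner is plain random shooting (sample candidate action sequences, imagine each one, keep the cheapest). Real systems would swap in a learned model and a much better planner.

```python
import random

def world_model(state, action, dt=0.1):
    """Hypothetical dynamics: a 1-D cart with (position, velocity)."""
    pos, vel = state
    vel = vel + action * dt      # the action is an acceleration
    pos = pos + vel * dt
    return (pos, vel)

def rollout_cost(state, actions, goal):
    """Imagine the future under a candidate plan and score where it ends."""
    for a in actions:
        state = world_model(state, a)
    pos, vel = state
    return abs(pos - goal) + 0.1 * abs(vel)

def plan(state, goal, horizon=10, candidates=200, rng=None):
    """Random shooting: sample plans, simulate each one, keep the best."""
    rng = rng or random.Random(0)
    best_plan, best_cost = None, float("inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = rollout_cost(state, actions, goal)
        if cost < best_cost:
            best_plan, best_cost = actions, cost
    return best_plan, best_cost

def run_episode(start=(0.0, 0.0), goal=1.0, steps=30):
    """Plan, execute only the first action, then replan from the new state."""
    rng = random.Random(0)
    state = start
    for _ in range(steps):
        actions, _ = plan(state, goal, rng=rng)
        state = world_model(state, actions[0])
    return state
```

Calling `run_episode()` drives the cart from position 0 toward the goal at 1.0 without ever hard-coding "how to drive there". The agent only needs a model it can query with "what happens if I take this action?".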

Foundation Models (like GPT-4) learn a prior over text.
World Foundation Models (like Nvidia Cosmos or AIMultiple’s examples) learn a prior over video and physics.

This is the missing piece for general-purpose robotics. Instead of hard-coding a robot to pick up a specific cup, you give it a World Model that understands mass, friction, and geometry.

The Uncanny Valley of Physics: My Experiment

I recently tested this distinction using World Labs’ generative tools. I took a photo of my favorite beach, Keawakapu: crisp horizon, distinct wave fronts, wet sand. I asked the model to animate it (“dream” the next few seconds).

At a glance, the result was stunning. The colors were perfect. The vibe was tropical. But when I looked closer, the physics was broken:

Wave fronts didn’t crash; they smeared into each other like plastic sheets.
The shoreline blurred; wet and dry sand became a single muddy texture.
The depth was confused, warping the horizon.

The model nailed the aesthetic but failed the simulation. If a robot boat used that internal model to navigate, it would capsize.

Artia put it elegantly: we are seeing two different “World Model” regimes emerge:

1) Game-like worlds that mimic physics. Companies like generalintuition.com and worldlabs.ai are good examples since they effectively train on massive archives of gameplay footage and video to produce environments that look and feel right. That’s great for environment generation and storytelling but not directly useful for Physical AI.

2) Robot-grade environments with robot-complete, simulated physics. Here you use engines designed to simulate real-world dynamics well enough that a learned policy can actually survive sim2real transfer. This is far more compute-intensive and games can’t afford to render at this level of physical fidelity.

As Somi put it, AI thrives wherever it has dense rewards. We can simulate billions of games of chess and grade every move. We can’t do that for a new cake recipe or a warehouse layout without actually baking cakes or retooling facilities and asking thousands of people for feedback. Rewards in the physical world are bottlenecked. Massive simulation – world models with good physics and rich feedback – is how you un-bottleneck them.

And this illustrates the current gap: generative video is great for content creation, but simulation requires fidelity. We need models that don’t just look real, but act real.

Why Now? The Convergence

We are moving from “niche research” to “platform layer” thanks to three shifts:

Video Tech Has Matured: Diffusion transformers can now generate consistent sequences at interactive frame rates. That said, interactive frame rates are not the same thing as the latency requirements of robotics and other real-world use cases.
The “Cosmos” Moment: Tech majors are productizing this. Nvidia’s Cosmos platform is a “World Twin” that allows robots to live millions of lifetimes in simulation before touching a real object.
Generalist Robots: Companies like Physical Intelligence are proving that a single “vision-language-action” model (like their π0 policy) can fold laundry and bus tables.

The Investment Thesis

We are looking for the infrastructure that turns the messy physical world into a computable problem. We are moving away from “we trained a model for our game” and toward these high-value categories:

The “Physics Engine” for Reality
We are long on open, programmable simulators and the toolchains that support them. Think of this as the “Weights & Biases” or “Databricks” for spatial intelligence. We need debuggers (like those proposed by the PAN team) and safety layers that help developers trust a World Model.
Robot Foundation Models (VLAs)
We are backing teams building generalist policies. The value isn’t in a robot that can fold a shirt per se (although this laundry/folding example is quite overused). It’s in a model that understands “folding” well enough to handle a shirt, a towel, a tarp, a tent, and so on.
Sim-to-Real Bridges
Simulation is never perfect. As noted in Critiques of World Models, models can drift or hallucinate physics over long horizons. There is a massive opportunity in the layer that handles this “sim-to-real” gap, allowing agents to learn in safety (sim) but deploy with confidence (real). As Artia points out, the harder or more critical the task, the higher the simulation fidelity you need. For example:
- record simulations at ~10Hz (fps) to train a robot to pick up a mug
- record simulations at ~100Hz+ (fps) to simulate appendix surgery
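
One rough way to see why rate matters: the toy sketch below steps the same system at 10 Hz and at 100 Hz and measures how far each run drifts from the known exact answer. The setup is an assumption chosen for simplicity (a unit mass on a unit spring, integrated with explicit Euler), and it uses the simulation step rate as a stand-in for the recording rates above. It is not a model of mug grasping or surgery.

```python
import math

def max_error(rate_hz, duration_s=10.0):
    """Explicit-Euler simulation of x'' = -x starting at x=1, v=0,
    compared step by step with the exact solution x(t) = cos(t)."""
    dt = 1.0 / rate_hz
    steps = round(duration_s * rate_hz)
    x, v = 1.0, 0.0
    worst = 0.0
    for n in range(1, steps + 1):
        # Both updates use the previous state (right-hand side is evaluated first).
        x, v = x + v * dt, v - x * dt
        worst = max(worst, abs(x - math.cos(n * dt)))
    return worst

err_10hz = max_error(10)     # coarse stepping: the simulated spring gains energy
err_100hz = max_error(100)   # finer stepping: much smaller drift over the same 10 s
```

Over the same ten simulated seconds, the 10 Hz run drifts far more than the 100 Hz run. That is the "hallucinated physics over long horizons" problem in miniature, and it is why higher-stakes tasks push toward higher-rate, more expensive simulation.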

Cryptographic rails for Decentralized World Models

We also believe cryptographic primitives will be key to how this stack scales and who controls it. A few areas stand out.

Shared training networks for world models

Nodes run world-model rollouts or training steps and submit proofs of useful work.
A protocol pays them in a native token when their work verifies.
This is Proof of Useful Work tuned to simulation: very long-horizon driving, robotics, climate, or city sims that no single lab can afford.

Cryptography matters because it gives us verifiable work, open weights, and transparent contribution accounting, similar in spirit to what Gensyn is doing for gradients, but focused on world-model trajectories.
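
As a rough illustration of "pay when the work verifies", the sketch below has a worker commit to a rollout with a hash and a verifier check it by re-running the same seeded task. All names here (`rollout`, `submit_work`, `verify_and_pay`) are hypothetical, the "rollout" is a seeded random walk standing in for a real world-model trajectory, and verification by full re-execution is the naive baseline. It is not how any particular protocol works, and a production design would want checks far cheaper than redoing the work.

```python
import hashlib
import json
import random

def rollout(seed, steps=100):
    """Stand-in for a world-model rollout: a seeded 1-D random walk."""
    rng = random.Random(seed)
    pos, trajectory = 0.0, []
    for _ in range(steps):
        pos += rng.uniform(-1.0, 1.0)
        trajectory.append(round(pos, 6))
    return trajectory

def commit(trajectory):
    """SHA-256 commitment to a trajectory."""
    payload = json.dumps(trajectory).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def submit_work(seed):
    """What a worker node sends: the task seed plus its claimed commitment."""
    return {"seed": seed, "commitment": commit(rollout(seed))}

def verify_and_pay(submission, balances, worker, reward=10):
    """Verifier re-runs the task and pays only if the commitments match."""
    expected = commit(rollout(submission["seed"]))
    if submission["commitment"] == expected:
        balances[worker] = balances.get(worker, 0) + reward
        return True
    return False
```

An honest submission from `submit_work(42)` passes `verify_and_pay` and credits the worker. A submission with a made-up commitment fails and earns nothing.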

Marketplaces for embodied data

Warehouse operators, fleets, hospitals, and cities contribute sensor and video logs into encrypted pools.
A protocol meters how downstream world models use that data and routes rewards back to contributors.

On-chain accounting + off-chain storage, privacy tech like TEEs or homomorphic encryption, and reputation systems give us an honest “data DAO” for the physical world. Crypto primitives let you pay for real-world logs without handing them to a single vendor.
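
The metering-and-routing step can be sketched in a few lines. This is a simplified assumption of how such accounting might look: the access log format, the contributor names, and the pro-rata split are all hypothetical, and the encryption, TEE, and on-chain pieces are left out entirely.

```python
from collections import Counter

def meter_usage(access_log):
    """Count how many times each contributor's records were read."""
    return Counter(entry["contributor"] for entry in access_log)

def route_rewards(access_log, pool):
    """Split a reward pool pro rata by metered usage."""
    usage = meter_usage(access_log)
    total = sum(usage.values())
    if total == 0:
        return {}
    return {who: pool * count / total for who, count in usage.items()}

# Hypothetical log: a downstream model read three warehouse clips and one fleet clip.
log = [
    {"contributor": "warehouse_a", "record": "clip_001"},
    {"contributor": "warehouse_a", "record": "clip_002"},
    {"contributor": "fleet_b", "record": "drive_17"},
    {"contributor": "warehouse_a", "record": "clip_003"},
]
payouts = route_rewards(log, pool=100)
# payouts == {"warehouse_a": 75.0, "fleet_b": 25.0}
```

Counting reads is the crudest possible meter. A real marketplace would likely weight by how much each record actually improved the downstream model, which is a much harder attribution problem.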

Digital twin registries and “proof of world”

Each digital twin (warehouse, port, mall, wind farm) gets a canonical on-chain ID, version history, and ownership.
Builders publish upgrades and layout changes as transactions.
Agents and models reference twins by ID instead of ad hoc files.

On top of that, you can use zk-style proofs to attest that a given policy was evaluated across N scenarios in a specific twin with specific parameters. Insurers, regulators, and counterparties can verify claims like “this forklift policy passed 10,000 crash tests” without rerunning them.
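
A minimal sketch of the registry idea, under loud assumptions: the registry here is an in-memory Python object standing in for a chain, the record fields are invented for illustration, and `evaluation_claim` only produces a hash-bound claim. It is not a zero-knowledge proof, so a verifier would still have to trust the evaluator or re-run the scenarios.

```python
import hashlib
import json
from dataclasses import dataclass, field

def digest(obj):
    """Deterministic SHA-256 over a JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()

@dataclass
class TwinRegistry:
    """Append-only stand-in for an on-chain registry of digital twins."""
    history: dict = field(default_factory=dict)  # twin_id -> list of version records

    def publish(self, twin_id, owner, layout):
        """Record a new version, chained to the previous version's hash."""
        versions = self.history.setdefault(twin_id, [])
        prev = versions[-1]["hash"] if versions else None
        record = {"twin_id": twin_id, "owner": owner, "layout": layout, "prev": prev}
        record["hash"] = digest(record)  # hash is computed before the field is added
        versions.append(record)
        return record["hash"]

    def latest(self, twin_id):
        return self.history[twin_id][-1]["hash"]

def evaluation_claim(twin_hash, policy_bytes, scenarios_passed, params):
    """Bind an evaluation result to an exact twin version and policy."""
    claim = {
        "twin": twin_hash,
        "policy": hashlib.sha256(policy_bytes).hexdigest(),
        "scenarios_passed": scenarios_passed,
        "params": params,
    }
    claim["claim_id"] = digest(claim)
    return claim
```

The useful property is that a claim like "passed 10,000 scenarios" names one specific twin version and one specific policy by hash, so nobody can quietly swap the layout or the weights after the fact. Making the claim verifiable without re-running it is the part that would need the zk-style machinery.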

Agent economies inside shared worlds

Agents have wallets, identities, and policies.
They buy sim time, sell services (routing, picking plans, maintenance schedules), and compete or cooperate inside registered twins.

Crypto gives them native incentives, clear ownership of IP, and auditable flows of value between whoever provides the world, the data, the compute, and the behavior.

These are still early ideas, but they rhyme with what we have already seen in DeFi and decentralized compute. The difference is that the scarce resource is faithful and trustworthy models of the real world.

The Road Ahead

Transformers made AI book smart. World models will give AI physical intuition: a sense of space, time, and consequence.

This transition will be a grind. It involves:

massive heterogeneous data collection (simulation data, teleop data, egocentric data etc.),
simulator engineering, and
navigating the complexities of labor and safety in the physical world.

But if you care about the real economy (manufacturing, logistics, transport, and healthcare), this is where the puck is going.

Special thanks to these folks for helping me with this post: Artia Moghbel, Somi Agarwal, ExcelMaxi, The Virtuals Team, Sven Wellman, Danny Sursock, Prof Sriram Vishwanath, Naman Kapasi

Canonical
