Five Things | Reasoning, Evaluation, and Efficiency

Robotics reasoning, world-model evaluation, and efficiency-first AI systems

Anthony Avedissian, Anand Iyer, and Canonical

Dec 19, 2025

Each week, we share a small collection of ideas, conversations, and artifacts that shaped our internal thinking. Inspired by experiments like USV’s Librarian, this series is powered by an AI assistant that helps synthesize recurring themes from our discussions, alongside our own reflections.

NVIDIA’s R²D² work shows how combining simulation with language and vision-language models can improve robot manipulation by separating high-level reasoning from low-level action. Using sim-and-real co-training and generative tool design, it points toward more generalizable, long-horizon robot behavior. As robots move into real-world environments, this reasoning-first, simulation-driven approach may become foundational.

OpenRouter’s State of AI reveals a multi-model world where closed models dominate high-stakes, premium workloads, while open models capture massive volume in cost-sensitive tasks like coding and roleplay. Medium-sized models are emerging as the real winners, balancing capability and efficiency, while agentic inference is quickly becoming the default interaction pattern. What stood out most to us is the Cinderella “glass slipper” effect: durable retention appears only when a model is the first to unlock a previously impossible workload.

As DeFi lending has become more modular, risk has shifted from base protocols to a new curator layer built on top of shared infrastructure. While primitives like Aave now sit closer to the infrastructure layer, underwriting, leverage, and asset selection are increasingly handled by third-party vault managers and strategy curators, concentrating risk above the rails. Our takeaway: the next frontier in DeFi isn’t new lending primitives, but better transparency and tooling around how risk is packaged, disclosed, and redistributed.

New work from DeepMind shows how large video world models can be used to evaluate robot policies on performance, generalization, and safety without hardware testing or physics engines. What’s compelling is that the video model retains common-sense world knowledge even after fine-tuning, allowing it to surface failure modes and safety issues before anything breaks. If this holds up, world-model-based evaluation could become a critical layer for safely scaling robotics.

Anirudha Majumdar (@Majumdar_Ani)

Generalist robots need a generalist evaluator. But how do you test safety without breaking things? 💥 🌎 Introducing our new work from @GoogleDeepMind: Evaluating Gemini Robotics Policies in a Veo World Simulator veo-robotics.github.io 🧵👇

5:02 PM · Dec 12, 2025


NVIDIA released Nemotron 3, an open model family designed to push the accuracy–inference efficiency frontier. Using a hybrid MoE architecture with techniques like LatentMoE and multi-token prediction, it brings far more model capacity to bear at roughly the same inference cost. As agentic workloads grow, this kind of efficiency-first systems design feels increasingly central to how open models stay competitive.

We’ll share another edition next week.

