Really enjoyed this. The evolution of verification from simple benchmarks to reward models to hybrid human-AI loops to "living systems that resist reward hacking" is basically a compressed replay of reinforcement learning's own decades-long history. RL went through the exact same arc: hardcoded rewards (Atari scores), then learned reward models (RLHF), then the realization that you need something like a world model to predict consequences and self-verify. And each stage broke the same way: Goodhart's Law. Optimize hard enough against any fixed signal and the agent finds the exploit. The line about "graders drift, benchmarks saturate, reward hacking appears" is literally the specification gaming problem RL researchers have been wrestling with since the 1990s, just showing up in a new costume.
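The "optimize against a fixed signal and the agent finds the exploit" failure mode can be shown in a few lines. This is a toy sketch I'm adding for illustration (none of these functions come from the post): a greedy optimizer climbs a fixed proxy reward that only locally tracks the true objective, and ends up in a state the true objective hates.

```python
def true_objective(x):
    # What we actually care about: peaks at x = 2.
    return -(x - 2) ** 2

def proxy_reward(x):
    # Fixed, imperfect stand-in: agrees with the true objective
    # about which direction is "better" near the start, but keeps
    # rewarding larger x forever -- an open invitation to exploit.
    return x

def hill_climb(reward, x=0.0, step=0.5, iters=20):
    # Greedy optimizer: follow the proxy wherever it leads.
    for _ in range(iters):
        if reward(x + step) > reward(x):
            x += step
    return x

x_star = hill_climb(proxy_reward)
print(x_star, true_objective(x_star))  # 10.0 -64.0
```

The optimizer blows straight past x = 2: proxy score keeps rising while the true objective collapses, which is Goodhart's Law in miniature.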
Which pushes toward an interesting endpoint. To verify whether an action is "correct," you need to predict its consequences. That's a world model. The RL environment companies that win are the ones that end up building the best world models of their workflows, whether they call it that or not. The market map here might actually be a subset of a much bigger one.
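To make the "verification is prediction" claim concrete, here is a minimal hand-rolled sketch (my own illustrative names, not any established API): instead of scoring an action directly, a verifier asks a world model what state the action would produce and checks that predicted state against the goal.

```python
def world_model(state, action):
    # Predict the consequence of an action. A real system would
    # learn this model; here it's a toy transition table.
    transitions = {
        ("db_empty", "insert"): "db_one_row",
        ("db_one_row", "delete_all"): "db_empty",
    }
    return transitions.get((state, action), "unknown")

def verify(state, action, goal):
    # An action is "correct" only if its predicted consequence
    # reaches the goal -- verification reduces to prediction.
    return world_model(state, action) == goal

print(verify("db_empty", "insert", "db_one_row"))      # True
print(verify("db_empty", "delete_all", "db_one_row"))  # False
```

The quality of `verify` is bounded by the quality of `world_model`, which is the comment's point: whoever has the best model of the workflow has the best verifier.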
Thanks for writing this, it clarifies a lot. The semiconductor analogy is incredibly insightful. It really makes me reflect on how this 'verification problem' keeps reappearing. This feels like a natural next step after your piece on AI scaling; it's the crucial maturity layer we need for agentic AI.
Insightful comment! Thank you.
Great read thank you for sharing your insights!
well said! completely agree.
Great writeup! Loved it