Discussion about this post

Hugo:

Really enjoyed this. The arc of verification evolving from simple benchmarks to reward models to hybrid human-AI loops to "living systems that resist reward hacking" is basically a compressed replay of reinforcement learning's own sixty-year history. RL went through the same progression: hardcoded rewards (Atari scores), then learned reward models (RLHF), then the realization that you need something like a world model to predict consequences and self-verify. And each stage broke the same way, per Goodhart's Law: optimize hard enough against any fixed signal and the agent finds the exploit. The line about "graders drift, benchmarks saturate, reward hacking appears" is literally the specification gaming problem RL researchers have been wrestling with since the 1990s, just showing up in a new costume.
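The Goodhart's Law failure mode above can be made concrete with a toy sketch (my own hypothetical example, not from the post): give an optimizer a fixed proxy reward that mostly tracks the true objective but has an unanticipated exploit region, and the optimizer reliably finds the exploit.

```python
import random

def true_objective(x):
    # What we actually want: x close to 1.0.
    return -abs(x - 1.0)

def proxy_reward(x):
    # A fixed, imperfect grading signal: agrees with the true objective
    # near 1.0, but has an exploit the designer never anticipated --
    # for x > 3 the proxy climbs even as the true objective collapses.
    return -abs(x - 1.0) + 2.0 * max(0.0, x - 3.0)

def optimize(reward, n=5000, lo=-10.0, hi=10.0, seed=0):
    # Brute-force random search: sample candidates, keep the proxy-best one.
    rng = random.Random(seed)
    return max((rng.uniform(lo, hi) for _ in range(n)), key=reward)

x_star = optimize(proxy_reward)
# The optimizer is pushed deep into the exploit region (x far past 3),
# where proxy reward is high but the true objective is terrible.
```

Nothing here is specific to RL; any sufficiently strong search against a fixed grader exhibits the same divergence, which is the argument for graders that adapt rather than static benchmarks.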

Which pushes toward an interesting endpoint. To verify whether an action is "correct," you need to predict its consequences. That's a world model. The RL environment companies that win are the ones that end up building the best world models of their workflows, whether they call it that or not. The market map here might actually be a subset of a much bigger one.

Ridire Research:

Great read, thank you for sharing your insights!

