Will Scaling Reasoning Models Like o3 and R1 Unlock Superhuman Reasoning?

Posted on Jan 26, 2025, by @chrisbarber
I asked: As we scale up training and inference compute for reasoning models, will they show:

A) Strong general logical reasoning skills that work across most logically-bound tasks,
B) Some generalization to other logic tasks, but perhaps requiring domain-specific retraining, or
C) Minimal generalization?
Responses
Finbarr Timbers
Ex-DeepMind
We'll achieve superhuman performance on specific tasks with verifiable rewards. I see no evidence for general transfer, but it seems extremely plausible.
Gwern Branwen
Independent Researcher
"Everyone neglects to ask, what are we scaling?" Depends on what data they scale up on. The more you scale up on a few domains like coding, the less I expect transfer, as they become ultra-specialized.
Jacob Buckman
Founder of Manifest AI, ex-Google Brain
Generalization can't really be predicted like that except empirically. All I know is that as you add more compute and data, you go from minimal transfer to some transfer to broad transfer. I have no clue where on that spectrum we stand when we run out of compute or data.
Karthik Narasimhan
Co-author of the GPT paper; Sierra
I expect some generalization with domain-specific retraining.
Near
Independent
I think the "spikiness" of intelligence will continue to be notable (models which are extremely good at some things yet quite 'dumb' at others), but that it is easy to improve generalization in the areas we care about, since it just requires some data/RL fun.
Nathan Lambert
Post-Training Lead at Allen AI
New models trained to reason heavily about every subject will come to have better average performance than standard autoregression. In domains with explicit verifiers, this performance will be superhuman; in domains without, reasoning will still enable better performance, but maybe not more economical performance.
Pengfei Liu
Shanghai Jiao Tong University
Increased compute and inference time will drive reasoning capabilities to expert-level performance where rich feedback loops exist. However, the development of general reasoning will be gated by two factors: the availability of problems requiring genuine deep thinking, and access to high-quality expert cognitive process data or well-defined reward signals.
Ross Taylor
Led reasoning at Meta AI
I think general reasoning will come fairly quickly. Right now it's easier to scale in domains where problems are easy to verify with an external signal. The generalisation will come if models themselves become good verifiers across domains.
Shannon Sands
Nous Research
There's at least some generalisation to other tasks like logic puzzles, but it might require more domain-specific training to improve on many more out-of-domain tasks.
Steve Newman
Co-founder of Google Docs
This is a trillion-dollar question. If I had to guess: we'll see some transfer of reasoning skills across domains, but (on anything resembling current architectures) some specialized training will be needed in each domain. We'll learn a lot one way or another this year.
Tamay Besiroglu
Co-founder of Epoch AI
I think minimal transfer is wrong because reasoning is a very general skill that you can apply to perform a wide range of actions. Planning, for instance, is something that requires good reasoning.
Teortaxes
Independent
I think there will be a period of strong 'natively verifiable reasoning overhang', which translates to more general verifiers using models' strong coding ability and general knowledge+tools; then they grok more general regularities of sound reasoning, and the next generation can natively generate good reasoning data for all domains.
Xeophon
Independent
We will see some generalization to other domains the model was not explicitly trained on. For example, R1 writes better and more creative stories than V3, the model it is based on. To push this further, models need to be trained on more data in other domains.
Chris Barber
Creator of this post
Synthesis: the experts' takes point to generalization across logically-bound domains where we can construct verifiers for now, trending toward broad transfer in the future. More notes from experts: @chrisbarber.

Implications & Follow-up Discussion

The important question is to what extent the verifiers or judge models can be cheaply set up for each new domain. If you can quickly 'o3-ize' every important domain, particularly with general-purpose coders & mathematician expert models, then the scaling can continue cheaply.
Gwern Branwen
Independent Researcher
Verification will become a test-time compute problem -> the longer you spend checking a solution, the more accurate the verification signal should be. Then the question will be: how do you verify the verifiers? I suspect we'll end up in a world where multiple agents are checking each others' work (a bit like how the research community works). Not clear if we're at the level of model capability for this to work yet, but I wouldn't bet against it!
Ross Taylor
Led reasoning at Meta AI
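
To make the test-time-compute framing concrete, here's a minimal sketch (my illustration, not Ross's; `judge_once` is a hypothetical stand-in for a single noisy LLM-judge call): sampling more independent judgments buys a lower-variance verification signal.

```python
# Sketch of "verification as a test-time compute problem": aggregate k
# independent, noisy verifier judgments. judge_once is a hypothetical stub
# standing in for one LLM-judge call with ~70% accuracy.
import random

random.seed(0)  # deterministic demo

def judge_once(solution: str) -> bool:
    """One noisy verifier call: returns the true verdict only 70% of the time."""
    truly_correct = "valid" in solution  # toy ground truth for this stub
    return truly_correct if random.random() < 0.7 else not truly_correct

def verify(solution: str, k: int = 1) -> bool:
    """Majority vote over k judgments: more test-time compute, less noise."""
    votes = sum(judge_once(solution) for _ in range(k))
    return 2 * votes > k

# A single call is wrong ~30% of the time; 31 calls are almost never wrong.
print(verify("a valid proof", k=1))   # unreliable
print(verify("a valid proof", k=31))  # almost certainly True
```

The catch, and the point of "verifying the verifiers": majority voting only helps when the judges' errors are roughly independent, which is one reason Ross's picture has multiple different agents checking each other's work.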

Is There a Ceiling?

I see no reason for reasoning models to plateau around world-expert level. The existing paradigm for reasoning models looks like RLVR (reinforcement learning with verifiable rewards), where the model learns to solve tasks with deterministic/verifiable rewards. No reason that has any limit around human levels. Look at DeepSeek R1, for instance.
Finbarr Timbers
Previously DeepMind
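
For readers unfamiliar with RLVR, here is a minimal sketch of what a deterministic, verifiable reward looks like (my illustration, not Finbarr's; `extract_final_answer` and the \boxed{} answer convention are assumptions modeled on common math-benchmark setups):

```python
# Minimal sketch of a verifiable reward in the RLVR sense: the reward is a
# deterministic function of the model's output, with no learned judge.
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a chain-of-thought completion."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """1.0 iff the extracted answer exactly matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

# This scalar is what the RL loop maximizes; nothing in the signal is anchored
# to human-level performance, which is the point about the lack of a ceiling.
print(verifiable_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(verifiable_reward("the answer is probably 42", "42"))         # 0.0
```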
In general, search can surpass humans. This was clear even pre-deep-learning, e.g. with "classical" chess AI. But most real-world domains don't have the nice properties (cheap to simulate, clear success condition) that make them amenable to search. The bottleneck becomes rate-of-search: you won't be able to improve performance faster per step of search than the expert.
Jacob Buckman
Founder of Manifest AI, ex-Google Brain

Follow-up Questions

Tamay suggested a good follow-up would be to "elicit ideas for experiments that they would expect to turn out one way conditional on 'Weak transfer' and another if 'Strong transfer' is correct". Let me know if you have ideas.

Questions

To answer, tag or DM me at @chrisbarber, or email me at [email protected].

Thank you to Amir Haghighat, Arun Rao, Ash Bhat, Avery Lamp, Charlie Songhurst, Connor Mann, Daniel Kang, Dhruv Singh, Eric Jang, Ethan Beal-Brown, Finbarr Timbers, Flo Crivello, Griffin Choe, Gwern, Herrick Fang, Jacob Buckman, James Betker, Jay Hack, Josh Singer, Julian Michael, Katja Grace, Karthik Narasimhan, Logan Graham, Matt Figdore, Mike Choi, Nathan Lambert, Nicholas Carlini, Nitish Kulkarni, Pengfei Liu, Rick Barber, Robert Nishihara, Robert Wachen, Rohit Krishnan, Ron Bhattacharyay, Ross Taylor, Shannon Sands, Spencer Greenberg, Steve Newman, Tamay Besiroglu, Teknium, Teortaxes, Tim Shi, Tim Wee, Tyler Cowen, and Xeophon.

Full quotes and conclusions on X: @chrisbarber.