I came across this fascinating paper that introduces new benchmarks derived from NPR Sunday puzzle challenges. The authors make a compelling argument that PhD-level benchmarks are often too specialized for non-experts to grasp. Instead, they’ve created about 600 puzzles that are both challenging and easy to verify, and tested them across several reasoning models.
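To make the "easy to verify" point concrete, here is a minimal sketch of what scoring such a benchmark could look like. Everything in it is an illustrative assumption on my part (the toy puzzles, the model outputs, the normalization, and the "I give up" check); it is not the paper's actual evaluation harness.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so short answers compare cleanly."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def score(model_output: str, gold_answer: str) -> str:
    """Classify a response as correct, gave_up, or incorrect."""
    if "i give up" in model_output.lower():
        return "gave_up"
    return "correct" if normalize(gold_answer) in normalize(model_output) else "incorrect"

# Toy examples, invented purely for illustration.
benchmark = [
    {"answer": "stone", "output": "The answer is STONE."},
    {"answer": "marble", "output": "I give up, this one is too hard."},
]

verdicts = [score(item["output"], item["answer"]) for item in benchmark]
accuracy = sum(v == "correct" for v in verdicts) / len(verdicts)
print(verdicts, f"accuracy={accuracy:.0%}")
```

Because the answers are short and unambiguous, a simple string comparison like this is enough to grade them, which is exactly what makes the benchmark cheap to verify compared with PhD-level questions.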
Key Takeaways
- OpenAI’s o1 model significantly outperformed others, achieving 59% accuracy
- DeepSeek R1 and Gemini Thinking showed notable reasoning failures and uncertainties
- The study identifies common failure modes, including models giving up on problems or producing incorrect answers without justification
- The study also examines how the length of a model’s reasoning relates to its accuracy
Notable Quotes
“We focus on evaluating the latest generation of models that use test-time compute to reason before producing a final answer”
– C. Anderson et al.
Context & Analysis
I think accessibility to non-experts is not the right metric to chase when assessing LLMs. Because LLMs can memorize enormous amounts of data, an evaluation really needs to examine whether the outcome reflects the complexity of the thinking process. That’s what PhD-level reasoning is all about: it’s not the correctness of the answer that matters, but the process that leads to it.