
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

By C. Anderson et al.

I came across this fascinating paper that introduces new benchmarks derived from NPR's Sunday Puzzle challenges. The authors make a compelling argument that PhD-level benchmarks are often too specialized for non-experts to grasp. Instead, they've created roughly 600 puzzles that are challenging yet easy to verify, and they evaluate a range of reasoning models on them.
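To make the "easy to verify" point concrete, here is a minimal sketch of what scoring such a benchmark could look like: the answers are short strings, so grading reduces to a normalized exact match with no expert judging required. The helper names and the scoring rule are my own illustration, not the authors' actual harness.

    # Hypothetical scorer for a short-answer puzzle benchmark.
    # Illustrative only; not the paper's evaluation code.

    def normalize(text: str) -> str:
        """Lowercase and drop non-alphanumerics so formatting
        differences don't count as wrong answers."""
        return "".join(ch for ch in text.lower() if ch.isalnum())

    def score(predictions: list[str], answers: list[str]) -> float:
        """Fraction of predictions that match the reference answer."""
        correct = sum(
            normalize(p) == normalize(a)
            for p, a in zip(predictions, answers)
        )
        return correct / len(answers)

    print(score(["MAINE ", "cardigan"], ["Maine", "sweater"]))  # -> 0.5

This is exactly why the benchmark scales: verification is a one-line string comparison, even though solving the puzzle is hard.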

Key Takeaways

  • OpenAI’s o1 model significantly outperformed others, achieving 59% accuracy
  • DeepSeek R1 and Gemini Thinking lagged well behind and exhibited distinctive reasoning failures
  • The study identifies common failure modes, including models explicitly giving up on problems or committing to incorrect answers without justification (see the sketch after this list)
  • The study also quantifies how accuracy varies with reasoning length, pinpointing where longer reasoning stops improving results
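Because the final answers are verifiable, the failure modes themselves can be bucketed mechanically from transcripts. Here is a rough sketch of that idea, using a naive string heuristic that I am assuming for illustration; the paper's own analysis is more careful.

    # Hypothetical failure-mode bucketing for one model transcript.
    # Categories mirror the failures noted above; detection is naive.

    def classify(transcript: str, final_answer: str, gold: str) -> str:
        if final_answer.strip().lower() == gold.strip().lower():
            return "correct"
        if "i give up" in transcript.lower():
            return "gave_up"    # concedes, yet often still emits an answer
        return "incorrect"      # wrong answer with no justification

    print(classify("...I give up. The answer is X.", "X", "Y"))  # gave_up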

Notable Quotes

“We focus on evaluating the latest generation of models that use test-time compute to reason before producing a final answer”

— C. Anderson et al.

Context & Analysis

I think accessibility to non-experts is the wrong metric to chase when assessing LLMs. Because LLMs can memorize enormous amounts of data, an evaluation really needs to account for the complexity of the thinking process, not just the outcome. That's what PhD-level reasoning is about: not being correct per se, but the process that leads to the result.

Tags: #reasoning #AI #LLM