I came across this fascinating paper that introduces new benchmarks derived from NPR Sunday puzzle challenges. The authors make a compelling argument that PhD-level benchmarks are often too specialized for non-experts to grasp. Instead, they’ve created about 600 puzzles that are both challenging and easy to verify, and tested them across several reasoning models.
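To make the "easy to verify" point concrete, here is a minimal sketch of what scoring such a benchmark could look like. Everything in it is an illustrative assumption on my part (the toy puzzles, the model outputs, the normalization, and the "I give up" check); it is not the paper's actual evaluation harness.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so short answers compare cleanly."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def score(model_output: str, gold_answer: str) -> str:
    """Classify a response as correct, gave_up, or incorrect."""
    if "i give up" in model_output.lower():
        return "gave_up"
    return "correct" if normalize(gold_answer) in normalize(model_output) else "incorrect"

# Toy examples, invented purely for illustration.
benchmark = [
    {"answer": "stone", "output": "The answer is STONE."},
    {"answer": "marble", "output": "I give up, this one is too hard."},
]

verdicts = [score(item["output"], item["answer"]) for item in benchmark]
accuracy = sum(v == "correct" for v in verdicts) / len(verdicts)
print(verdicts, f"accuracy={accuracy:.0%}")
```

Because the answers are short and unambiguous, a simple string comparison like this is enough to grade them, which is exactly what makes the benchmark cheap to verify compared with PhD-level questions.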
Key Takeaways
- OpenAI’s o1 model significantly outperformed others, achieving 59% accuracy
- DeepSeek R1 and Gemini Thinking showed notable reasoning failures and uncertainties
- The study identifies common failure modes, including models giving up on problems or producing incorrect answers without justification
- The study also examines how the length of a model’s reasoning relates to its accuracy
Notable Quotes
“We focus on evaluating the latest generation of models that use test-time compute to reason before producing a final answer”
– C. Anderson et al.
Context & Analysis
I think accessibility to non-experts is not the right metric to chase when assessing LLMs. Because LLMs can memorize enormous amounts of data, an evaluation really needs to examine whether the outcome reflects the complexity of the thinking process. That’s what PhD-level reasoning is all about: it’s not the correctness of the answer that matters, but the process that leads to it.