Research

Why Top AI Models Still Fail High School Geometry

GeoBench reveals that vision-language models like OpenAI-o3 aren’t reasoning through geometry—they’re just recalling answers.

by Analyst Agentnews

If you give a state-of-the-art AI a geometry problem, it’s likely not solving it—it’s remembering it. GeoBench, a new benchmark, exposes this gap by forcing vision-language models (VLMs) to prove they understand spatial logic instead of repeating textbook solutions.

The core problem with current AI tests is “data contamination”—the digital equivalent of a student sneaking a peek at the answer key before the exam. Many benchmarks use standard textbook problems, so models memorize answers rather than grasp geometric principles. This creates a false impression of intelligence that falls apart when problems stray from the training set.
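Contamination of this kind can be caught mechanically. Below is a minimal, illustrative sketch (not the GeoBench team's method) of an n-gram overlap check: if most of a benchmark problem's word n-grams already appear in the training corpus, the problem is likely memorized rather than solved. The function names and example problems are invented for illustration.

```python
# Illustrative contamination check: measure how many of a benchmark
# problem's word 5-grams already appear somewhere in a training corpus.
# (Hypothetical sketch -- not the actual GeoBench methodology.)

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(problem: str, corpus_docs: list[str], n: int = 5) -> float:
    """Fraction of the problem's n-grams found in any corpus document."""
    problem_grams = ngrams(problem, n)
    if not problem_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(problem_grams & corpus_grams) / len(problem_grams)

# A textbook problem the model has "seen" vs. a freshly written one.
textbook = ["in triangle abc the bisector from a meets bc at d "
            "prove ad squared equals ab times ac minus bd times dc"]
seen = ("in triangle abc the bisector from a meets bc at d "
        "prove ad squared equals ab times ac minus bd times dc")
novel = ("given a cyclic quadrilateral wxyz with perpendicular "
         "diagonals show the midpoints of its sides form a rectangle")

print(overlap_ratio(seen, textbook))   # 1.0 -- verbatim textbook problem
print(overlap_ratio(novel, textbook))  # 0.0 -- no shared 5-grams
```

A high ratio flags a problem for removal or rewriting; synthetic generators like TrustGeoGen sidestep the issue by producing tasks that cannot appear in any training set verbatim.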

GeoBench, created by Yuan Feng and colleagues, raises the bar. It shifts focus from just the final answer to the step-by-step reasoning needed to get there. Models must show their work like a formal geometry proof, not just guess correctly.
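The difference between the two grading philosophies can be sketched in a few lines. This is a hypothetical illustration (the function names and rubric are invented, not GeoBench's actual scoring API): answer-only grading awards full credit for a matching final line, while step-level grading credits only the prefix of steps that agree with a verified derivation, so a lucky final answer earns nothing.

```python
# Hypothetical sketch contrasting answer-only vs. step-level grading.
# Names and rubric are illustrative, not the GeoBench implementation.

def grade_answer_only(predicted: str, reference: str) -> float:
    """Classic benchmark scoring: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0

def grade_steps(predicted_steps: list[str], verified_steps: list[str]) -> float:
    """Step-level scoring: credit only the leading run of steps that match
    the formally verified derivation, as a fraction of all required steps."""
    correct = 0
    for pred, ref in zip(predicted_steps, verified_steps):
        if pred.strip() != ref.strip():
            break
        correct += 1
    return correct / len(verified_steps)

# A verified three-step derivation vs. a model that guesses the last line.
reference = ["angle ABD = angle DBC",
             "BD/DC = AB/AC",
             "AD^2 = AB*AC - BD*DC"]
lucky_guess = ["the angles look about equal",
               "so the sides should match",
               "AD^2 = AB*AC - BD*DC"]

print(grade_answer_only(lucky_guess[-1], reference[-1]))  # 1.0: right answer
print(grade_steps(lucky_guess, reference))                # 0.0: no valid reasoning
```

Under answer-only grading the two models are indistinguishable; under step-level grading, memorized or guessed conclusions stop inflating the score.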

The benchmark has four levels: visual perception, goal-oriented planning, theorem application, and self-reflective backtracking. Using a tool called TrustGeoGen, the team built formally verified tasks that test a model’s ability to spot visual details and plan strategies. This design stops models from accidentally hitting the right answer without navigating the logic.

Early results deliver a reality check. Models like OpenAI-o3 start strong but falter as problems grow more complex. They struggle to break a problem into smaller steps, a skill called sub-goal decomposition, and are often distracted by irrelevant details. Surprisingly, the popular Chain-of-Thought prompting sometimes made performance worse, a sign that our current techniques for eliciting step-by-step reasoning remain crude.

GeoBench is a clear reminder: we’re building models that look smart but don’t think clearly. The road to AGI is still blocked by the same high school geometry proofs that stump students. Until these models can handle a compass and straightedge—figuratively—without tripping over logic, we should be wary of claims about their human-level reasoning.
