AGI Gap Atlas

Capability is not one number.

This observatory maps where benchmark scores align with real usefulness and where they diverge across efficiency, calibration, and transfer.

Four visualizations decompose capability into distinct dimensions: progress trajectory, cost efficiency, confidence calibration, and cross-benchmark transfer.

Twin Rivers Timeline

How progress on ARC-AGI and HLE differs by model release date.

Progress on ARC-AGI and HLE does not move in lockstep. Some models advance on ARC-AGI's abstract reasoning tasks while stalling on HLE's broad knowledge questions.

Efficiency Illusion Map

Do higher ARC scores come from smarter models, higher cost, or both?

The highest-scoring models are often not the most efficient. Some lower-cost models reach comparable scores at a fraction of the cost.
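One way to make "efficient" concrete is a cost-versus-score Pareto frontier: a run stays on the frontier only if no other run is both at least as cheap and at least as accurate. The sketch below is an illustration, not the method behind the map. The Run type, the pareto_frontier helper, the model names, and every number are hypothetical placeholders rather than measured results.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    name: str     # hypothetical model label
    score: float  # benchmark score in percent (placeholder value)
    cost: float   # cost per task in USD (placeholder value)


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs that no cheaper-or-equal run matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, higher score first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.score)):
        # Keep a run only if it beats every cheaper run seen so far.
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier


# Illustrative placeholder data, not real benchmark results.
runs = [
    Run("model_a", 30.0, 0.10),
    Run("model_b", 32.0, 2.00),
    Run("model_c", 20.0, 0.50),
    Run("model_d", 45.0, 5.00),
]
frontier = pareto_frontier(runs)
frontier_names = [run.name for run in frontier]
```

In this toy data, model_c drops off the frontier because model_a is both cheaper and more accurate. model_b stays on it even though it pays roughly twenty times the cost for two extra points, which is the kind of trade-off the map is meant to surface.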

Confidence vs Competence Lens

A model is safer when it is both accurate and well calibrated.

A model that scores 40% but knows when it's wrong is more useful than one that scores 45% but is confidently incorrect.
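A common way to quantify "knows when it's wrong" is expected calibration error (ECE): group answers by stated confidence, then average the gap between confidence and accuracy in each group, weighted by group size. The sketch below assumes each answer carries a confidence in [0, 1]. The function name, the bin count, and both toy models are hypothetical, and this is not necessarily the metric the lens itself uses.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Size-weighted mean of |average confidence - accuracy| over confidence bins."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 belongs in the last bin, not one past the end.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Toy model 1 (placeholder data): 40% accurate, and its confidence tracks its hit rate.
calibrated_conf = [0.7] * 10 + [0.1] * 10
calibrated_hits = [True] * 7 + [False] * 3 + [True] * 1 + [False] * 9

# Toy model 2 (placeholder data): 45% accurate, but always 95% confident.
overconfident_conf = [0.95] * 20
overconfident_hits = [True] * 9 + [False] * 11

ece_calibrated = expected_calibration_error(calibrated_conf, calibrated_hits)
ece_overconfident = expected_calibration_error(overconfident_conf, overconfident_hits)
```

Under these made-up inputs the 40% model has an ECE near zero, while the 45% model's ECE is about 0.5, because it claims 95% confidence and is right 45% of the time.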

Transfer Gap Matrix

How much does ARC-AGI score differ from HLE score for the same model?
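A minimal sketch of the underlying calculation, assuming both scores are percentages on a comparable 0 to 100 scale: subtract each model's HLE score from its ARC-AGI score and rank models by the size of the gap. The transfer_gaps helper, the model names, and the scores are hypothetical placeholders, not real results.

```python
from typing import Dict, List, Tuple


def transfer_gaps(scores: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Map name -> (arc_score, hle_score) to (name, arc - hle), largest absolute gap first."""
    gaps = [(name, arc - hle) for name, (arc, hle) in scores.items()]
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)


# Illustrative placeholder scores in percent, not real benchmark results.
scores = {
    "model_a": (30.0, 10.0),  # stronger on ARC-AGI
    "model_b": (15.0, 25.0),  # stronger on HLE
    "model_c": (40.0, 38.0),  # roughly balanced
}
gaps = transfer_gaps(scores)
```

A positive gap means the model does better on ARC-AGI than on HLE, and a negative gap means the reverse. Raw differences are only meaningful if the two benchmarks are scored on comparable scales. If they are not, comparing each model's rank on the two benchmarks may be a safer alternative.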

Capability is not one number.

Looking at progress, efficiency, calibration, and transfer together reveals where benchmark gains are durable and where they are brittle.