The frontier is uneven
AI autonomy has expanded fast, but not uniformly. Some domains already support long autonomous runs; others still collapse quickly when task duration rises.
Autonomy frontier tracker
Weekly-updated view of METR benchmark data. Scroll to follow the story, then drill into model efficiency and cost tradeoffs.
At low task duration, success rates are high across domains. The first breakpoints appear only when tasks move into sustained multi-minute work.
The drop-off is nonlinear. A small increase in task length can produce a large success-rate loss, especially in cybersecurity and reasoning.
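That drop-off shape can be illustrated with a logistic curve over log task length, a common way to model success-versus-duration data. This is a sketch with assumed parameters, not METR's fitted model:

```python
import math

def success_rate(task_minutes, horizon_minutes, slope=2.0):
    """Illustrative logistic curve: 50% success at the horizon,
    falling steeply as log task length rises.
    `slope` and the functional form are assumptions for illustration."""
    x = math.log2(task_minutes / horizon_minutes)
    return 1.0 / (1.0 + math.exp(slope * x))

# With an assumed 15-minute horizon, doubling the task from 15 to 30
# minutes cuts success from 50% to well under 15% at this slope.
print(success_rate(15, 15), success_rate(30, 15))
```

The steepness depends entirely on the assumed slope; the point is only that equal multiplicative increases in duration can produce very unequal success-rate losses.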
Aggregate model scores hide major differences. Domain-level curves reveal where autonomy is robust and where it is fragile.
Top model-domain pairs form an efficiency frontier: longer horizons, but still bounded in many real-world workflows.
Bubble size = number of evaluation points. X = horizon in minutes (log scale); Y = success rate at 60 minutes.
Scroll steps on the left to highlight different views.
Lollipop view with confidence intervals for median 50% success horizon.
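One generic way to attach a confidence interval to a median horizon is the percentile bootstrap. This is a sketch of that technique, not the tracker's actual CI method, which is unspecified here:

```python
import random
import statistics

def median_ci(horizons, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the median 50%-success horizon.
    Resamples the horizon list with replacement, takes each resample's
    median, and reads off the alpha/2 and 1 - alpha/2 percentiles."""
    rng = random.Random(seed)
    meds = sorted(
        statistics.median(rng.choices(horizons, k=len(horizons)))
        for _ in range(n_boot)
    )
    lo = meds[int(alpha / 2 * n_boot)]
    hi = meds[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-run horizons (minutes) for one model-domain pair:
print(median_ci([5, 8, 12, 20, 30, 45, 60]))
```

With few evaluation points per domain, these intervals are wide, which is why the lollipop view shows them explicitly.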
Assumes domain horizons keep doubling at a constant rate. Explore how many months until 1h, 8h, and 1d autonomous capability.
Doubling time: 6 months
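Under the constant-doubling assumption, the projection reduces to one formula: months to a target horizon equals the doubling time multiplied by log2(target / current). A minimal sketch, where the current horizon and doubling time are hypothetical inputs:

```python
import math

def months_until(target_min, current_min, doubling_months):
    """Months until the horizon reaches `target_min`, assuming it
    doubles every `doubling_months` months from `current_min`."""
    return doubling_months * math.log2(target_min / current_min)

# Hypothetical: a 30-minute horizon today, doubling every 6 months.
for target in (60, 8 * 60, 24 * 60):  # 1h, 8h, 1d in minutes
    print(target, round(months_until(target, 30, 6), 1))
```

Because the relationship is logarithmic, each further doubling of the target adds a fixed number of months, not a fixed multiple.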
Heatmap of autonomy horizon (minutes). Higher intensity means longer sustained autonomy.
Model-level 50% and 80% horizons from METR's published benchmark summary. This can surface new releases before they appear in per-run `runs.jsonl` data.
Lower is better: tokens needed for one successful autonomous hour.
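One plausible way to compute such a metric is to divide tokens spent per attempt by the expected successful hours delivered. The names and the exact formula below are assumptions for illustration, not the tracker's code:

```python
def tokens_per_successful_hour(tokens_per_run, success_rate, run_hours):
    """Expected tokens spent per successful autonomous hour delivered.
    Failed runs still burn tokens, so a low success rate inflates
    the effective cost."""
    expected_successful_hours = success_rate * run_hours
    return tokens_per_run / expected_successful_hours

# e.g. 2M tokens per 1-hour attempt at 40% success:
print(tokens_per_successful_hour(2_000_000, 0.4, 1.0))  # → 5000000.0
```

Folding the failure rate into the denominator is what makes "lower is better" meaningful: a cheap but unreliable model can still score worse than an expensive, reliable one.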
Bubble map with Pareto frontier. Lower-left region is best.
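Assuming both axes are oriented so lower is better (as "lower-left region is best" suggests), the Pareto frontier is the set of non-dominated points and can be computed with a single sort-and-sweep:

```python
def pareto_frontier(points):
    """Given (x, y) pairs where lower is better on both axes, return
    the points not dominated by any other point.
    Sort by x ascending; a point joins the frontier only if its y is
    strictly below every y seen so far."""
    frontier = []
    for x, y in sorted(points):
        if not frontier or y < frontier[-1][1]:
            frontier.append((x, y))
    return frontier

# Hypothetical (cost, tokens-per-hour) pairs for four models:
print(pareto_frontier([(1, 5), (2, 3), (3, 4), (4, 1)]))
```

In the example, (3, 4) is dominated by (2, 3), cheaper on both axes, so it drops out of the frontier.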
First 150 processed rows for auditability and quick spot checks.
| Benchmark | Domain | Model | Release Date | Human Minutes | Score |
|---|---|---|---|---|---|