METR Cross-Domain Time Horizon Tracker

The frontier is uneven

AI autonomy has expanded fast, but not uniformly. Some domains already support long autonomous runs; others still collapse quickly when task duration rises.

Short tasks are mostly solved

For short tasks, success rates are high across domains. The first breakpoints appear only once tasks require sustained, multi-minute work.

Then performance decays sharply

The drop-off is nonlinear. A small increase in task length can produce a large success-rate loss, especially in cybersecurity and reasoning.
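One way to picture this nonlinearity is a logistic curve in the logarithm of task length. The sketch below assumes that shape. The function name, the `slope` parameter, and the 30-minute horizon are illustrative placeholders, not values taken from the tracker.

```python
import math


def success_probability(task_minutes: float, horizon_minutes: float, slope: float = 1.5) -> float:
    """Assumed logistic model: success is 50% when task length equals the horizon."""
    # Distance from the horizon, measured in doublings of task length.
    doublings = math.log2(task_minutes / horizon_minutes)
    return 1.0 / (1.0 + math.exp(slope * doublings))


if __name__ == "__main__":
    horizon = 30.0  # hypothetical 50% horizon, in minutes
    for minutes in (7.5, 15, 30, 60, 120):
        print(f"{minutes:>6} min -> {success_probability(minutes, horizon):.2f}")
```

Under these assumed parameters, doubling a task from 30 to 60 minutes drops the modeled success rate from 50% to roughly 18%. A steeper `slope` would make the drop sharper still.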

Domain matters more than averages

Aggregate model scores hide major differences between domains. Domain-level curves show where autonomy is robust and where it is fragile.

The best models still cluster near a frontier

The top model-domain pairs trace an efficiency frontier: their horizons are longer, but still too short for many real-world workflows.

Model frontier bubble map

Bubble size = number of evaluation points. X = horizon in minutes (log scale); Y = success rate at 60 min.

How success decays with task length

Scroll steps on the left to highlight different views.

Domain autonomy horizons

Lollipop chart of each domain's median 50%-success horizon, with confidence intervals.
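The page does not say how these intervals are computed. A percentile bootstrap is one common choice, sketched below as an assumption. The function name and the horizon values are hypothetical and exist only to make the example runnable.

```python
import random
import statistics


def bootstrap_median_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Return (median, lower, upper) using a percentile bootstrap of the median."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    # Clamp the percentile indices so they always stay inside the list.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return statistics.median(values), medians[lo_idx], medians[hi_idx]


if __name__ == "__main__":
    # Hypothetical per-model 50% horizons (minutes) within one domain.
    horizons = [4, 9, 12, 20, 35, 48, 90]
    point, lower, upper = bootstrap_median_ci(horizons)
    print(f"median={point} min, 95% CI=({lower}, {upper})")
```

Because horizons are plotted on a log scale elsewhere on the page, resampling log-minutes instead of raw minutes would be an equally reasonable variant.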

Forecast explorer: months to longer autonomous tasks

Assumes domain horizons keep doubling at a constant rate. Explore how many months until 1h, 8h, and 1d autonomous capability.

Assumed doubling time (months): 6
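Under a constant doubling time, the months needed to reach a target horizon follow directly from the number of doublings required. The sketch below shows that arithmetic. The 15-minute starting horizon is a hypothetical value, and reading "1d" as 24 hours is an assumption.

```python
import math

# Target horizons in minutes; "1d" is assumed to mean 24 hours.
TARGETS_MINUTES = {"1h": 60, "8h": 480, "1d": 1440}


def months_until(target_minutes: float, current_minutes: float, doubling_months: float) -> float:
    """Months until the horizon reaches the target, given a constant doubling time."""
    if current_minutes <= 0 or doubling_months <= 0:
        raise ValueError("current horizon and doubling time must be positive")
    if target_minutes <= current_minutes:
        return 0.0  # target already reached
    # Number of doublings needed, times the length of each doubling.
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    current = 15.0  # hypothetical current 50% horizon, in minutes
    for label, target in TARGETS_MINUTES.items():
        months = months_until(target, current, doubling_months=6)
        print(f"{label}: {months:.1f} months")
```

With a hypothetical 15-minute horizon and a 6-month doubling time, reaching 1 hour takes two doublings, or 12 months, and reaching 8 hours takes five doublings, or 30 months.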

Model x Domain leaderboard

Heatmap of autonomy horizon (minutes). Higher intensity means longer sustained autonomy.

METR headline model horizons (official feed)

Model-level 50% and 80% horizons from METR's published benchmark summary. This can surface new releases before they appear in per-run `runs.jsonl` data.

Option A: Token efficiency

Lower is better: tokens needed for one successful autonomous hour.
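The page does not define this metric precisely. One plausible reading, sketched below as an assumption, divides all tokens spent (including those on failed runs) by the human-equivalent hours of the tasks that succeeded. The `Run` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    human_minutes: float  # time a human would need for the task
    tokens: int           # tokens the model consumed on this run
    success: bool


def tokens_per_successful_hour(runs: list[Run]) -> float:
    """Total tokens spent divided by human-equivalent hours of successful work."""
    total_tokens = sum(r.tokens for r in runs)  # failed runs still cost tokens
    successful_hours = sum(r.human_minutes for r in runs if r.success) / 60.0
    if successful_hours == 0:
        return float("inf")  # no successful work to amortize the tokens over
    return total_tokens / successful_hours


if __name__ == "__main__":
    # Hypothetical runs, for illustration only.
    runs = [
        Run(human_minutes=30, tokens=200_000, success=True),
        Run(human_minutes=60, tokens=500_000, success=False),
        Run(human_minutes=90, tokens=900_000, success=True),
    ]
    print(tokens_per_successful_hour(runs))
```

Counting the tokens of failed runs penalizes models that burn budget on attempts that do not finish. Excluding them would be a different, more forgiving metric.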

Option B: Cost vs efficiency

Bubble map with Pareto frontier. Lower-left region is best.
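When lower is better on both axes, the Pareto frontier is the set of points that no other point beats on cost and tokens at once. A minimal sketch follows. The tuple layout and the sample values are hypothetical.

```python
def pareto_frontier(points):
    """Return the non-dominated points from (name, cost, tokens) tuples.

    Lower is better on both axes. A point is dropped when another point
    is at least as good on both axes and strictly better on one.
    """
    frontier = []
    best_tokens = float("inf")
    # Sweep from cheapest to most expensive; ties on cost are ordered by tokens.
    for name, cost, tokens in sorted(points, key=lambda p: (p[1], p[2])):
        if tokens < best_tokens:  # strictly better than every cheaper point
            frontier.append((name, cost, tokens))
            best_tokens = tokens
    return frontier


if __name__ == "__main__":
    # Hypothetical (model, cost, tokens) values, for illustration only.
    sample = [
        ("A", 1.0, 900),
        ("B", 2.0, 500),
        ("C", 2.5, 700),  # dominated by B
        ("D", 4.0, 300),
        ("E", 4.5, 300),  # dominated by D
    ]
    print([name for name, _, _ in pareto_frontier(sample)])
```

Sorting first keeps the sweep at O(n log n), which is more than fast enough for a leaderboard-sized set of models.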

Assumed token split (input/output)
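A token split is typically used to blend separate input and output prices into a single per-token price. The sketch below assumes that purpose. The function name and the prices are hypothetical, not taken from the tracker.

```python
def blended_price_per_million(input_share: float, input_price: float, output_price: float) -> float:
    """Blend input and output prices (per million tokens) by the assumed input share."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    output_share = 1.0 - input_share
    return input_share * input_price + output_share * output_price


if __name__ == "__main__":
    # Hypothetical: 75% input tokens, with made-up prices per million tokens.
    print(blended_price_per_million(0.75, input_price=2.0, output_price=8.0))
```

Multiplying the blended price by a token count in millions gives an estimated dollar cost for the cost axis.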

Records sample

First 150 processed rows for auditability and quick spot checks.

Benchmark	Domain	Model	Release Date	Human Minutes	Score