Score
Average score per attempt row (0–1). Awards partial credit.
Formula: Mean of all attempt scores across all results rows: SUM(score) / COUNT(*) over the results table.
Rewards consistent partial progress on hard tasks. Note that a model scoring 0.5 on every task ties on this metric with one that passes half the tasks outright and fails the rest.
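A minimal sketch of the computation in Python, assuming each results row carries a `score` field (the row shape here is illustrative, not the actual schema):

```python
# Mean score over all attempt rows: SUM(score) / COUNT(*).
results = [{"score": 1.0}, {"score": 0.5}, {"score": 0.0}]  # illustrative rows

mean_score = sum(r["score"] for r in results) / len(results)
print(round(mean_score, 2))  # 0.5
```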
Cost / run
Average total LLM cost per benchmark run in USD.
Formula: SUM(cost_usd) / run_count across all runs for this model.
Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.
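A sketch under the same illustrative-schema assumption, with `cost_usd` per run as the only field that matters here:

```python
# Average cost per benchmark run: SUM(cost_usd) / run_count.
runs = [{"cost_usd": 1.20}, {"cost_usd": 0.80}]  # illustrative run rows

cost_per_run = sum(r["cost_usd"] for r in runs) / len(runs)
print(f"${cost_per_run:.2f}/run")  # $1.00/run
```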
Latency p50
Median per-task wall time (LLM call + compile + test), in milliseconds.
Formula: 50th percentile of per-task duration_ms: LLM latency + compile time + test time.
Use p50 for a typical-case latency expectation. Unaffected by outlier slow tasks.
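A minimal sketch, assuming per-task durations arrive as a flat list of `duration_ms` values; note how the single slow task leaves the median untouched:

```python
# p50 of per-task duration_ms (LLM latency + compile time + test time).
from statistics import median

durations_ms = [850, 1200, 950, 30000, 1100]  # illustrative; one slow outlier

print(median(durations_ms))  # 1100 -- the 30 s outlier does not move the median
```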
Pass Rate
Fraction of distinct tasks solved in any attempt across all runs.
Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / tasks_attempted_distinct
Primary ranking metric. Compare models here first: it directly measures how often the model delivers working code.
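A sketch of the distinct-task counting, with `task_id` and `passed` as assumed field names (the real schema tracks attempts 1 and 2 separately, per the formula above):

```python
# Pass rate: distinct tasks passed in any attempt / distinct tasks attempted.
results = [
    {"task_id": "t1", "passed": True},
    {"task_id": "t1", "passed": False},  # a later attempt fails; t1 still counts
    {"task_id": "t2", "passed": False},
]

attempted = {r["task_id"] for r in results}
passed = {r["task_id"] for r in results if r["passed"]}
print(len(passed) / len(attempted))  # 0.5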
pass^n (strict)
Fraction of tasks the model solved in every single run (strict consistency).
Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct
Measures reliability under repetition. High pass^n means the model is unlikely to regress on a re-run, which matters for CI and production use.
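A sketch of the strict-consistency check, assuming a hypothetical `runs_by_task` mapping from each task to its per-run pass/fail outcomes:

```python
# pass^n: tasks where ALL runs passed / distinct tasks attempted.
runs_by_task = {
    "t1": [True, True, True],   # passes on every run
    "t2": [True, False, True],  # flaky: one failing run disqualifies it
}

strict_passes = sum(all(outcomes) for outcomes in runs_by_task.values())
print(strict_passes / len(runs_by_task))  # 0.5
```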
$/Pass
Total cost divided by number of distinct tasks passed. Lower is better.
Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.
Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.
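The arithmetic is a single division; the values below are illustrative:

```python
# $/Pass: total cost across all runs / distinct tasks passed. Lower is better.
total_cost_usd = 3.00        # SUM(cost_usd) over all runs
tasks_passed_distinct = 12

print(f"${total_cost_usd / tasks_passed_distinct:.3f}/pass")  # $0.250/pass
```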
Latency p95
95th-percentile per-task wall time. Captures tail latency.
Formula: 95th percentile of per-task duration_ms across all tasks in all runs.
Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, which matters for automated pipelines with timeouts.
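A sketch using the nearest-rank method, one common percentile definition; the page does not specify which interpolation the dashboard uses:

```python
# p95 of per-task duration_ms via nearest rank: the smallest value such that
# at least 95% of observations are <= it.
import math

durations_ms = sorted([850, 950, 1100, 1200, 30000])  # illustrative
rank = math.ceil(0.95 * len(durations_ms))            # 1-based nearest-rank index
print(durations_ms[rank - 1])  # 30000 -- the tail is dominated by the outlier
```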
Overview
GPT-5 has not been run yet: 0 runs, 0 tasks attempted, average score 0.00.
Settings
Generation parameters used across this model's runs. "varies" indicates the value differed between runs.
- Temperature: varies
- Thinking budget: varies
- Avg tokens / run: —
- Consistency: —
History
No run history yet.
Cost
No cost data yet.
Shortcomings
AL concepts GPT-5 struggles with. Click a row for description, correct pattern, and observed error codes.
No shortcomings analyzed yet.
Shortcomings analysis is on the roadmap. The first analyzer run is scheduled for the P8 release; until then, this section reflects no data.
See methodology.
Recent runs
| Started | Model | Tasks | Score | Cost | Duration | Status |
|---|---|---|---|---|---|---|
Methodology
Scores are computed per task, averaged across attempts. See the about page for details.