CentralGauge

Benchmark for LLMs on Microsoft Dynamics 365 Business Central AL code.

8 models

Pass rate (%):

  1. Claude Opus 4.6 (claude): 79.7
  2. Claude Opus 4.7 (claude): 78.1
  3. Claude Sonnet 4.6 (claude): 78.1
  4. GPT-5.5 (gpt): 78.1
  5. xAI: Grok 4.3 (grok): 75.0
  6. OpenAI: GPT-5.4 (gpt): 71.9
  7. Claude Haiku 4.5 (claude): 59.4
  8. DeepSeek V4 Pro (deepseek): 50.0
Leaderboard

#

Rank, ordered by Score (descending).

Score

Average score per attempt row, on a 0–1 scale (shown as a percentage in the table). Rewards partial credit.

Formula: Mean of all attempt scores across all results rows: SUM(score) / COUNT(*) over the results table.

Ranks models that make consistent partial progress on hard tasks. A model that scores 0.5 on every task beats one that passes half and fails the rest on this metric.
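
A minimal sketch of the Score computation, using a hypothetical list of result rows in place of the real results table (the actual schema isn't shown here):

```python
# Hypothetical results rows: one per (task, attempt), score in [0, 1].
results = [
    {"task": "t1", "attempt": 1, "score": 1.0},
    {"task": "t2", "attempt": 1, "score": 0.5},
    {"task": "t2", "attempt": 2, "score": 0.5},
    {"task": "t3", "attempt": 1, "score": 0.0},
]

# Score = SUM(score) / COUNT(*) over all attempt rows.
score = sum(r["score"] for r in results) / len(results)
# → 2.0 / 4 = 0.5
```

Note that every attempt row counts once, so a task retried twice contributes two rows to the denominator.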

Pass Rate

Fraction of distinct tasks solved in any attempt across all runs.

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / tasks_attempted_distinct

Primary ranking metric. Compare models here first: it directly measures how often the model delivers working code.
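
The two-attempt formula above can be sketched as follows, again with hypothetical attempt rows (field names are assumptions, not the real schema):

```python
# Hypothetical attempt rows: pass/fail per task and attempt number.
attempts = [
    {"task": "t1", "attempt": 1, "passed": True},
    {"task": "t2", "attempt": 1, "passed": False},
    {"task": "t2", "attempt": 2, "passed": True},   # passed on attempt 2 only
    {"task": "t3", "attempt": 1, "passed": False},
    {"task": "t3", "attempt": 2, "passed": False},
]

tasks_attempted_distinct = {a["task"] for a in attempts}
passed_attempt_1 = {a["task"] for a in attempts
                    if a["attempt"] == 1 and a["passed"]}
passed_attempt_2_only = {a["task"] for a in attempts
                         if a["attempt"] == 2 and a["passed"]} - passed_attempt_1

pass_rate = (len(passed_attempt_1) + len(passed_attempt_2_only)) \
    / len(tasks_attempted_distinct)
# 2 of 3 distinct tasks passed → 0.667
```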

CI

Pass Rate 95% CI

95% Wilson confidence interval on the pass rate.

Formula: Wilson score interval: center ± half-width, where n = tasks_attempted_distinct.

Use to judge whether a lead over another model is statistically meaningful. Wide CIs indicate too few tasks to draw firm conclusions.
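
The Wilson interval can be sketched with the standard formula; the table's 51/64 row serves as a sanity check:

```python
import math

def wilson_interval(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate of `passed` out of `n` tasks."""
    if n == 0:
        return (0.0, 0.0)
    p = passed / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

low, high = wilson_interval(51, 64)
# half-width ≈ 0.097, matching the ±9.7% shown for the 51/64 row
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for pass rates near 0 or 1.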

Cost / run

Average total LLM cost per benchmark run in USD.

Formula: SUM(cost_usd) / run_count across all runs for this model.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$/Pass

Total cost divided by number of distinct tasks passed. Lower is better.

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.
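
Both cost metrics can be sketched together; the run totals below are hypothetical:

```python
# Hypothetical per-run totals for one model.
runs = [
    {"cost_usd": 0.17, "passed_tasks": {"t1", "t2"}},
    {"cost_usd": 0.15, "passed_tasks": {"t2", "t3"}},
]

total_cost = sum(r["cost_usd"] for r in runs)
cost_per_run = total_cost / len(runs)                       # Cost / run

# $/Pass divides by DISTINCT tasks passed, so t2 counts once.
tasks_passed_distinct = set().union(*(r["passed_tasks"] for r in runs))
dollars_per_pass = total_cost / len(tasks_passed_distinct)  # $/Pass
```

Because the denominator deduplicates tasks across runs, rerunning the benchmark raises cost without raising the pass count, which pushes $/Pass up for models that only pass the same easy tasks each time.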

Latency p95

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, relevant for automated pipelines with timeouts.
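
A sketch of the p95 computation using the nearest-rank convention (the site's exact percentile/interpolation method isn't specified):

```python
import math

def p95(durations_ms: list[float]) -> float:
    """95th-percentile duration via the nearest-rank method."""
    s = sorted(durations_ms)
    k = math.ceil(0.95 * len(s)) - 1  # 0-based index of the 95th-percentile rank
    return s[max(k, 0)]

# With 100 tasks taking 1..100 ms, p95 is the 95th-smallest value: 95 ms.
```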

| # | Model | Model ID | Score | Pass Rate | CI | Cost / run | $/Pass | Latency p95 | Updated |
|--:|-------|----------|------:|----------:|---:|-----------:|-------:|------------:|---------|
| 1 | Claude Opus 4.6 (claude) | claude-opus-4-6 | 70.95 | 51/64 | ±9.7% | $0.17 | $0.2098 | 187.4s | 17h ago |
| 2 | Claude Opus 4.7 (claude) | claude-opus-4-7 | 70.30 | 50/64 | ±10.0% | $0.16 | $0.2001 | 183.5s | 17h ago |
| 3 | Claude Sonnet 4.6 (claude) | claude-sonnet-4-6 | 64.97 | 50/64 | ±10.0% | $0.11 | $0.1399 | 185.6s | 17h ago |
| 4 | OpenAI: GPT-5.4 (gpt) | | 60.86 | 46/64 | ±10.8% | <$0.001 | $0.0001 | 161.0s | 3h ago |
| 5 | GPT-5.5 (gpt) | gpt-5.5 | 59.78 | 50/64 | ±10.0% | $0.44 | $0.5613 | 241.4s | 17h ago |
| 6 | xAI: Grok 4.3 (grok) | x-ai/grok-4.3 | 59.13 | 48/64 | ±10.4% | $0.03 | $0.0435 | 207.7s | 3h ago |
| 7 | Claude Haiku 4.5 (claude) | claude-haiku-4-5-20251001 | 44.37 | 38/64 | ±11.7% | <$0.001 | $0.0000 | 150.9s | 3h ago |
| 8 | DeepSeek V4 Pro (deepseek) | deepseek/deepseek-v4-pro | 26.91 | 32/64 | ±11.9% | $0.02 | $0.0337 | 369.6s | 3h ago |
