CentralGauge

Benchmark for LLMs on Microsoft Dynamics 365 Business Central AL code.

8 models

Pass rate (%):

  1. Claude Opus 4.6 (claude): 79.7
  2. Claude Opus 4.7 (claude): 78.1
  3. Claude Sonnet 4.6 (claude): 78.1
  4. GPT-5.5 (gpt): 78.1
  5. xAI: Grok 4.3 (grok): 75.0
  6. OpenAI: GPT-5.4 (gpt): 71.9
  7. Claude Haiku 4.5 (claude): 59.4
  8. DeepSeek V4 Pro (deepseek): 50.0
Leaderboard

#

Rank, ordered by Score (descending).

Score

Average score per attempt row, on a 0–1 scale (shown as a percentage in the table). Rewards partial credit.

Formula: Mean of all attempt scores across all results rows: SUM(score) / COUNT(*) over the results table.

Ranks models that make consistent partial progress on hard tasks. A model that scores 0.5 on every task beats one that passes half and fails the rest on this metric.
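
A minimal sketch of the Score computation, using a hypothetical list of result rows in place of the real results table (the actual schema isn't shown here):

```python
# Hypothetical results rows: one per (task, attempt), score in [0, 1].
results = [
    {"task": "t1", "attempt": 1, "score": 1.0},
    {"task": "t2", "attempt": 1, "score": 0.5},
    {"task": "t2", "attempt": 2, "score": 0.5},
    {"task": "t3", "attempt": 1, "score": 0.0},
]

# Score = SUM(score) / COUNT(*) over all attempt rows.
score = sum(r["score"] for r in results) / len(results)
# → 2.0 / 4 = 0.5
```

Note that every attempt row counts once, so a task retried twice contributes two rows to the denominator.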

Pass Rate

Fraction of distinct tasks solved in any attempt across all runs.

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / tasks_attempted_distinct

Primary ranking metric. Compare models here first: it directly measures how often the model delivers working code.
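
The two-attempt formula above can be sketched as follows, again with hypothetical attempt rows (field names are assumptions, not the real schema):

```python
# Hypothetical attempt rows: pass/fail per task and attempt number.
attempts = [
    {"task": "t1", "attempt": 1, "passed": True},
    {"task": "t2", "attempt": 1, "passed": False},
    {"task": "t2", "attempt": 2, "passed": True},   # passed on attempt 2 only
    {"task": "t3", "attempt": 1, "passed": False},
    {"task": "t3", "attempt": 2, "passed": False},
]

tasks_attempted_distinct = {a["task"] for a in attempts}
passed_attempt_1 = {a["task"] for a in attempts
                    if a["attempt"] == 1 and a["passed"]}
passed_attempt_2_only = {a["task"] for a in attempts
                         if a["attempt"] == 2 and a["passed"]} - passed_attempt_1

pass_rate = (len(passed_attempt_1) + len(passed_attempt_2_only)) \
    / len(tasks_attempted_distinct)
# 2 of 3 distinct tasks passed → 0.667
```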

CI

Pass Rate 95% CI

95% Wilson confidence interval on the pass rate.

Formula: Wilson score interval: center ± half-width, where n = tasks_attempted_distinct.

Use to judge whether a lead over another model is statistically meaningful. Wide CIs indicate too few tasks to draw firm conclusions.
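
The Wilson interval can be sketched with the standard formula; the table's 51/64 row serves as a sanity check:

```python
import math

def wilson_interval(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate of `passed` out of `n` tasks."""
    if n == 0:
        return (0.0, 0.0)
    p = passed / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

low, high = wilson_interval(51, 64)
# half-width ≈ 0.097, matching the ±9.7% shown for the 51/64 row
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for pass rates near 0 or 1.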

Cost / run

Average total LLM cost per benchmark run in USD.

Formula: SUM(cost_usd) / run_count across all runs for this model.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$/Pass

Total cost divided by number of distinct tasks passed. Lower is better.

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.
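
Both cost metrics can be sketched together; the run totals below are hypothetical:

```python
# Hypothetical per-run totals for one model.
runs = [
    {"cost_usd": 0.17, "passed_tasks": {"t1", "t2"}},
    {"cost_usd": 0.15, "passed_tasks": {"t2", "t3"}},
]

total_cost = sum(r["cost_usd"] for r in runs)
cost_per_run = total_cost / len(runs)                       # Cost / run

# $/Pass divides by DISTINCT tasks passed, so t2 counts once.
tasks_passed_distinct = set().union(*(r["passed_tasks"] for r in runs))
dollars_per_pass = total_cost / len(tasks_passed_distinct)  # $/Pass
```

Because the denominator deduplicates tasks across runs, rerunning the benchmark raises cost without raising the pass count, which pushes $/Pass up for models that only pass the same easy tasks each time.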

Latency p95

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, relevant for automated pipelines with timeouts.
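
A sketch of the p95 computation using the nearest-rank convention (the site's exact percentile/interpolation method isn't specified):

```python
import math

def p95(durations_ms: list[float]) -> float:
    """95th-percentile duration via the nearest-rank method."""
    s = sorted(durations_ms)
    k = math.ceil(0.95 * len(s)) - 1  # 0-based index of the 95th-percentile rank
    return s[max(k, 0)]

# With 100 tasks taking 1..100 ms, p95 is the 95th-smallest value: 95 ms.
```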

| # | Model | Model ID | Score | Pass Rate | CI | Cost / run | $/Pass | Latency p95 | Updated |
|--:|-------|----------|------:|----------:|---:|-----------:|-------:|------------:|---------|
| 1 | Claude Opus 4.6 (claude) | claude-opus-4-6 | 70.95 | 51/64 | ±9.7% | $0.17 | $0.2098 | 187.4s | 17h ago |
| 2 | Claude Opus 4.7 (claude) | claude-opus-4-7 | 70.30 | 50/64 | ±10.0% | $0.16 | $0.2001 | 183.5s | 17h ago |
| 3 | Claude Sonnet 4.6 (claude) | claude-sonnet-4-6 | 64.97 | 50/64 | ±10.0% | $0.11 | $0.1399 | 185.6s | 17h ago |
| 4 | OpenAI: GPT-5.4 (gpt) | | 60.86 | 46/64 | ±10.8% | <$0.001 | $0.0001 | 161.0s | 3h ago |
| 5 | GPT-5.5 (gpt) | gpt-5.5 | 59.78 | 50/64 | ±10.0% | $0.44 | $0.5613 | 241.4s | 17h ago |
| 6 | xAI: Grok 4.3 (grok) | x-ai/grok-4.3 | 59.13 | 48/64 | ±10.4% | $0.03 | $0.0435 | 207.7s | 3h ago |
| 7 | Claude Haiku 4.5 (claude) | claude-haiku-4-5-20251001 | 44.37 | 38/64 | ±11.7% | <$0.001 | $0.0000 | 150.9s | 3h ago |
| 8 | DeepSeek V4 Pro (deepseek) | deepseek/deepseek-v4-pro | 26.91 | 32/64 | ±11.9% | $0.02 | $0.0337 | 369.6s | 3h ago |
