CentralGauge
Benchmark for LLMs on Microsoft Dynamics 365 Business Central AL code.
Metrics reported in the table:

- Score: Average score per attempt row on a 0–1 scale, with partial credit (the values in the table appear to be this score expressed as a percentage). Ranks models that make consistent partial progress on hard tasks. A model that scores 0.5 on every task beats one that passes half and fails the rest on this metric.
- Pass Rate: Fraction of distinct tasks solved in at least one attempt across all runs. This is the primary ranking metric. Compare models here first: it directly measures how often the model delivers working code.
- Pass Rate 95% CI: 95% Wilson confidence interval on the pass rate. Use it to judge whether a lead over another model is statistically meaningful. Wide intervals indicate too few tasks to draw firm conclusions.
- Cost / run: Average total LLM cost per benchmark run, in USD. Use it to compare operating cost across models with similar pass rates. It does not account for quality, so combine it with $/Pass for a cost-efficiency view.
- $/Pass: Total cost divided by the number of distinct tasks passed. Lower is better. This is the best single cost-efficiency metric: it penalises expensive models that pass few tasks and rewards cheap models with high pass rates.
- Latency p95: 95th-percentile per-task wall time, which captures tail latency. Use it to understand worst-case latency. A low p95 means the model rarely stalls, which matters for automated pipelines with timeouts.

| # | Score | Pass Rate | Pass Rate 95% CI | Cost / run | $/Pass | Latency p95 | | |
|---|---|---|---|---|---|---|---|---|
| 1 | 70.95 | 51/64 | ±9.7% | $0.17 | $0.2098 | 187.4s | 17h ago | |
| 2 | 70.30 | 50/64 | ±10.0% | $0.16 | $0.2001 | 183.5s | 17h ago | |
| 3 | 64.97 | 50/64 | ±10.0% | $0.11 | $0.1399 | 185.6s | 17h ago | |
| 4 | 60.86 | 46/64 | ±10.8% | <$0.001 | $0.0001 | 161.0s | 3h ago | |
| 5 | 59.78 | 50/64 | ±10.0% | $0.44 | $0.5613 | 241.4s | 17h ago | |
| 6 | 59.13 | 48/64 | ±10.4% | $0.03 | $0.0435 | 207.7s | 3h ago | |
| 7 | 44.37 | 38/64 | ±11.7% | <$0.001 | $0.0000 | 150.9s | 3h ago | |
| 8 | 26.91 | 32/64 | ±11.9% | $0.02 | $0.0337 | 369.6s | 3h ago | |
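
The page does not show the formula behind the confidence-interval column, so the following is only a sketch of how such a value could be computed. It assumes the standard Wilson score interval with z = 1.96 and assumes that the ± figure in the table is the half-width of that interval. The function names `wilson_interval` and `half_width_pct` are hypothetical and not taken from CentralGauge.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for passed/total.

    z = 1.96 corresponds to a 95% interval (assumed, not confirmed by the source).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")

    p = passed / total
    denom = 1 + z * z / total
    # The Wilson interval is centred on a value pulled slightly towards 0.5,
    # not on the raw pass rate p.
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return center - half, center + half


def half_width_pct(passed: int, total: int) -> float:
    """Half-width of the Wilson interval, as a percentage rounded to one decimal."""
    lower, upper = wilson_interval(passed, total)
    return round((upper - lower) / 2 * 100, 1)


if __name__ == "__main__":
    print(half_width_pct(51, 64))  # 9.7
    print(half_width_pct(32, 64))  # 11.9
```

Under these assumptions the sketch gives 9.7 for 51/64 and 11.9 for 32/64, in line with the first and last rows of the table. With only 64 tasks the half-width is roughly ten percentage points, so a one-task difference in pass rate between two models falls well inside the interval.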